[
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461441#comment-17461441
]
sivabalan narayanan commented on HUDI-2909:
-------------------------------------------
Here is what we can do.
The non-row-writer path and the row writer path produce different values for a
timestamp column with a logical type, and that is what needs to be fixed.
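To make the mismatch concrete, here is a minimal sketch in Java. It assumes that
one path hands the key generator a java.sql.Timestamp while the other hands it
an epoch-based long in microseconds (matching the MICROSECONDS scalar unit in
the quoted config). The class PartitionValueSketch and its methods are
hypothetical names for illustration and are not part of Hudi.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration only; not a Hudi class.
final class PartitionValueSketch {
    // Mirrors output.dateformat=yyyy/MM/dd and output.timezone=GMT from the issue.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

    // Normalize either representation to microseconds since the epoch.
    static long toEpochMicros(Object value) {
        if (value instanceof Timestamp) {
            Instant instant = ((Timestamp) value).toInstant();
            return Math.addExact(
                Math.multiplyExact(instant.getEpochSecond(), 1_000_000L),
                instant.getNano() / 1_000L);
        }
        if (value instanceof Long) {
            // Assumption: a long is already expressed in microseconds.
            return (Long) value;
        }
        throw new IllegalArgumentException(
            "Unexpected type for partition field: " + value.getClass().getName());
    }

    // Format the normalized value so both representations yield the same path.
    static String partitionPath(Object value) {
        long micros = toEpochMicros(value);
        Instant instant = Instant.ofEpochSecond(
            Math.floorDiv(micros, 1_000_000L),
            Math.floorMod(micros, 1_000_000L) * 1_000L);
        return FORMAT.format(instant);
    }
}
```

With a normalization step like this, the same instant maps to the same
partition path regardless of which representation arrives.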
There could be three types of Hudi users:
a. those using the non-row-writer path exclusively.
b. those using the row writer path exclusively (immutable data).
c. those using a mix of the non-row-writer and row writer paths.
At least we want to ensure that users in (a) and (b) can continue as is. For
users in (c), the data may already be inconsistent, so we can give them a way
to migrate.
Having said this, here is the proposal.
Introduce a new config named
"hoodie.generate.consistent.timestamp.logical.for.key.generator" (we can debate
the naming).
When this flag is enabled, both the row writer and non-row-writer paths should
generate consistent values. If it is not enabled, we fall back to the existing
behavior.
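A rough sketch of how such a flag could gate the behavior, in Java. The config
name is the one proposed above and may still change. The class
ConsistentTimestampFlagSketch and its methods are hypothetical, and the default
of "false" reflects the suggestion to keep the existing behavior unless the
flag is set.

```java
import java.util.Properties;

// Hypothetical helper for illustration only; not a Hudi class.
final class ConsistentTimestampFlagSketch {
    // Proposed name from this comment; the final name is still open to debate.
    static final String FLAG =
        "hoodie.generate.consistent.timestamp.logical.for.key.generator";

    // Disabled unless the user sets the flag explicitly to "true".
    static boolean isEnabled(Properties props) {
        return Boolean.parseBoolean(props.getProperty(FLAG, "false"));
    }

    // Pick the consistent code path when enabled, the existing one otherwise.
    static <T> T select(Properties props, T consistent, T legacy) {
        return isEnabled(props) ? consistent : legacy;
    }
}
```

Here select stands in for whatever branching the key generators would actually
do; the point is only that an unset flag leaves current behavior untouched.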
So, (a) and (b) users above can choose to stay as is without enabling the
config.
(c) users can recreate their dataset with this new config enabled, since their
existing dataset might have duplicates anyway.
The only reason I don't want to enable the config by default is that (a) or (b)
users would then run into inconsistencies without knowing it.
> KeyGenerator is broken in 0.10.0
> --------------------------------
>
> Key: HUDI-2909
> URL: https://issues.apache.org/jira/browse/HUDI-2909
> Project: Apache Hudi
> Issue Type: Bug
> Components: DeltaStreamer
> Reporter: Harsha Teja Kanna
> Assignee: Sagar Sumit
> Priority: Blocker
> Labels: core-flow-ds, pull-request-available, sev:critical
> Fix For: 0.11.0
>
>
> The existing table has the time-based keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid,
> session.mid, to_timestamp(session.lastdate) as lastdate,
> to_timestamp(session.updatedate) as updatedate FROM <SRC> a
>
> Upgrading to 0.10.0 from 0.9.0 fails with exception
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected
> type for partition field: java.sql.Timestamp
> at
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>