[ 
https://issues.apache.org/jira/browse/HUDI-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17461441#comment-17461441
 ] 

sivabalan narayanan commented on HUDI-2909:
-------------------------------------------

Here is what we can do.

The non-row-writer path and the row-writer path produce different values for a
timestamp column with a logical type, and that is what needs to be fixed.

There could be three types of Hudi users:

a. those using the non-row-writer path exclusively.

b. those using the row-writer path exclusively (immutable data).

c. those using a mix of the non-row-writer and row-writer paths.

 

At the very least, we want to ensure that users in (a) and (b) have a way to
continue as is. For users in (c), their data may already be inconsistent, so we
can give them a way to migrate.

 

Having said this, here is the proposal.

Introduce a new config named
"hoodie.generate.consistent.timestamp.logical.for.key.generator" (we can debate
the naming).

When this flag is enabled, both the row-writer and non-row-writer paths should
generate consistent values. If it is not enabled, we fall back to the existing
behavior.
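
To make the proposal concrete, here is a rough sketch of how such a flag could
gate the behavior. It is only an illustration, not Hudi's actual key generator
code: the class and method names are hypothetical, the config name is the one
proposed above, and I am assuming the row-writer path hands the key generator a
java.sql.Timestamp while the non-row-writer path hands it epoch microseconds as
a Long.

```java
import java.sql.Timestamp;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Properties;

// Hypothetical sketch; this is not Hudi's actual key generator code.
final class ConsistentTimestampSketch {

    // Config name as proposed in this comment (naming still open to debate).
    static final String CONSISTENT_TS_FLAG =
        "hoodie.generate.consistent.timestamp.logical.for.key.generator";

    // Mirrors output.dateformat=yyyy/MM/dd and output.timezone=GMT from the issue.
    private static final DateTimeFormatter OUT =
        DateTimeFormatter.ofPattern("yyyy/MM/dd").withZone(ZoneOffset.UTC);

    // The flag defaults to false so existing users keep the current behavior.
    static boolean isConsistent(Properties props) {
        return Boolean.parseBoolean(props.getProperty(CONSISTENT_TS_FLAG, "false"));
    }

    // Assumption: row-writer input is a java.sql.Timestamp, non-row-writer
    // input is epoch microseconds as a Long.
    static String partitionPath(Object value, Properties props) {
        if (!isConsistent(props)) {
            // Placeholder for the existing, path-specific behavior.
            return String.valueOf(value);
        }
        Instant instant;
        if (value instanceof Timestamp) {
            instant = ((Timestamp) value).toInstant();
        } else if (value instanceof Long) {
            long micros = (Long) value;
            instant = Instant.ofEpochSecond(
                Math.floorDiv(micros, 1_000_000L),
                Math.floorMod(micros, 1_000_000L) * 1_000L);
        } else {
            throw new IllegalArgumentException("Unexpected type: "
                + (value == null ? "null" : value.getClass().getName()));
        }
        return OUT.format(instant);
    }
}
```

With the flag set to true, a Timestamp for 2021-12-01 10:13:34.702 GMT and the
equivalent Long in microseconds both map to "2021/12/01" in this sketch. With
the flag unset, each path keeps whatever it produces today.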

So users in (a) and (b) can choose to stay as is without enabling the config.

Users in (c) can recreate their dataset with this new config enabled, since
their existing dataset might have duplicates anyway.

 

The only reason I don't want to enable the config by default is that users in
(a) or (b) would then run into inconsistencies without knowing it.

 

> KeyGenerator is broken in 0.10.0
> --------------------------------
>
>                 Key: HUDI-2909
>                 URL: https://issues.apache.org/jira/browse/HUDI-2909
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: DeltaStreamer
>            Reporter: Harsha Teja Kanna
>            Assignee: Sagar Sumit
>            Priority: Blocker
>              Labels: core-flow-ds, pull-request-available, sev:critical
>             Fix For: 0.11.0
>
>
> Existing table has the timebased keygen config shown below
> hoodie.deltastreamer.keygen.timebased.timestamp.type=SCALAR
> hoodie.deltastreamer.keygen.timebased.output.timezone=GMT
> hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd
> hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit=MICROSECONDS
> hoodie.deltastreamer.keygen.timebased.input.timezone=GMT
> hoodie.datasource.write.partitionpath.field=lastdate:timestamp
> hoodie.datasource.write.operation=upsert
> hoodie.deltastreamer.transformer.sql=SELECT session.id, session.rid, 
> session.mid, to_timestamp(session.lastdate) as lastdate, 
> to_timestamp(session.updatedate) as updatedate FROM <SRC> a
>  
> Upgrading to 0.10.0 from 0.9.0 fails with exception 
> org.apache.hudi.exception.HoodieKeyGeneratorException: Unable to parse input 
> partition field :2021-12-01 10:13:34.702
> Caused by: org.apache.hudi.exception.HoodieNotSupportedException: Unexpected 
> type for partition field: java.sql.Timestamp
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:211)
> at 
> org.apache.hudi.keygen.TimestampBasedAvroKeyGenerator.getPartitionPath(TimestampBasedAvroKeyGenerator.java:133)
> *Workaround fix:*
> Reverting this 
> https://github.com/apache/hudi/pull/3944/files#diff-22fb52b5cf28727ba23cb8bd4be820432a4e396ce663ac472a4677e889b7491eR543
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
