hbgstc123 opened a new issue, #6924:
URL: https://github.com/apache/hudi/issues/6924

   When appending data to Hudi with Flink with inline async clustering enabled, the `_hoodie_commit_time` field is updated to the commit time of the replacecommit. This change causes data duplication when stream reading from the Hudi table.
   
   Steps to reproduce the behavior:
   1. Stream write data to Hudi.
   2. Enable inline clustering.
   3. Check the `_hoodie_commit_time` field of the data after the replacecommit completes.
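   One way to observe the rewrite (illustrative query; assumes the table from the DDL below is registered in the Flink catalog) is to group rows by the commit-time metadata column before and after clustering:

   ```sql
   -- Before clustering: each row carries the commit time of the delta commit
   -- that wrote it. After the replacecommit completes, the same rows show
   -- the replacecommit's timestamp instead.
   SELECT _hoodie_commit_time, COUNT(*) AS cnt
   FROM target_hudi_table1
   GROUP BY _hoodie_commit_time;
   ```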
   
   * Hudi version: 0.12

   * Flink version: 1.13
   
   Table DDL:
   ```sql
   CREATE TEMPORARY TABLE target_hudi_table1
   (
       imp_date string,
       tag string,
       id bigint,
       name string,
       score double,
       ts timestamp(3)
   ) PARTITIONED BY (imp_date)
   WITH
   (
       -- Hudi settings
       'connector' = 'hudi',
       'path' = 'hdfs://...',
       'write.operation' = 'insert',
       'table.type' = 'COPY_ON_WRITE',
       'hoodie.table.keygenerator.class' = 'org.apache.hudi.keygen.SimpleKeyGenerator',
       'hoodie.datasource.write.recordkey.field' = 'id',
       'write.precombine.field' = 'ts',
   
       'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
       'hoodie.datasource.write.hive_style_partitioning' = 'true',
       
       'hoodie.metadata.enable'='false',
       'clean.retain_commits'='5',
       'clustering.async.enabled'='true',
       'clustering.schedule.enabled'='true',
       'clustering.delta_commits'='3'
   );
   
   insert into target_hudi_table1
   select imp_date, tag, id, name, score, ts
   from data_source;
   ```
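   The duplication surfaces on the consumer side. A minimal streaming-read sketch over the same path (table name and option spellings are assumptions based on the Hudi Flink connector options; verify against the 0.12 docs):

   ```sql
   -- Hypothetical streaming reader; once the replacecommit rewrites
   -- _hoodie_commit_time, rows that were already consumed are emitted again.
   CREATE TEMPORARY TABLE source_hudi_table1
   (
       imp_date string,
       tag string,
       id bigint,
       name string,
       score double,
       ts timestamp(3)
   ) PARTITIONED BY (imp_date)
   WITH
   (
       'connector' = 'hudi',
       'path' = 'hdfs://...',
       'table.type' = 'COPY_ON_WRITE',
       'read.streaming.enabled' = 'true',  -- incremental streaming read
       'read.start-commit' = 'earliest'    -- option name is an assumption
   );

   SELECT * FROM source_hudi_table1;
   ```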
   
   

