zhilinli123 opened a new issue #4881:
URL: https://github.com/apache/hudi/issues/4881
We use Flink CDC to capture MySQL's binlog and send the changes to Kafka for consumption. The full historical data is first imported into HDFS offline in batches (bulk_insert mode), and then a streaming job with index bootstrap enabled consumes the Kafka binlog data incrementally, as sketched below.
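For reference, the two phases look roughly like the following sketch. The source table names (member_full_dump, member_binlog) and the OPTIONS hint are illustrative, not taken from our actual job:

    -- Phase 1: one-off offline import of the full dump in bulk_insert mode
    INSERT INTO member /*+ OPTIONS('write.operation'='bulk_insert') */
    SELECT * FROM member_full_dump;

    -- Phase 2: streaming upserts from the Kafka binlog; with
    -- 'index.bootstrap.enabled'='true' the existing keys from phase 1 are
    -- loaded into state before new records are consumed
    INSERT INTO member
    SELECT * FROM member_binlog;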
When we test a single table in a Kafka topic, index loading and subsequent consumption produce no duplicate data, but some duplicate data appears when multiple tables are consumed in parallel. The problem has occurred many times and is consistent: each time, the duplicates show up right after the first successful index load, at the job's first checkpoint, while the job itself keeps running.
The two duplicate Hudi records are identical apart from their metadata fields:
<img width="1488" alt="image"
src="https://user-images.githubusercontent.com/76689593/155283951-553e7bf1-0521-4476-bf02-f1c7ae5f3eaa.png">
Duplicate Hudi data found:
<img width="1258" alt="image"
src="https://user-images.githubusercontent.com/76689593/155284057-e416e7b9-f46f-4a8d-b604-bc6b6af823a1.png">
Hudi write configuration used:

    WITH (
      'connector' = 'hudi',
      'path' = 'hdfs:///prod/xxx/member',
      'table.type' = 'MERGE_ON_READ',
      'index.bootstrap.enabled' = 'true',
      'changelog.enabled' = 'true',
      'write.tasks' = '4',
      'write.bucket_assign.tasks' = '1',
      'write.task.max.size' = '4096',
      'write.merge.max_memory' = '1024',
      'compaction.tasks' = '2',
      'compaction.trigger.strategy' = 'num_or_time',
      'compaction.delta_commits' = '2',
      'compaction.delta_seconds' = '120',
      'compaction.max_memory' = '2048',
      'read.tasks' = '6',
      'read.streaming.enabled' = 'true',
      'read.streaming.check-interval' = '10',
      'read.start-commit' = 'earliest'
    );
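For completeness, these options belong in the WITH clause of a table DDL along the lines of the sketch below. The column list is illustrative; the PRIMARY KEY is the relevant part, since index bootstrap and upsert deduplication key off the declared record key:

    CREATE TABLE member (
      id BIGINT,
      name STRING,
      PRIMARY KEY (id) NOT ENFORCED  -- record key used for upsert/dedup
    ) WITH (
      'connector' = 'hudi',
      'path' = 'hdfs:///prod/xxx/member'
      -- ... plus the remaining options listed above
    );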
Hudi version: master
Flink version: 1.13.2