zhilinli123 opened a new issue #4881:
URL: https://github.com/apache/hudi/issues/4881
We use Flink CDC to capture MySQL's binlog and send the changes to Kafka for consumption. The full historical data is first imported into HDFS offline in batches (bulk_insert mode), and then a streaming job with index bootstrap enabled consumes the Kafka binlog data incrementally, as sketched below.
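For reference, the two phases look roughly like the following sketch. The source table names (member_full_dump, member_binlog) and the OPTIONS hint are illustrative, not taken from our actual job:

    -- Phase 1: one-off offline import of the full dump in bulk_insert mode
    INSERT INTO member /*+ OPTIONS('write.operation'='bulk_insert') */
    SELECT * FROM member_full_dump;

    -- Phase 2: streaming upserts from the Kafka binlog; with
    -- 'index.bootstrap.enabled'='true' the existing keys from phase 1 are
    -- loaded into state before new records are consumed
    INSERT INTO member
    SELECT * FROM member_binlog;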
When we test a single table in a Kafka topic, index loading and subsequent consumption produce no duplicate data, but some duplicate data appears when multiple tables are consumed in parallel. The problem has occurred many times and is consistent: each time, the duplicates show up right after the first successful index load, at the job's first checkpoint, while the job itself keeps running.
The two duplicate Hudi records are identical apart from their metadata fields:
<img width="1488" alt="image"
src="https://user-images.githubusercontent.com/76689593/155283951-553e7bf1-0521-4476-bf02-f1c7ae5f3eaa.png">
Duplicate Hudi data found:
<img width="1258" alt="image"
src="https://user-images.githubusercontent.com/76689593/155284057-e416e7b9-f46f-4a8d-b604-bc6b6af823a1.png">
Hudi write configuration used:

    WITH (
      'connector' = 'hudi',
      'path' = 'hdfs:///prod/xxx/member',
      'table.type' = 'MERGE_ON_READ',
      'index.bootstrap.enabled' = 'true',
      'changelog.enabled' = 'true',
      'write.tasks' = '4',
      'write.bucket_assign.tasks' = '1',
      'write.task.max.size' = '4096',
      'write.merge.max_memory' = '1024',
      'compaction.tasks' = '2',
      'compaction.trigger.strategy' = 'num_or_time',
      'compaction.delta_commits' = '2',
      'compaction.delta_seconds' = '120',
      'compaction.max_memory' = '2048',
      'read.tasks' = '6',
      'read.streaming.enabled' = 'true',
      'read.streaming.check-interval' = '10',
      'read.start-commit' = 'earliest'
    );
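For completeness, these options belong in the WITH clause of a table DDL along the lines of the sketch below. The column list is illustrative; the PRIMARY KEY is the relevant part, since index bootstrap and upsert deduplication key off the declared record key:

    CREATE TABLE member (
      id BIGINT,
      name STRING,
      PRIMARY KEY (id) NOT ENFORCED  -- record key used for upsert/dedup
    ) WITH (
      'connector' = 'hudi',
      'path' = 'hdfs:///prod/xxx/member'
      -- ... plus the remaining options listed above
    );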
Hudi version: master
Flink version: 1.13.2