konwu created HUDI-3286:
---------------------------

             Summary: Duplicate records when Flink task restarts with 
index.bootstrap=true
                 Key: HUDI-3286
                 URL: https://issues.apache.org/jira/browse/HUDI-3286
             Project: Apache Hudi
          Issue Type: Bug
          Components: flink
            Reporter: konwu


    In our company we use COW tables and always run Flink with 
index.bootstrap=true.

I found duplicate records after a Flink task restart. Some abnormal logs:

 
./hadoop-014-018.th.bigdata.ly_22259:2022-01-10 11:30:19,016 INFO  
org.apache.hudi.sink.partitioner.BucketAssigner              [] - For 
partitionPath :  Small Files => [SmallFile {location=HoodieRecordLocation 
{instantTime=20220110110939, fileId=2d1b050f-5610-4c0a-b15c-3c2d5a9affe3}, 
sizeBytes=41992073}, SmallFile {location=HoodieRecordLocation 
{instantTime=20220110110939, fileId=3c349304-e012-4915-b59d-a3bfca18c218}, 
sizeBytes=3658074}]


./hadoop-052-096.th.bigdata.ly_28867:2022-01-10 11:30:15,955 INFO  
org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish 
sending index records, taskId = 5.
./hadoop-052-096.th.bigdata.ly_28867:2022-01-10 11:30:19,794 INFO  
org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish 
sending index records, taskId = 3.
./hadoop-014-044.th.bigdata.ly_42121:2022-01-10 11:30:31,459 INFO  
org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish 
sending index records, taskId = 4.
./hadoop-014-044.th.bigdata.ly_42121:2022-01-10 11:30:38,706 INFO  
org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish 
sending index records, taskId = 0.
./hadoop-014-018.th.bigdata.ly_22259:2022-01-10 11:30:41,592 INFO  
org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish 
sending index records, taskId = 2.
./hadoop-014-018.th.bigdata.ly_22259:2022-01-10 11:30:47,130 INFO  
org.apache.hudi.sink.bootstrap.BootstrapFunction             [] - Finish 
sending index records, taskId = 1.
 
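BucketAssigner starts processing data before all index bootstrap tasks have 
finished: its log line above is at 11:30:19,016, while tasks 3, 4, 0, 2 and 1 
report "Finish sending index records" only afterwards.

This happens because, after a restart, the job reuses the GlobalAggregate from 
the previous run. Adding a suffix (for example to the aggregate name) could 
avoid this.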
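
To illustrate the suspected cause, here is a minimal sketch in Python. It is a simplified model and not Hudi or Flink code: GlobalAggregateStore, report_finished, all_bootstrapped and the "bootstrap-attempt-1" name are hypothetical names made up for this example. It assumes the aggregate is kept per name and survives a task restart, and it uses 6 bootstrap tasks to match taskId 0 to 5 in the logs above.

```python
class GlobalAggregateStore:
    """Hypothetical stand-in for a job-wide aggregate that outlives task restarts."""

    def __init__(self):
        # aggregate name -> set of task ids that reported "finished"
        self._finished = {}

    def report_finished(self, name, task_id):
        """Record that task_id finished and return how many tasks are done under this name."""
        done = self._finished.setdefault(name, set())
        done.add(task_id)
        return len(done)


def all_bootstrapped(store, name, task_id, parallelism):
    """A task treats bootstrap as complete once every task has reported under `name`."""
    return store.report_finished(name, task_id) == parallelism


PARALLELISM = 6  # taskId 0..5 in the logs
store = GlobalAggregateStore()

# First run: only the last reporting task sees the full count.
first_run = [all_bootstrapped(store, "bootstrap", t, PARALLELISM)
             for t in range(PARALLELISM)]

# Restart reusing the same name: the old set is still full, so the first
# task to report believes every task is done, although the others are not.
stale = all_bootstrapped(store, "bootstrap", 5, PARALLELISM)

# Restart with a per-attempt suffix (hypothetical): the count starts from zero.
fresh = all_bootstrapped(store, "bootstrap-attempt-1", 5, PARALLELISM)

print(first_run, stale, fresh)
```

In this model the stale aggregate lets downstream processing start as soon as one task finishes after the restart, which would match BucketAssigner running before the remaining "Finish sending index records" messages. With a suffix per restart, the check only passes after all tasks have reported again.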



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
