yihua commented on issue #9026: URL: https://github.com/apache/hudi/issues/9026#issuecomment-1631242762
> Did you add new partition during that step? It turns out the duplication occurs when new partitions are added after compaction. See below: when no new partitions, no duplication. When new partitions, then it gets tons of duplicates.

Yes. After the compaction in the MDT finished, I killed the write job. Usually, the MDT compaction happens before a deltacommit in the MDT, so after a complete data table transaction there is always a deltacommit following the compaction commit in the MDT.

See these two sample tables:

[test_corrupted_mdt_compaction_latest.tar.gz](https://github.com/apache/hudi/files/12017589/test_corrupted_mdt_compaction_latest.tar.gz): the compaction commit is the latest instant in the MDT.

```
20230708221954673.deltacommit
20230708221954673.deltacommit.inflight
20230708221954673.deltacommit.requested
20230708221954673001.commit
20230708221954673001.compaction.inflight
20230708221954673001.compaction.requested
```

[test_corrupted_mdt_compaction_and_commit.tar.gz](https://github.com/apache/hudi/files/12017590/test_corrupted_mdt_compaction_and_commit.tar.gz): there is one more deltacommit, which adds a new partition.

```
20230708221954673.deltacommit
20230708221954673.deltacommit.inflight
20230708221954673.deltacommit.requested
20230708221954673001.commit
20230708221954673001.compaction.inflight
20230708221954673001.compaction.requested
20230708235345986.deltacommit
20230708235345986.deltacommit.inflight
20230708235345986.deltacommit.requested
```

The Spark datasource read on the first MDT does not return duplicates, while the Spark datasource read on the second MDT does. My suspicion is that the HFile itself does not contain duplicates, but that either the merging or the MOR snapshot relation in Spark has an issue, causing the duplicates.
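For anyone who wants to check which of the two states their own MDT timeline is in, here is a minimal sketch that classifies the timeline file names listed in this comment. It is plain Python, not a Hudi API: the helper names `completed_instants` and `deltacommits_after_last_compaction` are hypothetical, and it assumes that a completed instant is a file named `<timestamp>.<action>` with no `.requested` or `.inflight` suffix, and that a `.commit` file in the MDT timeline is the completed compaction, as in the two listings.

```python
# Hypothetical helpers for inspecting an MDT timeline listing; not part of Hudi.

PENDING_STATES = {"requested", "inflight"}

# File names copied from the two sample tables in this comment.
FIRST_MDT = [
    "20230708221954673.deltacommit",
    "20230708221954673.deltacommit.inflight",
    "20230708221954673.deltacommit.requested",
    "20230708221954673001.commit",
    "20230708221954673001.compaction.inflight",
    "20230708221954673001.compaction.requested",
]
SECOND_MDT = FIRST_MDT + [
    "20230708235345986.deltacommit",
    "20230708235345986.deltacommit.inflight",
    "20230708235345986.deltacommit.requested",
]


def completed_instants(file_names):
    """Return (timestamp, action) pairs for completed instants, in timeline order."""
    completed = []
    for name in file_names:
        parts = name.split(".")
        # Assumption: completed files are "<timestamp>.<action>"; pending files
        # carry a third component such as "inflight" or "requested".
        if len(parts) == 2 and parts[1] not in PENDING_STATES:
            completed.append((parts[0], parts[1]))
    # Timestamps are compared as strings, so the compaction instant
    # "20230708221954673001" sorts right after "20230708221954673".
    return sorted(completed)


def deltacommits_after_last_compaction(file_names):
    """Return timestamps of completed deltacommits newer than the last compaction commit."""
    instants = completed_instants(file_names)
    compactions = [ts for ts, action in instants if action == "commit"]
    if not compactions:
        return [ts for ts, action in instants if action == "deltacommit"]
    last_compaction = compactions[-1]
    return [
        ts for ts, action in instants
        if action == "deltacommit" and ts > last_compaction
    ]


if __name__ == "__main__":
    print(deltacommits_after_last_compaction(FIRST_MDT))
    print(deltacommits_after_last_compaction(SECOND_MDT))
```

An empty result corresponds to the first table (compaction commit is the latest instant, no duplicates observed on read), and a non-empty result corresponds to the second table (a deltacommit on top of the compacted base file, where the duplicates showed up).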
