hbgstc123 commented on PR #8950: URL: https://github.com/apache/hudi/pull/8950#issuecomment-1589678133
> > if not archived, then there will be duplicated base files.
>
> How are these duplicates generated?

For compaction: suppose the first run compacts `fileID1_timestamp1.log` and `fileID1_0-1-0_timestamp1.parquet`, generating `fileID1_0-1-0_timestamp2.parquet`, and the job fails after the compaction is committed. On failover the job reruns the same compaction instant: the second run again compacts `fileID1_timestamp1.log` and `fileID1_0-1-0_timestamp1.parquet`, but generates `fileID1_0-1-1_timestamp2.parquet` (note the different write token), and then fails to commit because the instant was already completed by the first run. Both files, `fileID1_0-1-0_timestamp2.parquet` and `fileID1_0-1-1_timestamp2.parquet`, are left on storage.

The same applies to clustering, but it is worse there: the second run generates a parquet file under a new file group, so subsequent reads of the table return wrong results.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
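To make the failure mode concrete, here is a minimal sketch that parses base file names of the form `fileID_writeToken_instantTime.parquet` (the naming pattern used in the comment above) and flags two base files that share the same file ID and commit instant but carry different write tokens. The parsing helper and duplicate-detection logic are illustrative assumptions for this sketch, not Hudi's actual implementation.

```python
# Illustrative sketch only: detect duplicate base files produced by a
# re-run compaction, assuming names look like "fileID_writeToken_instantTime.parquet".
from collections import defaultdict

def parse_base_file(name: str):
    # Split "fileID1_0-1-0_timestamp2.parquet" into its three components.
    stem = name[: -len(".parquet")]
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time

def find_duplicates(files):
    # Group by (fileID, instantTime); more than one entry means duplicate
    # base files for the same file slice, differing only in write token.
    groups = defaultdict(list)
    for f in files:
        file_id, _token, instant = parse_base_file(f)
        groups[(file_id, instant)].append(f)
    return {key: names for key, names in groups.items() if len(names) > 1}

files = [
    "fileID1_0-1-0_timestamp1.parquet",
    "fileID1_0-1-0_timestamp2.parquet",  # written by the first compaction run
    "fileID1_0-1-1_timestamp2.parquet",  # written by the failed-over rerun
]
print(find_duplicates(files))
```

Running this prints a single duplicate group for `("fileID1", "timestamp2")` containing both `timestamp2` parquet files, which is exactly the duplicated-base-file situation described above.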
