hbgstc123 commented on PR #8950: URL: https://github.com/apache/hudi/pull/8950#issuecomment-1589678133
> > if not archived, then there will be duplicated base files.
>
> How are these duplicates generated?

For compaction: suppose the first run compacts `fileID1_timestamp1.log` and `fileID1_0-1-0_timestamp1.parquet`, generating `fileID1_0-1-0_timestamp2.parquet`, and the job fails after the compaction is committed. On failover the job reruns the same compaction instant: the second run again compacts `fileID1_timestamp1.log` and `fileID1_0-1-0_timestamp1.parquet`, but generates `fileID1_0-1-1_timestamp2.parquet` (note the different write token), and then fails to commit because the instant was already completed by the first run. Both files, `fileID1_0-1-0_timestamp2.parquet` and `fileID1_0-1-1_timestamp2.parquet`, are left on storage.

The same applies to clustering, but it is worse there: the second run generates a parquet file under a new file group, so subsequent reads of the table return wrong results.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
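To make the failure mode concrete, here is a minimal sketch that parses base file names of the form `fileID_writeToken_instantTime.parquet` (the naming pattern used in the comment above) and flags two base files that share the same file ID and commit instant but carry different write tokens. The parsing helper and duplicate-detection logic are illustrative assumptions for this sketch, not Hudi's actual implementation.

```python
# Illustrative sketch only: detect duplicate base files produced by a
# re-run compaction, assuming names look like "fileID_writeToken_instantTime.parquet".
from collections import defaultdict

def parse_base_file(name: str):
    # Split "fileID1_0-1-0_timestamp2.parquet" into its three components.
    stem = name[: -len(".parquet")]
    file_id, write_token, instant_time = stem.split("_")
    return file_id, write_token, instant_time

def find_duplicates(files):
    # Group by (fileID, instantTime); more than one entry means duplicate
    # base files for the same file slice, differing only in write token.
    groups = defaultdict(list)
    for f in files:
        file_id, _token, instant = parse_base_file(f)
        groups[(file_id, instant)].append(f)
    return {key: names for key, names in groups.items() if len(names) > 1}

files = [
    "fileID1_0-1-0_timestamp1.parquet",
    "fileID1_0-1-0_timestamp2.parquet",  # written by the first compaction run
    "fileID1_0-1-1_timestamp2.parquet",  # written by the failed-over rerun
]
print(find_duplicates(files))
```

Running this prints a single duplicate group for `("fileID1", "timestamp2")` containing both `timestamp2` parquet files, which is exactly the duplicated-base-file situation described above.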
