zhuanshenbsj1 commented on PR #7401: URL: https://github.com/apache/hudi/pull/7401#issuecomment-1341895878
> Somehow I got the point of why the dataset duplication occurs: it is not because of out-of-order execution of clustering, but because the fs view with clustering instants relies on the replace commit metadata to composite the file snapshots with replaced file handles. If we archive a clustering instant that has not been cleaned yet, the replace commit metadata is gone and the duplicates happen.
>
> One golden rule for clustering archiving is: we can only archive a commit once we are sure its replaced instants have been cleaned successfully, though this should be very hard, I guess, because one clustering commit may replace multiple normal commits. A better way is to look up the archive timeline when there are clustering instants on the timeline.
>
> WDYT, @nsivabalan :)

How about adding a check before archiving: require that every instant to be archived has a completion time earlier than the start time of the latest cleaning instant.
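The proposed guard can be sketched as a plain timestamp comparison. This is a minimal illustration, not the actual Hudi archiving code: the class and method names below are hypothetical, and it relies only on the fact that Hudi instant times are lexicographically ordered timestamp strings (e.g. `20221208103000`), so string comparison gives chronological order.

```java
// Hypothetical sketch of the proposed pre-archive check; names are
// illustrative and not part of the actual Hudi API.
public class ArchiveGuard {

    /**
     * Returns true only if the candidate instant's completion time is
     * strictly earlier than the start time of the latest clean instant,
     * i.e. the clean has already covered the files this instant replaced.
     * Hudi instant times are lexicographically ordered timestamp strings,
     * so a plain string comparison suffices.
     */
    public static boolean canArchive(String instantCompletionTime,
                                     String latestCleanStartTime) {
        return instantCompletionTime.compareTo(latestCleanStartTime) < 0;
    }

    public static void main(String[] args) {
        // Completed before the latest clean started: safe to archive.
        System.out.println(canArchive("20221208103000", "20221208110000"));
        // Completed after the latest clean started: must be retained.
        System.out.println(canArchive("20221208120000", "20221208110000"));
    }
}
```

With such a check, any replace commit whose file groups might still be referenced by the fs view (because the clean has not yet processed them) would be retained on the active timeline instead of being archived.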
