zhuanshenbsj1 commented on PR #7401:
URL: https://github.com/apache/hudi/pull/7401#issuecomment-1341887845

   > Somehow I got the point of why the dataset duplication occurs: it is not 
because of out-of-order execution of clustering, but because the fs view 
with clustering instants relies on the replace commit metadata to composite the 
file snapshots with replaced file handles. If we archive a clustering instant 
that has not been cleaned yet, the replace commit metadata is gone and the 
duplicates happen.
   > 
   > One golden rule for clustering archiving is: we can archive the 
commit only when we are sure its replaced instants have been cleaned 
successfully. This could be very hard, I guess, because one clustering 
commit may replace multiple normal commits. A better way is to look up the 
archive timeline when there are clustering instants on the timeline.
   > 
   > WDYT, @nsivabalan :)
   
   How about this: when archiving an instant, check that the instant's 
completion time precedes the cleaning plan's generation time?
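
   To make the proposed check concrete, here is a minimal sketch (not the actual Hudi archival code; `canArchive` and its parameters are hypothetical names). It assumes Hudi-style instant times, i.e. timestamp strings such as `20221207123045` that order lexicographically, and archives a replace commit only if it completed before the clean plan was generated:

   ```java
   public class ArchiveGuard {
       // Hypothetical guard: a clustering/replace instant is safe to archive
       // only if it completed before the cleaning plan was generated, so its
       // replace commit metadata cannot be archived away while the replaced
       // file groups are still uncleaned.
       // Instant times are fixed-width timestamp strings, so plain
       // lexicographic comparison matches chronological order.
       static boolean canArchive(String instantCompletionTime,
                                 String cleanPlanGenerationTime) {
           return instantCompletionTime.compareTo(cleanPlanGenerationTime) < 0;
       }

       public static void main(String[] args) {
           // Completed before the clean plan was generated: safe to archive.
           System.out.println(canArchive("20221207120000", "20221207123000"));
           // Completed after the clean plan: replaced files may not be
           // cleaned yet, so keep the instant on the active timeline.
           System.out.println(canArchive("20221207124500", "20221207123000"));
       }
   }
   ```

   The appeal of this check is that it sidesteps tracking which normal commits a clustering commit replaced: the single timestamp comparison is a conservative proxy for "cleaning has already accounted for this instant."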


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
