zhuanshenbsj1 commented on PR #7401: URL: https://github.com/apache/hudi/pull/7401#issuecomment-1341895878
> Somehow I got the point of why the dataset duplication occurs: it is not because of out-of-order execution of clustering, but because the fs view with clustering instants relies on the replace commit metadata to composite the file snapshots with replaced file handles. If we archive a clustering instant that has not been cleaned yet, the replace commit metadata is gone and the duplicates happen.
>
> One golden rule for clustering archiving is: we can only archive a commit once we are sure its replaced instants have been cleaned successfully, though this should be very hard, I guess, because one clustering commit may replace multiple normal commits. A better way is to look up the archive timeline when there are clustering instants on the timeline.
>
> WDYT, @nsivabalan :)

How about adding a check before archiving: require that every instant to be archived has a completion time earlier than the start time of the latest cleaning instant.
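The proposed guard can be sketched as a plain timestamp comparison. This is a minimal illustration, not the actual Hudi archiving code: the class and method names below are hypothetical, and it relies only on the fact that Hudi instant times are lexicographically ordered timestamp strings (e.g. `20221208103000`), so string comparison gives chronological order.

```java
// Hypothetical sketch of the proposed pre-archive check; names are
// illustrative and not part of the actual Hudi API.
public class ArchiveGuard {

    /**
     * Returns true only if the candidate instant's completion time is
     * strictly earlier than the start time of the latest clean instant,
     * i.e. the clean has already covered the files this instant replaced.
     * Hudi instant times are lexicographically ordered timestamp strings,
     * so a plain string comparison suffices.
     */
    public static boolean canArchive(String instantCompletionTime,
                                     String latestCleanStartTime) {
        return instantCompletionTime.compareTo(latestCleanStartTime) < 0;
    }

    public static void main(String[] args) {
        // Completed before the latest clean started: safe to archive.
        System.out.println(canArchive("20221208103000", "20221208110000"));
        // Completed after the latest clean started: must be retained.
        System.out.println(canArchive("20221208120000", "20221208110000"));
    }
}
```

With such a check, any replace commit whose file groups might still be referenced by the fs view (because the clean has not yet processed them) would be retained on the active timeline instead of being archived.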
