nsivabalan commented on issue #17866: URL: https://github.com/apache/hudi/issues/17866#issuecomment-4058376673
Q: For HDFS, in step (3), after the rename we will also need to delete files that aren't in the latest file slice, right? (And similarly, in step 1 for cloud, we only want to copy over those files?)

A: Yes. We could do this as step 3, and with a checkpoint we should be able to do this reliably even with failures.

Q: Having a user-provided checkpoint folder to store checkpoints should resolve the issue of retries. But I am still worried about a (very unlikely) edge case: if the writer gets stuck for a long time between steps 2 and 3, and a bunch of writes and cleans happen in the meantime, then it's technically possible that by the time we get to step 3 the target partition(s) have already had their files fully deleted by a clean. I'm trying to think of a way we can prevent this (either in the initial design or as a follow-up), and I don't think we can leverage checkpointing here. Could we maybe do (3) as a pre-commit operation of the deletePartitions call (before committing the deletePartitions instant)? That way, clean will be blocked by this deletePartition instant anyway (until it commits). Or does that expose us to some other edge case?

A: Are you talking about HDFS or cloud storage? If `deletePartition` has not completed, how can a clean operation delete the files? I am not sure I get your question. Can you please clarify? We can go over the rest of the questions once I get clarity on the above.

The general idea is that this is an admin operation and does not warrant any changes to the timeline or the writeClient machinery. That way, we don't manipulate files in an ad hoc manner that goes against what a write operation is expected to do (which is how a well-behaved user is expected to operate Hudi). An example of such ad hoc manipulation would be using a pre-commit validator hook within `deletePartition` to delete data files.
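To make the checkpointing idea concrete, here is a minimal sketch of how a multi-step admin operation could record progress in a user-provided checkpoint folder so that a retry skips steps that already finished. This is not Hudi code: `run_with_checkpoints`, the `.done` marker files, and the step names are all hypothetical, and the sketch assumes each step is itself idempotent. It only addresses retries after failure, not the clean-related race raised in the second question.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple


def run_with_checkpoints(checkpoint_dir: Path,
                         steps: List[Tuple[str, Callable[[], None]]]) -> List[str]:
    """Run steps in order, skipping any step whose marker file already exists.

    Returns the names of the steps actually executed in this attempt.
    All names here are hypothetical; this is not a Hudi API.
    """
    checkpoint_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for name, action in steps:
        marker = checkpoint_dir / f"{name}.done"
        if marker.exists():
            continue  # finished in an earlier attempt, so skip it
        # A crash after action() but before the marker is written means the
        # step runs again on retry, so each action must be idempotent.
        action()
        marker.touch()  # record completion only after the step succeeds
        executed.append(name)
    return executed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Placeholder steps; real steps would copy, rename, or delete files.
        steps = [(f"step{i}", lambda: None) for i in (1, 2, 3)]
        print(run_with_checkpoints(Path(tmp) / "checkpoints", steps))
        # A second attempt finds all markers and has nothing left to do.
        print(run_with_checkpoints(Path(tmp) / "checkpoints", steps))
```

Writing the marker only after the step returns means a failure can cause a step to be repeated but never skipped, which is why the steps themselves need to tolerate being re-run.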
