nsivabalan commented on issue #17866:
URL: https://github.com/apache/hudi/issues/17866#issuecomment-4058376673

   Q: For HDFS, in step (3) after the rename, we will also need to delete files 
that aren't in the latest file slice, right? (And similarly, in step 1 for cloud, 
we only want to copy over those files?)
   
   A: Yes. We could do this as step 3, and with checkpointing we should be able 
to do it reliably even in the presence of failures. 
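
   To illustrate the checkpointing idea in the answer above, here is a minimal 
sketch, not Hudi code. The names `run_with_checkpoint` and `delete_stale_files`, 
the JSON checkpoint file, and the step names are all hypothetical; the only 
assumption taken from the discussion is that each step is idempotent and that 
completed steps are recorded in a user-provided checkpoint location.

```python
import json
import os


def run_with_checkpoint(checkpoint_path, steps):
    """Run named steps in order, skipping any already recorded as done."""
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)
    for name, action in steps:
        if name in done:
            continue  # finished in an earlier attempt
        # Each action must be idempotent: a crash between the action and the
        # checkpoint write means the action runs again on retry.
        action()
        done.append(name)
        # Write to a temp file and rename it into place, so a crash never
        # leaves a half-written checkpoint behind.
        tmp = checkpoint_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(done, f)
        os.replace(tmp, checkpoint_path)
    return done


def delete_stale_files(partition_dir, latest_slice_files):
    """Hypothetical step 3: remove files that are not in the latest file slice."""
    for name in sorted(os.listdir(partition_dir)):
        if name not in latest_slice_files:
            os.remove(os.path.join(partition_dir, name))
```

   On a retry, `run_with_checkpoint` is called again with the same checkpoint 
path and step list; steps already recorded are skipped, and the step that was 
interrupted is simply re-run.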
   
   Q: Having a user-provided checkpoint folder to store checkpoints should 
resolve the issues of retries. But I am still worried about a (very unlikely) 
edge case - in between step 2 and 3, if the writer gets stuck for a long time 
and a bunch of writes and cleans happen, then it's technically possible that by 
the time we get to step 3 the target partition(s) have already had their files 
fully deleted by a clean. I'm trying think of a way we can prevent this (either 
in the initial design or as a follow-up), and I don't think we can leverage 
checkpointing here. Could we maybe do (3) as a pre-commit operation to the 
deletePartitions call (before committing the deletePartitions instant)? Since 
that way clean will anyway be blocked by this deletePartition instant (until it 
commits). Or does that expose us to some other edge case?
   
   Are you talking about HDFS or cloud storage? 
   If `deletePartition` has not completed, how can a clean operation delete the 
files? I am not sure I understand your question. Can you please clarify? 
   
   We can go over the rest of the questions once I get clarity on the above. 
   
   The general idea is that this is an admin operation, and it does not warrant 
any changes to the timeline or the writeClient machinery. That way, we don't 
manipulate files in an ad hoc manner that goes against what a write operation 
is expected to do (which is how a well-behaved user is expected to operate 
Hudi). For example, using a pre-commit validator hook within `deletePartition` 
to delete data files would be that kind of ad hoc manipulation. 
   
   

