kbuci commented on issue #17866:
URL: https://github.com/apache/hudi/issues/17866#issuecomment-4042852348

   Thanks @nsivabalan for sharing! I had a few initial questions:
   - For HDFS, in step (3), after the rename we will also need to delete any
files that aren't in the latest file slice, right? (And similarly, in step (1)
for cloud storage, we only want to copy over those latest-file-slice files?)
   - Having a user-provided checkpoint folder to store checkpoints should
resolve the retry issues. But I am still worried about a (very unlikely) edge
case: between steps (2) and (3), if the writer gets stuck for a long time and a
bunch of writes and cleans happen in the meantime, then it's technically
possible that by the time we reach step (3), the target partition(s) have
already had their files fully deleted by a clean. I'm trying to think of a way
to prevent this (either in the initial design or as a follow-up), and I don't
think checkpointing helps here. Could we instead do (3) as a pre-commit
operation of the `deletePartitions` call (i.e. before committing the
deletePartitions instant)? That way `clean` would be blocked by this
deletePartitions instant anyway (until it commits). Or does that expose us to
some other edge case?
   - Would it be feasible to follow @prashantwason's suggestion of running these
pre/post-write steps as pre/post-commit "hooks"? If we allowed users to pass in
custom functions to run before/after the deletePartitions API call (plus
extraMetadata for the "checkpoint" info), then we might not need this as a
separate utility procedure at all, just as implementations of those "hooks".
   - Although for our use case we expect the stash to succeed, or at least be
re-attempted, almost all the time, we should also be able to support a future
enhancement that automatically cleans up stuck/failed stash attempts. However,
it might not make sense to update `rollbackFailedWrites` in OSS to do this,
since this is a custom utility/setup anyway. Could we add support for passing a
custom user-provided class/function to `rollbackFailedWrites`? That way, if we
later want auto cleanup/re-attempts, we can implement them internally as needed.
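
The file selection in the first question (delete on HDFS, skip on cloud for anything outside the latest file slice) can be sketched roughly as follows. This is a hypothetical, simplified illustration: `FileEntry`, `latestSlicePaths`, and `olderSlicePaths` are made-up names, each file is reduced to a file-group id plus the instant time that wrote it, and log files and pending instants are ignored. A real implementation would presumably rely on Hudi's own file-system view rather than this logic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class LatestSliceFilter {

  // Hypothetical, simplified view of a data file in a partition: the file group it
  // belongs to, the instant time that wrote it, and its path. Log files are ignored.
  record FileEntry(String fileId, String instantTime, String path) {}

  // Paths of the files in the latest slice of each file group. Assumes all instant
  // times have the same width, so lexicographic order matches time order.
  static Set<String> latestSlicePaths(List<FileEntry> files) {
    Map<String, String> latestInstant = new HashMap<>();
    for (FileEntry f : files) {
      latestInstant.merge(f.fileId(), f.instantTime(),
          (a, b) -> a.compareTo(b) >= 0 ? a : b);
    }
    return files.stream()
        .filter(f -> f.instantTime().equals(latestInstant.get(f.fileId())))
        .map(FileEntry::path)
        .collect(Collectors.toCollection(TreeSet::new));
  }

  // Paths outside the latest slices: deleted after the rename on HDFS,
  // skipped during the copy on cloud storage.
  static Set<String> olderSlicePaths(List<FileEntry> files) {
    Set<String> latest = latestSlicePaths(files);
    return files.stream()
        .map(FileEntry::path)
        .filter(p -> !latest.contains(p))
        .collect(Collectors.toCollection(TreeSet::new));
  }

  public static void main(String[] args) {
    List<FileEntry> files = List.of(
        new FileEntry("fg1", "20240101000000000", "dt=2024-01-01/fg1_v1.parquet"),
        new FileEntry("fg1", "20240102000000000", "dt=2024-01-01/fg1_v2.parquet"),
        new FileEntry("fg2", "20240101000000000", "dt=2024-01-01/fg2_v1.parquet"));
    System.out.println("keep/copy:   " + latestSlicePaths(files));
    System.out.println("delete/skip: " + olderSlicePaths(files));
  }
}
```

The same selection serves both storage types; only what happens to the two sets differs (delete vs. skip).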

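The hook-based alternative from the second through fourth questions could look roughly like the toy sketch below. Everything in it is hypothetical: `DeletePartitionsHook`, `FailedWriteHandler`, `ToyWriter`, and the `stash.checkpoint` metadata key are not existing Hudi APIs, and the in-memory maps only model instant ordering, not a real timeline. It shows the stash running in `preCommit` while the deletePartitions instant is still inflight (so the toy `canClean()` returns false), the checkpoint location recorded in extraMetadata, and a user-provided handler passed to `rollbackFailedWrites` for instants left inflight by a failed stash.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StashHooksSketch {

  // Hypothetical hook around deletePartitions; NOT an existing Hudi interface.
  interface DeletePartitionsHook {
    // Called while the deletePartitions instant is still inflight (before commit).
    void preCommit(String instantTime, List<String> partitions, Map<String, String> extraMetadata);

    // Called after the instant has been committed.
    default void postCommit(String instantTime, List<String> partitions,
        Map<String, String> extraMetadata) {}
  }

  // Hypothetical user-provided callback that rollbackFailedWrites could invoke.
  interface FailedWriteHandler {
    void onFailedInstant(String instantTime, Map<String, String> extraMetadata);
  }

  // Toy in-memory "writer" that only models instant ordering. In this toy, metadata
  // for an inflight instant lives only in memory; how a real writer would persist
  // it before commit is an open question.
  static class ToyWriter {
    final Map<String, Map<String, String>> inflight = new TreeMap<>();
    final Map<String, Map<String, String>> completed = new TreeMap<>();

    // Toy rule standing in for the blocking behavior discussed above:
    // clean is not allowed while a deletePartitions instant is inflight.
    boolean canClean() {
      return inflight.isEmpty();
    }

    void deletePartitions(String instantTime, List<String> partitions, DeletePartitionsHook hook) {
      Map<String, String> extraMetadata = new HashMap<>();
      inflight.put(instantTime, extraMetadata);
      // The stash (copy or rename) runs here, before commit, so clean stays blocked.
      // If it throws, the instant is left inflight for rollbackFailedWrites.
      hook.preCommit(instantTime, partitions, extraMetadata);
      completed.put(instantTime, inflight.remove(instantTime));
      hook.postCommit(instantTime, partitions, extraMetadata);
    }

    // Rolls back every inflight instant, letting the user-provided handler clean up.
    List<String> rollbackFailedWrites(FailedWriteHandler handler) {
      List<String> rolledBack = new ArrayList<>(inflight.keySet());
      for (String instant : rolledBack) {
        handler.onFailedInstant(instant, inflight.remove(instant));
      }
      return rolledBack;
    }
  }

  public static void main(String[] args) {
    ToyWriter writer = new ToyWriter();
    // Successful stash: the checkpoint location is recorded in extraMetadata.
    writer.deletePartitions("001", List.of("dt=2024-01-01"), (t, parts, meta) -> {
      meta.put("stash.checkpoint", "/tmp/stash/" + t);
      System.out.println("clean allowed during stash: " + writer.canClean());
    });
    System.out.println("completed: " + writer.completed);

    // Failed stash: the instant stays inflight until rollbackFailedWrites runs the handler.
    try {
      writer.deletePartitions("002", List.of("dt=2024-01-02"), (t, parts, meta) -> {
        meta.put("stash.checkpoint", "/tmp/stash/" + t);
        throw new IllegalStateException("stash failed");
      });
    } catch (IllegalStateException e) {
      System.out.println("inflight after failure: " + writer.inflight.keySet());
    }
    List<String> rolledBack = writer.rollbackFailedWrites((t, meta) ->
        System.out.println("cleaning up " + meta.get("stash.checkpoint")));
    System.out.println("rolled back: " + rolledBack);
  }
}
```

Keeping the stash inside `preCommit` means a failure leaves the instant inflight rather than committed, which is what would let a pluggable rollback handler find the recorded checkpoint and clean up the partial stash later.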

-- 
This is an automated message from the Apache Git Service.