kbuci commented on issue #17866: URL: https://github.com/apache/hudi/issues/17866#issuecomment-4042852348
Thanks @nsivabalan for sharing! I had some initial questions:

- For HDFS in step (3), after the rename we will also need to delete files that aren't in the latest file slice, right? (And similarly, in step (1) for cloud we only want to copy over the files in the latest file slice?)
- Having a user-provided checkpoint folder to store checkpoints should resolve the issues with retries. But I am still worried about a (very unlikely) edge case: in between steps (2) and (3), if the writer gets stuck for a long time and a bunch of writes and cleans happen, then it's technically possible that by the time we get to step (3) the target partition(s) have already had their files fully deleted by a clean. I'm trying to think of a way we can prevent this (either in the initial design or as a follow-up), and I don't think we can leverage checkpointing here. Could we maybe do (3) as a pre-commit operation of the `deletePartitions` call (before committing the `deletePartitions` instant)? That way `clean` will be blocked by this `deletePartitions` instant anyway (until it commits). Or does that expose us to some other edge case?
- Would it be feasible to do @prashantwason's suggestion of doing these pre/post write steps via pre/post commit "hooks"? If we allowed users to pass in custom functions to run before/after the `deletePartitions` API call (plus an extraMetadata for the "checkpoint" info), then we might not need this as a separate utility procedure; it could just be implementations of those "hooks".
- Although for our use case we expect the stash to succeed, or at least be re-attempted, almost all the time, we should also be able to support a future enhancement that "automatically cleans up" stuck/failed stash attempts. Unfortunately, it might not make sense to update `rollbackFailedWrites` in OSS to do this, since this is a custom utility/setup anyway. Could we add support for passing a custom user-provided class/function to `rollbackFailedWrites`? That way, if in the future we want it to do auto cleanup/re-attempts, we can implement that internally if needed.
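
To make the hook idea concrete, here is a minimal, language-neutral sketch in Python of the ordering being proposed: the instant is requested first, the stash step runs as a pre-commit hook, and only then is the instant committed together with the hook's extra metadata. Everything in it is hypothetical and for illustration only (`FakeTimeline`, `delete_partitions_with_hooks`, the hook signature, and the `stash.checkpoint` key are not Hudi APIs). It also does not model `clean` at all; whether `clean` is actually blocked while the instant is pending is the premise of the question above, not something this sketch verifies.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical hook signature: receives the target partitions and a mutable
# extra-metadata dict that is stored with the commit.
Hook = Callable[[List[str], Dict[str, str]], None]


@dataclass
class FakeTimeline:
    """Stand-in for a table timeline; it only records the order of events."""
    events: List[str] = field(default_factory=list)
    committed_metadata: Dict[str, str] = field(default_factory=dict)


def delete_partitions_with_hooks(
    timeline: FakeTimeline,
    partitions: List[str],
    pre_commit: Hook,
    post_commit: Hook,
) -> Dict[str, str]:
    extra_metadata: Dict[str, str] = {}
    # The instant is requested before any hook runs, so it is already
    # pending while the pre-commit (stash) step executes.
    timeline.events.append("instant requested")
    try:
        pre_commit(partitions, extra_metadata)
    except Exception:
        # The instant is never committed; a failed-write cleanup step
        # (possibly user-provided) would have to deal with it later.
        timeline.events.append("instant left pending")
        raise
    timeline.committed_metadata = dict(extra_metadata)
    timeline.events.append("instant committed")
    post_commit(partitions, extra_metadata)
    return extra_metadata


timeline = FakeTimeline()


def stash_files(partitions: List[str], extra_metadata: Dict[str, str]) -> None:
    # Stand-in for step (3): a real hook would move/copy the latest file slices.
    timeline.events.append("stash " + ",".join(partitions))
    extra_metadata["stash.checkpoint"] = "done:" + ",".join(partitions)


def report(partitions: List[str], extra_metadata: Dict[str, str]) -> None:
    timeline.events.append("post-commit")


result = delete_partitions_with_hooks(timeline, ["2024/01/01"], stash_files, report)
```

If the pre-commit hook raises, the instant stays pending and nothing is committed, which is the case a user-provided cleanup passed to `rollbackFailedWrites` would need to handle.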
