nsivabalan commented on issue #17866: URL: https://github.com/apache/hudi/issues/17866#issuecomment-3774766575
Hey, thanks for the feature request @kbuci. I have some clarifications on the ask. Can you help with them?

Stash partitions:
- Can you confirm we are only interested in the latest file slices and not older ones? If so, once we restore, we may not be able to run time travel queries; only snapshot queries will be feasible. Just wanted to clarify the requirements.
- How does this relate to the "delete_partition" operation we already have? Is it an add-on to "delete_partition", wherein instead of nuking the contents of the partition (which is the default behavior with delete partition), we move the contents to a new folder but still mark the partition as unavailable for data table consumers? Btw, in this case, I assume stashing will be synchronous, right? i.e. the partition can never be marked as deleted for consumers until the stashing completes successfully.
- What happens if there are concurrent writes going into the partition of interest when the "stashPartitions" operation is invoked?
- In the case of a MOR table, this could also mean we back up log files as well and not just base files. Is my understanding right?

Restore partitions:
- Can we do an insert_overwrite operation in this case? Do note that commit times for the data might differ after restoring if we take this route, but it might be cleaner. If not, we might need special handling of updates to metadata table writes. With streaming writer support in 1.x, that might be challenging as well.

Requirements: can you throw some more light on this requirement?

failures and rollbacks
If the operation fails after creating a plan, then it should be eventually rolled back by a rollback call (as part of clean’s rollback of failed writes).
The rollback implementation should consist of “undoing” all the DFS operations: after rollback is completed, any partitions that were attempted to be stashed should still have their (latest) data files and any partitions attempted to be restored should still remain empty.

I wanted to brainstorm an idea for this requirement:
- Say we add support for a new operation called "soft_delete_partitions", where users can specify a list of partitions to be soft deleted. Hudi will stop serving these partitions to snapshot queries once the "soft_delete_partitions" operation succeeds, but the cleaner will never attempt to clean them up until the partitions are "hard deleted".
- Any partition that was previously "soft deleted" is eligible to be "hard deleted" using another operation called "hard_delete_partitions".
- We could also think of adding TTL support to the "soft_delete_partitions" operation. When the time elapses, a TTL table service would automatically trigger hard deletes for those partitions, so the user does not need to explicitly trigger "hard_delete_partitions".
- Users will be given an option to recover a soft-deleted partition, in which case the partition is put back in rotation.

Essentially, we are marking the partitions as unavailable for some time, and later we either unblock the cleaner (the partitions become eligible to be hard deleted) or mark the partitions as "available" for consumption again. This could be very efficient, since there is no data movement in this case and we do not need to manually fix the metadata table on restore, etc.

I am sure we need to flesh this design proposal out more, but wanted to get your thoughts.
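To make the proposed lifecycle a bit more concrete, here is a rough sketch of the partition states and transitions described above. Everything in it is hypothetical: `PartitionLifecycle`, its methods, and the in-memory maps are illustration only and are not existing Hudi APIs. It also assumes the state would be tracked per partition path, which the proposal does not specify.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the soft/hard delete lifecycle; not a Hudi API.
final class PartitionLifecycle {
    enum State { AVAILABLE, SOFT_DELETED, HARD_DELETED }

    private final Map<String, State> states = new HashMap<>();
    // Optional TTL deadline per soft-deleted partition.
    private final Map<String, Instant> expiry = new HashMap<>();

    State stateOf(String partition) {
        return states.getOrDefault(partition, State.AVAILABLE);
    }

    // soft_delete_partitions: hide the partition from readers, keep the data in place.
    void softDelete(String partition, Instant now, Duration ttl) {
        require(partition, State.AVAILABLE);
        states.put(partition, State.SOFT_DELETED);
        if (ttl != null) {
            expiry.put(partition, now.plus(ttl));
        }
    }

    // Recover: put a soft-deleted partition back in rotation, no data movement.
    void recover(String partition) {
        require(partition, State.SOFT_DELETED);
        states.put(partition, State.AVAILABLE);
        expiry.remove(partition);
    }

    // hard_delete_partitions: only allowed after a soft delete; unblocks the cleaner.
    void hardDelete(String partition) {
        require(partition, State.SOFT_DELETED);
        states.put(partition, State.HARD_DELETED);
        expiry.remove(partition);
    }

    // Stand-in for the TTL table service: hard delete every expired soft delete.
    void runTtlService(Instant now) {
        // Iterate over a copy because hardDelete mutates the expiry map.
        for (Map.Entry<String, Instant> e : new HashMap<>(expiry).entrySet()) {
            if (!e.getValue().isAfter(now)) {
                hardDelete(e.getKey());
            }
        }
    }

    boolean servedBySnapshotQuery(String partition) {
        return stateOf(partition) == State.AVAILABLE;
    }

    // The cleaner must leave soft-deleted partitions alone until they are hard deleted.
    boolean cleanerBlocked(String partition) {
        return stateOf(partition) == State.SOFT_DELETED;
    }

    private void require(String partition, State expected) {
        State actual = stateOf(partition);
        if (actual != expected) {
            throw new IllegalStateException(
                partition + " is " + actual + ", expected " + expected);
        }
    }
}
```

In this sketch a rollback of a failed soft delete would just be a state flip back to `AVAILABLE`, since no files move, which is the main appeal over physically stashing the data.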
