nsivabalan commented on issue #17866:
URL: https://github.com/apache/hudi/issues/17866#issuecomment-3774766575

   Hey, thanks for the feature request @kbuci. I have a few clarifying questions 
about the ask. Could you help with them? 
   
   Stash partitions:
   - Can you confirm we are only interested in the latest file slices and not 
older ones? Note that after restoring, time travel queries may not be possible; 
only snapshot queries will be feasible. Just wanted to clarify the requirements. 
   - How does this relate to the "delete_partition" operation we already have? 
Is it an add-on to "delete_partition", where instead of nuking the contents of 
the partition (the default behavior with delete partition), we move the 
contents to a new folder but still mark the partition as unavailable for data 
table consumers? By the way, in this case I assume stashing will be 
synchronous, right? i.e. the partition is never marked as deleted for consumers 
until the stashing completes successfully. 
   - What happens if there are concurrent writes going into the partition of 
interest when the "stashPartitions" operation is invoked?
   - In the case of a MOR table, this could also mean we back up log files as 
well, not just base files. Is my understanding right? 
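
   To make the "latest file slices" and MOR questions above concrete, here is a 
minimal sketch of which files a stash would have to pick up. The `FileSlice` 
class and both helper functions are hypothetical illustrations, not Hudi APIs; 
the sketch assumes instant times are strings that sort chronologically.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional


@dataclass
class FileSlice:
    """Hypothetical stand-in for one file slice of a file group."""
    file_group_id: str
    base_instant: str                 # assumed to sort chronologically as a string
    base_file: Optional[str] = None   # a slice may have only log files
    log_files: List[str] = field(default_factory=list)


def latest_slices(slices: Iterable[FileSlice]) -> Dict[str, FileSlice]:
    """Keep only the newest slice per file group; older slices are dropped."""
    latest: Dict[str, FileSlice] = {}
    for s in slices:
        current = latest.get(s.file_group_id)
        if current is None or s.base_instant > current.base_instant:
            latest[s.file_group_id] = s
    return latest


def files_to_stash(slices: Iterable[FileSlice], is_mor: bool) -> List[str]:
    """List the paths to back up: base files always, log files only for MOR."""
    latest = latest_slices(slices)
    paths: List[str] = []
    for file_group_id in sorted(latest):
        s = latest[file_group_id]
        if s.base_file is not None:
            paths.append(s.base_file)
        if is_mor:
            paths.extend(s.log_files)
    return paths
```

   Dropping the older slices is exactly what would make time travel queries 
unavailable after a restore, which is why the first question matters.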
   
   Restore partitions: 
   - Can we use the insert_overwrite operation in this case? Do note that 
commit times for the data might differ after restoring if we take this route, 
but it might be cleaner. If not, we might need special handling of updates to 
the metadata table. With streaming writer support in 1.x, that might be 
challenging as well. 
   
   Requirements: 
   Can you shed some more light on this requirement?
   - Failures and rollbacks: If the operation fails after creating a plan, then 
it should be eventually rolled back by a rollback call (as part of clean’s 
rollback of failed writes). The rollback implementation should consist of 
“undoing” all the DFS operations: after rollback is completed, any partitions 
that were attempted to be stashed should still have their (latest) data files 
and any partitions attempted to be restored should still remain empty.
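
   As I read it, "undoing all the DFS operations" amounts to replaying a record 
of file moves in reverse. Below is a minimal sketch of that reading on a local 
filesystem. The function names and the in-memory `journal` list are 
hypothetical; a real implementation would presumably derive the moves from the 
persisted plan rather than from memory.

```python
import os
import shutil
from typing import List, Tuple

Journal = List[Tuple[str, str]]  # (original path, stashed path) per planned move


def stash_files(partition_dir: str, stash_dir: str, journal: Journal) -> None:
    """Move every file in partition_dir to stash_dir, recording each move."""
    os.makedirs(stash_dir, exist_ok=True)
    for name in sorted(os.listdir(partition_dir)):
        src = os.path.join(partition_dir, name)
        dst = os.path.join(stash_dir, name)
        # Record the intent before moving, so a crash between the two steps
        # still leaves an entry that rollback can inspect.
        journal.append((src, dst))
        shutil.move(src, dst)


def rollback(journal: Journal) -> None:
    """Undo recorded moves in reverse order; safe to call more than once."""
    while journal:
        src, dst = journal.pop()
        # Only move back files that actually reached the stash location.
        if os.path.exists(dst) and not os.path.exists(src):
            os.makedirs(os.path.dirname(src), exist_ok=True)
            shutil.move(dst, src)
```

   A failed restore could be undone the same way with the roles of the two 
directories swapped, which would leave the restored partitions empty again.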
   
   
   Wanted to brainstorm an idea for this requirement: 
   - Say we add support for a new operation called "soft_delete_partitions", 
where users can specify a list of partitions to be soft deleted. Hudi will stop 
serving these partitions to snapshot queries once the "soft_delete_partitions" 
operation succeeds, but the cleaner will never attempt to clean them up until 
the partitions are "hard deleted". Any partition that was previously "soft 
deleted" is eligible to be "hard deleted" using another operation called 
"hard_delete_partitions". We could also think of adding TTL support to the 
"soft_delete_partitions" operation: when the time elapses, a TTL table service 
would automatically trigger hard deletes for those partitions, so the user does 
not need to explicitly trigger "hard_delete_partitions". Users will be given an 
option to recover a soft-deleted partition, in which case the partition is put 
back in rotation. Essentially, we are marking the partitions as unavailable for 
some time, and later either unblocking the cleaner (making them eligible to be 
hard deleted) or marking the partitions as "available" for consumption. 
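
   The lifecycle above can be sketched as a small state machine. This is only 
an in-memory illustration of the proposed transitions; `PartitionLifecycle` and 
its methods are hypothetical names, not existing Hudi APIs, and a real design 
would record these states on the timeline or in the metadata table.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class PartitionState(Enum):
    AVAILABLE = "available"
    SOFT_DELETED = "soft_deleted"
    HARD_DELETED = "hard_deleted"


@dataclass
class PartitionEntry:
    state: PartitionState = PartitionState.AVAILABLE
    soft_deleted_at: Optional[float] = None
    ttl_seconds: Optional[float] = None


class PartitionLifecycle:
    """Hypothetical in-memory model of the proposed partition states."""

    def __init__(self) -> None:
        self._entries: Dict[str, PartitionEntry] = {}

    def _require(self, partition: str, expected: PartitionState) -> PartitionEntry:
        entry = self._entries.setdefault(partition, PartitionEntry())
        if entry.state is not expected:
            raise ValueError(f"{partition} is {entry.state.value}, expected {expected.value}")
        return entry

    def soft_delete(self, partition: str, now: float,
                    ttl_seconds: Optional[float] = None) -> None:
        entry = self._require(partition, PartitionState.AVAILABLE)
        entry.state = PartitionState.SOFT_DELETED
        entry.soft_deleted_at = now
        entry.ttl_seconds = ttl_seconds

    def recover(self, partition: str) -> None:
        entry = self._require(partition, PartitionState.SOFT_DELETED)
        entry.state = PartitionState.AVAILABLE
        entry.soft_deleted_at = None
        entry.ttl_seconds = None

    def hard_delete(self, partition: str) -> None:
        # Only a previously soft-deleted partition may be hard deleted.
        entry = self._require(partition, PartitionState.SOFT_DELETED)
        entry.state = PartitionState.HARD_DELETED

    def expire(self, now: float) -> List[str]:
        """TTL pass: hard delete soft-deleted partitions whose TTL has elapsed."""
        expired = sorted(
            p for p, e in self._entries.items()
            if e.state is PartitionState.SOFT_DELETED
            and e.ttl_seconds is not None
            and now - e.soft_deleted_at >= e.ttl_seconds
        )
        for partition in expired:
            self.hard_delete(partition)
        return expired

    def _state(self, partition: str) -> PartitionState:
        return self._entries.get(partition, PartitionEntry()).state

    def is_queryable(self, partition: str) -> bool:
        return self._state(partition) is PartitionState.AVAILABLE

    def cleaner_may_remove(self, partition: str) -> bool:
        return self._state(partition) is PartitionState.HARD_DELETED
```

   Note that a soft-deleted partition is neither queryable nor cleanable, which 
is what lets recovery work without moving any data.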
   
   This could be very efficient since there is no data movement in this case 
and we do not need to manually fix the metadata table on restore, etc. I am 
sure we need to flesh this design proposal out more, but wanted to get your 
thoughts. 
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.