kbuci commented on issue #17866:
URL: https://github.com/apache/hudi/issues/17866#issuecomment-4101404257

   @nsivabalan thanks. Based on our prior discussions, let me summarize the planned approach to implementing stashing.
   For the steps below, to keep this discussion DFS-agnostic, let's assume that when we call `rename` we are internally calling a helper function that
   - Copies all data to the destination and then deletes it from the source if the filesystem is not HDFS; otherwise performs an actual rename
   - Is idempotent, in the sense that if it is called again on the same (source, dest) pair, it handles all intermediate states (everything still in the source, some files already in the dest, etc.)
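   A minimal local-filesystem sketch of such an idempotent helper (the function name is hypothetical, and `os`/`shutil` calls stand in for the DFS API; a real implementation would go through the Hadoop `FileSystem` interface):

   ```python
   import os
   import shutil

   def idempotent_rename(source: str, dest: str) -> None:
       """Sketch: move every file under `source` into `dest` via copy-then-delete,
       tolerating re-runs where some or all files were already moved.
       Assumes a flat directory of files (no nested subdirectories)."""
       os.makedirs(dest, exist_ok=True)
       if not os.path.isdir(source):
           # A prior run already moved everything and removed the source dir.
           return
       for name in os.listdir(source):
           src_file = os.path.join(source, name)
           dst_file = os.path.join(dest, name)
           # If a previous attempt already copied this file, skip the copy
           # and only clean up the leftover source copy.
           if not os.path.exists(dst_file):
               shutil.copy2(src_file, dst_file)
           os.remove(src_file)
       os.rmdir(source)
   ```

   Copy-before-delete is what makes re-running safe: a crash mid-way leaves each file in at least one of the two locations, and the next invocation converges to the same final state.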
   
   
   We will create a custom `SparkPreCommitValidator` that, when executed:
   ```
   1. Reads the stash partition parent folder from the extraMetadata of the ongoing operation (asserting that it is a delete_partition operation)
   2. Creates an empty partition -> status map
   3. Creates a Spark task for each partition. Within this task, gets the source path (basepath/partition) and the dest path (stash_folder/partition), creating the latter folders if not already created.
   3a. If the source path is completely empty, marks the partition in the map as FAILED
   3b. Otherwise, calls rename on source -> dest. If it succeeds, marks the partition as SUCCESS; otherwise, marks it as FAILED
   4. Writes out this partition -> status map to some file under the basepath, maybe /.hoodie/.stash/<instant time> ?
   ```
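   Steps 2-4 above can be sketched as follows (hypothetical function name; the local filesystem stands in for DFS, the partition loop stands in for the per-partition Spark tasks, and JSON is just one possible serialization for the status file):

   ```python
   import json
   import os

   def stash_partitions(base_path: str, stash_folder: str,
                        partitions: list, instant_time: str) -> dict:
       """Sketch of the validator body: stash each partition's files and
       persist a partition -> status map under the base path."""
       status = {}  # step 2: empty partition -> status map
       for partition in partitions:  # step 3: one task per partition (serial here)
           source = os.path.join(base_path, partition)
           dest = os.path.join(stash_folder, partition)
           os.makedirs(dest, exist_ok=True)
           if not os.path.isdir(source) or not os.listdir(source):
               status[partition] = "FAILED"  # step 3a: source is completely empty
               continue
           try:
               # step 3b: "rename" each file (a real impl would call the
               # idempotent copy-then-delete helper described earlier)
               for name in os.listdir(source):
                   os.replace(os.path.join(source, name), os.path.join(dest, name))
               status[partition] = "SUCCESS"
           except OSError:
               status[partition] = "FAILED"
       # step 4: write the map to a file, e.g. <base_path>/.hoodie/.stash/<instant_time>
       out_dir = os.path.join(base_path, ".hoodie", ".stash")
       os.makedirs(out_dir, exist_ok=True)
       with open(os.path.join(out_dir, instant_time), "w") as f:
           json.dump(status, f)
       return status
   ```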
   Now we can call the `deletePartitions` API with this "validator".
   
   Note that we should check that calling `deletePartition` on partitions that are empty (not even containing the .hoodie partition metafile, and/or no longer present in the MDT) will still not fail, and will still replace the files in the other (nonempty) partitions. If this is an issue, then we can relax (3) by forcing the validator to mark 3a as SUCCESS and to fail only if any rename operation fails.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]