kbuci opened a new issue, #17866:
URL: https://github.com/apache/hudi/issues/17866

   ### Task Description
   
   **What needs to be done:**
   
   - Add a "stashPartitions" operation in the Spark engine: 
`stashPartitions(List<String> partitionPaths, Path backupFolder)`. It will 
attempt to create a folder in `backupFolder` for each partition path in 
`partitionPaths`, where each folder contains all the latest committed data 
files (without any replaced/compacted/uncleaned files) from the corresponding 
partition in the base path. All files in the dataset's partition folder will 
then be removed. For example, `stashPartitions(['datestr=2023-01-01'], 
"/backup/folder")` will create the folder `/backup/folder/datestr=2023-01-01` 
containing all the latest data files from that partition in the dataset. If a 
given partition folder in the dataset was already stashed, it will be skipped. 
It will return a `HoodieWriteMetadata` that lists each partition and whether it 
"succeeded" in stashing or was "skipped".
   - Add a "restorePartitions" operation in the Spark engine: 
`restorePartitions(List<String> partitionPaths, Path backupFolder)`. It works 
as the "reverse" of stashPartitions; it will attempt to create a partition in 
the dataset base path for each partition path in `partitionPaths`, where each 
partition contains all the data files from the corresponding partition folder 
in the `backupFolder`. For example, `restorePartitions(['datestr=2023-01-01'], 
"/backup/folder")` will create `/<basepath>/datestr=2023-01-01` containing all 
the data files from `/backup/folder/datestr=2023-01-01`. If a given partition 
folder in the dataset was already restored, it will be skipped. It will return 
a `HoodieWriteMetadata` that lists each partition and whether it "succeeded" in 
restoring or was "skipped".
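
   The stash/skip semantics above can be sketched on a local filesystem with 
`java.nio.file`. This is only an illustration of the proposed behavior: the 
class name `StashSketch`, the `Map<String, String>` result, and the direct 
`Files.move` are all placeholders; a real implementation would go through 
Hudi's storage abstractions and return a `HoodieWriteMetadata`.

   ```java
   import java.io.IOException;
   import java.nio.file.*;
   import java.util.*;

   // Hypothetical sketch of the proposed stashPartitions semantics.
   public class StashSketch {

       // Returns partition path -> "succeeded" | "skipped".
       public static Map<String, String> stashPartitions(
               Path basePath, List<String> partitionPaths, Path backupFolder)
               throws IOException {
           Map<String, String> result = new LinkedHashMap<>();
           for (String partition : partitionPaths) {
               Path target = backupFolder.resolve(partition);
               if (Files.exists(target)) {
                   // Partition was already stashed on a previous attempt: skip it.
                   result.put(partition, "skipped");
                   continue;
               }
               // Recursively create parent folders for nested partitions
               // such as hour=x/minute=y.
               Files.createDirectories(target.getParent());
               // Single rename of the whole partition folder; on DFS like HDFS
               // this maps to an atomic, copy-free rename. Object stores would
               // need a copy-then-delete fallback instead.
               Files.move(basePath.resolve(partition), target);
               result.put(partition, "succeeded");
           }
           return result;
       }
   }
   ```

   Running it twice with the same arguments demonstrates the idempotent 
skip: the first call moves the partition folder into the backup location and 
reports "succeeded", the second sees the backup folder already present and 
reports "skipped".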
   
   Assumptions:
   - Once the user attempts to stash a partition, it must never be written to 
again.
   - Once a partition folder in the `backupFolder` has been part of a 
successful `restorePartitions`, it can be deleted.
   
   Requirements:
   - **data consistency** Upon a successful stash, data files from deleted 
partitions should be removed from internal metadata such as the MDT and 
indexes (similar to other Hudi operations), and queries against the partition 
should return no data. 
   - **usability** Stashing should recursively create folders in 
`backupFolder` as needed for partitions with nested folders, such as 
`hour=x/minute=y`
   - **immediate stash cleanup** When a stash operation commits, all files in 
the dataset partition (including Hudi internal files like 
.hoodie_partition_metadata) should have been deleted.
   - **fast rename for supported filesystems** For DFS implementations like 
HDFS that support atomic rename of folders without copies, we should use the 
DFS `rename` APIs to avoid waiting for each file to be copied
   - **failures and rollbacks** If the operation fails after creating a plan, 
it should eventually be rolled back by a rollback call (as part of clean's 
rollback of failed writes). The rollback implementation should consist of 
"undoing" all the DFS operations: any partitions whose stash was attempted 
should still have their (latest) data files, and any partitions whose restore 
was attempted should still remain empty.
   - **retries and concurrent writes** Before scheduling the plan, 
`stashPartitions` should start a heartbeat and check whether there are any 
inflight writes targeting the same partitions. It should attempt a rollback of 
any other `stashPartitions` plans with expired heartbeats. If any inflight 
instants still remain after that, it should raise an exception. The same 
behavior should apply to `restorePartitions`, except that it should only 
attempt rollback of other `restorePartitions` instants
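
   The concurrency gate in the last requirement could look roughly like the 
following. Everything here is simplified and hypothetical (`InflightWrite`, 
`gateStash`, plain booleans for heartbeat state); a real implementation would 
inspect Hudi's active timeline and heartbeat files rather than an in-memory 
list.

   ```java
   import java.util.*;

   // Hypothetical sketch of the pre-scheduling check for stashPartitions.
   public class ConcurrencyGateSketch {

       public static class InflightWrite {
           public final String instantTime;
           public final Set<String> partitions;
           public final boolean isStashPlan;
           public final boolean heartbeatExpired;
           public InflightWrite(String instantTime, Set<String> partitions,
                                boolean isStashPlan, boolean heartbeatExpired) {
               this.instantTime = instantTime;
               this.partitions = partitions;
               this.isStashPlan = isStashPlan;
               this.heartbeatExpired = heartbeatExpired;
           }
       }

       // Returns the instants to roll back (stale stash plans touching the
       // target partitions); throws if any live conflicting write remains.
       public static List<String> gateStash(List<InflightWrite> inflight,
                                            Set<String> targetPartitions) {
           List<String> toRollback = new ArrayList<>();
           List<String> blocking = new ArrayList<>();
           for (InflightWrite w : inflight) {
               // Only writes targeting the same partitions matter.
               if (Collections.disjoint(w.partitions, targetPartitions)) {
                   continue;
               }
               if (w.isStashPlan && w.heartbeatExpired) {
                   toRollback.add(w.instantTime); // stale stash plan: roll back
               } else {
                   blocking.add(w.instantTime);   // live conflicting write
               }
           }
           if (!blocking.isEmpty()) {
               throw new IllegalStateException(
                   "Inflight writes conflict with stash: " + blocking);
           }
           return toRollback;
       }
   }
   ```

   The `restorePartitions` variant would be the same shape, with the 
rollback-eligibility check restricted to other restore plans.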
   
   
   
   
   **Why this task is needed:**
   For our use case, when we apply TTL to older partitions of datasets, we 
need to 
   - Stash them in a separate location for a grace period, in case users 
request that they be added back
   - Ensure that we have an API that will synchronously remove all DFS 
"objects" in the TTL-ed partition folder, since even disregarding storage 
space, we need to "clean up" objects/inodes
   
   ### Task Type
   
   Other
   
   ### Related Issues
   
   **Parent feature issue:** (if applicable)
   **Related issues:**
   NOTE: Use `Relationships` button to add parent/blocking issues after issue 
is created.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
