kbuci opened a new issue, #17866: URL: https://github.com/apache/hudi/issues/17866
### Task Description

**What needs to be done:**

- Add a "stashPartitions" operation in the Spark engine: `stashPartitions(List<String> partitionPaths, Path backupFolder)`. It will attempt to create a folder in `backupFolder` for each partition path in `partitionPaths`, where each folder contains all the latest committed data files (without any replaced/compacted/uncleaned files) from the corresponding partition under the base path. All files in the dataset's partition folder are then removed. For example, `stashPartitions(['datestr=2023-01-01'], "/backup/folder")` will create a folder `/backup/folder/datestr=2023-01-01` containing all the latest data files from that partition in the dataset. If a given partition folder in the dataset was already stashed, it will be skipped. It will return a `HoodieWriteMetadata` that lists each partition and whether it "succeeded" in stashing or was "skipped".
- Add a "restorePartitions" operation in the Spark engine: `restorePartitions(List<String> partitionPaths, Path backupFolder)`. It works as the "reverse" of stashPartitions: it will attempt to create a partition under the dataset base path for each partition path in `partitionPaths`, where each partition contains all the data files from the corresponding partition folder in the `backupFolder`. For example, `restorePartitions(['datestr=2023-01-01'], "/backup/folder")` will create `/<basepath>/datestr=2023-01-01` containing all the data files from `/backup/folder/datestr=2023-01-01`. If a given partition folder in the dataset was already restored, it will be skipped. It will return a `HoodieWriteMetadata` that lists each partition and whether it "succeeded" in restoring or was "skipped".

Assumptions:

- Once the user attempts to stash a partition, it must never be written to again.
- Once a partition folder in the `backupFolder` has been part of a successful `restorePartitions`, it can be deleted.
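To make the intended semantics concrete, here is a minimal sketch of the two operations on a local filesystem using `java.nio.file`. This is purely illustrative: the real implementation would go through Hudi's storage abstraction, copy only the latest committed file slices, and return a `HoodieWriteMetadata`; here a plain map stands in for the per-partition "succeeded"/"skipped" result, partitions are assumed to contain flat files, and all names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: models the stash/restore semantics on a local
// filesystem. Not the proposed Hudi implementation.
public class StashSketch {

    // Stash each partition into backupFolder, then remove it from the base path.
    // A partition that already has a backup folder is skipped.
    public static Map<String, String> stashPartitions(
            Path basePath, List<String> partitionPaths, Path backupFolder) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        for (String partition : partitionPaths) {
            Path src = basePath.resolve(partition);
            Path dst = backupFolder.resolve(partition);
            if (Files.exists(dst)) {          // already stashed earlier
                result.put(partition, "skipped");
                continue;
            }
            // createDirectories also handles nested partitions like hour=x/minute=y
            Files.createDirectories(dst);
            try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
                for (Path f : files) {
                    if (Files.isRegularFile(f)) {
                        Files.copy(f, dst.resolve(f.getFileName()));
                    }
                }
            }
            // Immediate cleanup: remove all source files, then the folder itself.
            try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
                for (Path f : files) {
                    Files.delete(f);
                }
            }
            Files.delete(src);
            result.put(partition, "succeeded");
        }
        return result;
    }

    // Reverse operation: copy files from backupFolder back under the base path.
    // A partition that already exists in the dataset is skipped.
    public static Map<String, String> restorePartitions(
            Path basePath, List<String> partitionPaths, Path backupFolder) throws IOException {
        Map<String, String> result = new LinkedHashMap<>();
        for (String partition : partitionPaths) {
            Path src = backupFolder.resolve(partition);
            Path dst = basePath.resolve(partition);
            if (Files.exists(dst)) {          // already restored earlier
                result.put(partition, "skipped");
                continue;
            }
            Files.createDirectories(dst);
            try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
                for (Path f : files) {
                    if (Files.isRegularFile(f)) {
                        Files.copy(f, dst.resolve(f.getFileName()));
                    }
                }
            }
            result.put(partition, "succeeded");
        }
        return result;
    }
}
```

Note how the skip check makes both operations idempotent on retry, which matters for the rollback and heartbeat requirements below.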
Requirements:

- **Data consistency.** Upon a successful stash, data files from deleted partitions should be removed from internal metadata like the MDT/indexes (similar to other Hudi operations), and queries against the partition should return no data.
- **Usability.** Stashing should recursively create folders in `backupFolder` as needed for partitions with nested folders, like `hour=x/minute=y`.
- **Immediate stash cleanup.** When a stash operation commits, all files from the dataset partition (including Hudi internal files like `.hoodie_partition_metadata`) should have been deleted.
- **Fast rename for supported filesystems.** For DFS like HDFS which support atomic rename of folders without copies, we should use the DFS `rename` APIs to avoid having to wait for each file to be copied.
- **Failures and rollbacks.** If the operation fails after creating a plan, it should eventually be rolled back by a rollback call (as part of clean's rollback of failed writes). The rollback implementation should consist of "undoing" all the DFS operations: any partitions that were attempted to be stashed should still have their (latest) data files, and any partitions attempted to be restored should still remain empty.
- **Retries and concurrent writes.** Before scheduling the plan, `stashPartitions` should start a heartbeat and check whether there are any inflight writes targeting the same partition. It should attempt a rollback of any other `stashPartitions` plans with expired heartbeats. If any inflight instants still remain after that, it should raise an exception.
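The "fast rename" requirement can be sketched as rename-first-with-copy-fallback. The JDK's `Files.move` on a local filesystem plays the role that `FileSystem.rename` would play on HDFS; object stores without atomic rename would take the copy-then-delete path. Again, names and structure are illustrative, not the proposed implementation:

```java
import java.io.IOException;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Illustrative sketch: move a whole partition folder, preferring an atomic
// rename (no byte copies) and falling back to copy-then-delete.
public class FastMoveSketch {
    public static void movePartition(Path src, Path dst) throws IOException {
        Files.createDirectories(dst.getParent());  // recursive parent creation
        try {
            // On HDFS the analogous call is FileSystem.rename(src, dst), which
            // relinks the folder without copying any data.
            Files.move(src, dst, StandardCopyOption.ATOMIC_MOVE);
        } catch (AtomicMoveNotSupportedException e) {
            // Fallback for stores without atomic folder rename: copy each file,
            // delete the source as we go, then remove the empty source folder.
            Files.createDirectories(dst);
            try (DirectoryStream<Path> files = Files.newDirectoryStream(src)) {
                for (Path f : files) {
                    Files.copy(f, dst.resolve(f.getFileName()));
                    Files.delete(f);
                }
            }
            Files.delete(src);
        }
    }
}
```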
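The pre-scheduling concurrency check in the last requirement can be sketched as: partition the inflight plans into (a) same-operation plans with expired heartbeats, which are eligible for rollback, and (b) everything else, which conflicts if it touches any of our partitions. The types and timeout below are hypothetical stand-ins for Hudi's timeline instants and heartbeat mechanism:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the inflight-write check before scheduling a
// stashPartitions/restorePartitions plan. Not Hudi's actual timeline API.
public class ConcurrencyCheckSketch {

    // Hypothetical stand-in for an inflight instant on the timeline.
    record InflightPlan(String instantTime, String operationType,
                        Set<String> partitions, Instant lastHeartbeat) {}

    // Returns the instant times that still conflict after expired same-type
    // plans have been set aside for rollback. A non-empty result means the
    // caller should raise an exception instead of scheduling its plan.
    static List<String> checkAndCollectConflicts(
            List<InflightPlan> inflight, String myOperationType,
            Set<String> myPartitions, Duration heartbeatTimeout, Instant now) {
        List<String> conflicting = new ArrayList<>();
        for (InflightPlan plan : inflight) {
            boolean expired = Duration.between(plan.lastHeartbeat(), now)
                    .compareTo(heartbeatTimeout) > 0;
            if (expired && plan.operationType().equals(myOperationType)) {
                // Eligible for rollback: same operation type, heartbeat expired.
                // A real implementation would roll this plan back here.
                continue;
            }
            if (!Collections.disjoint(plan.partitions(), myPartitions)) {
                conflicting.add(plan.instantTime());  // still inflight: must abort
            }
        }
        return conflicting;
    }
}
```

Restricting rollback to same-type plans matches the requirement that `restorePartitions` only rolls back other `restorePartitions` instants, leaving unrelated writers to their own failure handling.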
The same behavior should apply to `restorePartitions`, except that in its case it should only attempt rollback of other `restorePartitions` instants.

**Why this task is needed:**

For our use case, when we apply TTL to older partitions of datasets we need to:

- Stash them to a separate location for a grace period, in case users request them to be added back.
- Ensure that we have an API that will synchronously remove all DFS "objects" in the TTL-ed partition folder, since even disregarding storage space, we need to "clean up" objects/inodes.

### Task Type

Other

### Related Issues

**Parent feature issue:** (if applicable)

**Related issues:**

NOTE: Use `Relationships` button to add parent/blocking issues after issue is created.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
