kbuci commented on issue #17866: URL: https://github.com/apache/hudi/issues/17866#issuecomment-3809295856
After thinking about it some more, I still think `immediate stash cleanup` (ensuring that the original data in the DFS partition and the MDT/indexes is deleted once stashing completes) is a hard requirement for the stash API. We should not rely on HUDI clean to do the actual deletion later, regardless of whether we put data into the stashed folder via a DFS rename or a regular copy. The reason is that if there are no writes to the dataset other than stash jobs and the cleaner policy is "retain last n commits", then the data for the partitions in the last n stash commits will never be deleted/cleaned up. So, to ensure there isn't a large unbounded delay between a stash operation completing and the partition data being removed from the dataset, I think it's safe to decouple stashing from HUDI clean.
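
To make the retention argument concrete, here is a toy Python model of a "retain last n commits" rule. It is only a sketch of the reasoning, not HUDI's actual cleaner: `Commit`, `cleanable_partitions`, and `pending_partitions` are hypothetical names, and the model assumes the cleaner only removes a stashed partition's original data once the stash commit is older than the earliest retained commit.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    """Hypothetical timeline entry; not a HUDI class."""
    instant: int
    stashed_partition: Optional[str] = None  # None models a regular write


def cleanable_partitions(timeline: List[Commit], retain_last_n: int) -> List[str]:
    """Partitions whose stash commit is older than the last n commits.

    Assumption: the cleaner never touches data referenced by the
    last n commits on the timeline.
    """
    if retain_last_n < 1:
        raise ValueError("retain_last_n must be at least 1")
    if len(timeline) <= retain_last_n:
        return []
    earliest_retained = timeline[-retain_last_n].instant
    return [
        c.stashed_partition
        for c in timeline
        if c.stashed_partition is not None and c.instant < earliest_retained
    ]


def pending_partitions(timeline: List[Commit], retain_last_n: int) -> List[str]:
    """Stashed partitions whose original data the cleaner cannot remove yet."""
    cleanable = set(cleanable_partitions(timeline, retain_last_n))
    return [
        c.stashed_partition
        for c in timeline
        if c.stashed_partition is not None and c.stashed_partition not in cleanable
    ]


# A dataset that only ever receives stash commits.
stash_only = [Commit(i, f"p{i}") for i in range(1, 5)]
print(cleanable_partitions(stash_only, retain_last_n=3))  # ['p1']
print(pending_partitions(stash_only, retain_last_n=3))    # ['p2', 'p3', 'p4']
```

In this model, `p2` through `p4` stay pending until further commits land on the timeline. If stash jobs are the only writers, that wait is unbounded for the most recent n stash commits.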
