kbuci commented on issue #17866:
URL: https://github.com/apache/hudi/issues/17866#issuecomment-3809295856

   After thinking about it some more, I still think `immediate stash 
cleanup` (ensuring the original data in the DFS partition and in the 
MDT/indexes is deleted when stashing is complete) is a hard requirement for 
the stash API, and that we should not rely on HUDI clean to do the actual 
deletion later, regardless of whether we put data into the stashed folder via 
DFS rename or a regular copy. The reason is that if stash jobs are the only 
writes to the dataset and the cleaner policy is "retain last n commits", then 
the partitions in the last n stash commits will never have their data 
cleaned up. So, to ensure there isn't a large, unbounded delay between a stash 
operation completing and the partition data being removed from the dataset, I 
think it's safe to decouple stashing from HUDI clean.
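
To make the cleaner scenario concrete, here is a toy model in Python. It is only a sketch and not Hudi's actual cleaner logic: the function name `uncleaned_partitions` and the partition paths are hypothetical, and it assumes a timeline that contains nothing but stash commits, one partition per commit, with a "retain last n commits" policy.

```python
from typing import List


def uncleaned_partitions(stash_commits: List[str], retained_commits: int) -> List[str]:
    """Return the stashed partitions whose original data is still on storage.

    Toy assumption: each entry in `stash_commits` is the partition stashed by
    one commit, oldest first, and the cleaner only removes data belonging to
    commits that have fallen out of the last `retained_commits` commits.
    """
    if retained_commits <= 0:
        return []
    # The newest `retained_commits` commits are inside the retention window,
    # so the data they replaced is never eligible for cleaning.
    return stash_commits[-retained_commits:]


# Five stash jobs and no other writers on the dataset (hypothetical paths).
timeline = ["2024/01", "2024/02", "2024/03", "2024/04", "2024/05"]
lingering = uncleaned_partitions(timeline, retained_commits=3)
print(lingering)
```

Under these assumptions, the three most recently stashed partitions keep their original data until some later commit pushes them out of the retention window, and without other writers that never happens.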


