hudi-bot opened a new issue, #14959:
URL: https://github.com/apache/hudi/issues/14959

   At present, Hudi's archive file grows indefinitely, which is especially 
problematic on DFS implementations that do not support append.
   After PR https://github.com/apache/hudi/pull/4078, users now have a way to 
trim the archived files so they no longer grow without bound.
   However, as the documentation warns: *WARNING: do not use this config unless 
you know what you're doing. If enabled, details of older archived instants are 
deleted, resulting in information loss in the archived timeline, which may 
affect tools like CLI and repair. Only enable this if you hit severe 
performance issues for retrieving archived timeline.*
   
   So we need a more comprehensive mechanism for cleaning the archived 
timeline:
   
   (1) Rewrite the archived timeline content into a smaller number of files
   (2) When deleting archived files, ensure the table no longer has any base 
or log files corresponding to the contained instants, so there is essentially 
no loss of table-state information.
   
   Since these two operations are fairly heavy, perhaps we should build them as 
a standalone tool rather than an internal table service.
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-3038
   - Type: Improvement

