[jira] [Created] (HUDI-3038) Comprehensive mechanism around cleaning the archived timeline

Yue Zhang (Jira) Thu, 16 Dec 2021 03:09:06 -0800

Yue Zhang created HUDI-3038:
-------------------------------

             Summary: Comprehensive mechanism around cleaning the archived 
timeline
                 Key: HUDI-3038
                 URL: https://issues.apache.org/jira/browse/HUDI-3038
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: Yue Zhang



At present, Hoodie's archive files grow indefinitely, which is especially 
serious on a DFS that does not support append.
After PR https://github.com/apache/hudi/pull/4078, users have a way to trim 
the archive files so that they do not keep growing indefinitely.
But as the documentation says: *WARNING: do not use this config unless you know what 
you're doing. If enabled, details of older archived instants are deleted, 
resulting in information loss in the archived timeline, which may affect tools 
like CLI and repair. Only enable this if you hit severe performance issues for 
retrieving archived timeline.*

So we need a more comprehensive mechanism around cleaning the archived timeline.

(1) Rewrite the archived timeline content into a smaller number of files
(2) When deleting the archived files, make sure the table does not have any 
corresponding base or log files from the contained instants, so there is 
essentially no information loss of the table states.
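
A minimal sketch of how the two steps could be planned, written in Python for 
brevity and not taken from the Hudi code base. Every name here (ArchiveFile, 
plan_merge_batches, deletable_archive_files, the file names and the size 
limit) is hypothetical. It assumes the tool can list, for each archive file, 
its size and the instant times it contains, and can collect the set of instant 
times still referenced by base or log files in the table.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class ArchiveFile:
    """Hypothetical stand-in for one archived-timeline file."""
    name: str
    size_bytes: int
    instant_times: FrozenSet[str]  # instants whose details this file holds


def plan_merge_batches(files: List[ArchiveFile],
                       max_batch_bytes: int) -> List[List[str]]:
    """Step (1): group consecutive files into batches to rewrite as one file.

    Files are taken in the given order, so instants stay in timeline order.
    A file larger than the limit ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for f in files:
        # Close the running batch if adding this file would exceed the limit.
        if current and current_size + f.size_bytes > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(f.name)
        current_size += f.size_bytes
    if current:
        batches.append(current)
    return batches


def deletable_archive_files(files: List[ArchiveFile],
                            live_instant_times: Set[str]) -> List[str]:
    """Step (2): a file is safe to delete only if none of its instants is
    still referenced by a base or log file in the table."""
    return [f.name for f in files
            if f.instant_times.isdisjoint(live_instant_times)]
```

For example, with archive files holding instants {001, 002} and {003}, and 
base or log files still referencing instant 003, only the first file would be 
reported as deletable. Keeping planning separate from the actual rewrite and 
delete makes it possible to review the plan before anything is removed.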

Since these two operations are pretty heavy, maybe we could build a new 
standalone tool instead of an inline service to make this happen.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
