Yue Zhang created HUDI-3038:
-------------------------------
Summary: Comprehensive mechanism around cleaning the archived
timeline
Key: HUDI-3038
URL: https://issues.apache.org/jira/browse/HUDI-3038
Project: Apache Hudi
Issue Type: Improvement
Reporter: Yue Zhang
At present, Hoodie's archive files grow indefinitely, which is a more serious
problem on DFS implementations that do not support append.
Since PR https://github.com/apache/hudi/pull/4078, users have a way to trim the
archive files so that they do not keep growing indefinitely.
But as the documentation says: *WARNING: do not use this config unless you know what
you're doing. If enabled, details of older archived instants are deleted,
resulting in information loss in the archived timeline, which may affect tools
like CLI and repair. Only enable this if you hit severe performance issues for
retrieving archived timeline.*
So we need a more comprehensive mechanism around cleaning the archived timeline.
(1) Rewrite the archived timeline content into a smaller number of files
(2) When deleting the archived files, make sure the table does not have any
corresponding base or log files from the contained instants, so there is
essentially no information loss of the table states.
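The two steps above can be sketched roughly as follows. This is only an illustration of the idea, not Hudi code: the class ArchiveCleanSketch, the methods planMerge and deletableFiles, and the file names such as "archive.1" are all hypothetical, and the set of instants still referenced by base or log files is assumed to be computed elsewhere.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; none of these names exist in Hudi.
final class ArchiveCleanSketch {

    // Step (1): group archive files (assumed ordered oldest first) into
    // batches of at most batchSize. Each batch would then be rewritten
    // into a single larger archive file.
    static List<List<String>> planMerge(List<String> archiveFiles, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < archiveFiles.size(); i += batchSize) {
            int end = Math.min(i + batchSize, archiveFiles.size());
            batches.add(new ArrayList<>(archiveFiles.subList(i, end)));
        }
        return batches;
    }

    // Step (2): an archive file may be deleted only if none of the instants
    // it contains is still referenced by a base or log file in the table.
    static List<String> deletableFiles(Map<String, Set<String>> instantsByFile,
                                       Set<String> referencedInstants) {
        List<String> deletable = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : instantsByFile.entrySet()) {
            if (Collections.disjoint(entry.getValue(), referencedInstants)) {
                deletable.add(entry.getKey());
            }
        }
        return deletable;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("archive.1", "archive.2", "archive.3");
        System.out.println(planMerge(files, 2));

        Map<String, Set<String>> instantsByFile = new LinkedHashMap<>();
        instantsByFile.put("archive.1", new HashSet<>(Arrays.asList("001", "002")));
        instantsByFile.put("archive.2", new HashSet<>(Arrays.asList("003")));
        Set<String> referenced = new HashSet<>(Arrays.asList("003", "004"));
        System.out.println(deletableFiles(instantsByFile, referenced));
    }
}
```

In this sketch, a file containing even one instant that is still referenced is kept whole. That is the conservative choice implied by the "no information loss" requirement in (2).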
Since these two operations are fairly heavy, we could build a standalone tool
for them instead of an internal service.