sivabalan narayanan created HUDI-4216:
-----------------------------------------
Summary: Add support for infinite retention of data files with
archival enabled
Key: HUDI-4216
URL: https://issues.apache.org/jira/browse/HUDI-4216
Project: Apache Hudi
Issue Type: Improvement
Components: archiving
Reporter: sivabalan narayanan
Hudi can support infinite retention of data files with archival enabled. This
would be a good fit for users who want to query a Hudi table as of any time in
the past.
How to achieve:
- Disable the cleaner completely.
- Enable archival as usual.
- Enable the metadata table so that file listing scales well.
Users can then query Hudi with "as.of.instant" set to any timestamp in the past.
With this, we can let users retain all data for a year or more and still query
any snapshot in the past. This obviously comes with additional storage cost,
but if users are willing to bear that cost, we should be able to support them.
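A minimal sketch of how a point-in-time read could be parameterized, in Python
for PySpark. It assumes the Spark datasource read option "as.of.instant" and
the yyyyMMddHHmmss instant format; the helper name as_of_read_options is
hypothetical and only builds the option map.

```python
from datetime import datetime


def as_of_read_options(point_in_time: datetime) -> dict:
    """Build read options for a time-travel query (hypothetical helper).

    Assumption: the instant is passed in yyyyMMddHHmmss form.
    """
    return {"as.of.instant": point_in_time.strftime("%Y%m%d%H%M%S")}


read_opts = as_of_read_options(datetime(2022, 6, 8, 12, 30, 0))

# Usage with an existing SparkSession `spark` and table path `base_path`
# (both assumed, not defined here):
# snapshot = spark.read.format("hudi").options(**read_opts).load(base_path)
```

Keeping the option map separate from the Spark call makes it easy to check the
timestamp formatting without a running cluster.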
Disabling the cleaner:
option("hoodie.clean.automatic","false").
option("hoodie.clean.async","true").
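A minimal sketch of how the steps listed under "How to achieve" might be
collected into one set of writer options, in Python for PySpark. The two
cleaner keys are the ones given above. The keys "hoodie.metadata.enable",
"hoodie.keep.min.commits" and "hoodie.keep.max.commits" are assumed here as the
metadata table and archival settings, and the values 20 and 30 are only
illustrative, so they should be checked against the Hudi release in use. The
helper name is hypothetical.

```python
def infinite_retention_write_options(min_commits: int = 20,
                                     max_commits: int = 30) -> dict:
    """Writer options for a no-clean, archival-enabled table (hypothetical helper)."""
    if min_commits >= max_commits:
        raise ValueError("min_commits must be smaller than max_commits")
    return {
        # Cleaner settings exactly as given in the issue description.
        "hoodie.clean.automatic": "false",
        "hoodie.clean.async": "true",
        # Assumed key: metadata table, so file listing does not hit storage directly.
        "hoodie.metadata.enable": "true",
        # Assumed keys: bounds on the active timeline that drive archival.
        "hoodie.keep.min.commits": str(min_commits),
        "hoodie.keep.max.commits": str(max_commits),
    }


write_opts = infinite_retention_write_options()

# Usage with an existing DataFrame `df` and table path `base_path`
# (both assumed, not defined here):
# df.write.format("hudi").options(**write_opts).mode("append").save(base_path)
```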
Things to fix:
Replaced file groups could become active file groups again once the archiver
removes the corresponding replace commit from the active timeline. For example,
if clustering replaced FG_1 and FG_2, HoodieTableFileSystemView loads all file
groups and then filters out the replaced ones. FG_1 and FG_2 are deduced as
replaced only if it finds a replace commit covering FG_1 and FG_2 in the active
timeline.
In the regular flow, the cleaner deletes those file groups, so the timeline
files may not matter after that. But here, since the cleaner is completely
disabled, we need to fix this.
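The problem can be illustrated with a toy model in Python. This is not Hudi's
HoodieTableFileSystemView, only a simplified sketch under the assumption that
replaced file groups are filtered using replace commits found in the active
timeline. ReplaceCommit, visible_file_groups and FG_3 (standing in for the
clustering output) are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplaceCommit:
    """Toy replace commit: an instant plus the file groups it replaced."""
    instant: str
    replaced_file_groups: frozenset


def visible_file_groups(file_groups_on_storage, active_timeline):
    """Return file groups that are not marked replaced by the active timeline."""
    replaced = set()
    for commit in active_timeline:
        replaced |= commit.replaced_file_groups
    return sorted(set(file_groups_on_storage) - replaced)


# With the cleaner disabled, the replaced file groups stay on storage.
on_storage = ["FG_1", "FG_2", "FG_3"]
clustering = ReplaceCommit("003", frozenset({"FG_1", "FG_2"}))

# Replace commit still in the active timeline: FG_1 and FG_2 are hidden.
before_archival = visible_file_groups(on_storage, [clustering])

# Replace commit archived: nothing marks FG_1 and FG_2 as replaced any more.
after_archival = visible_file_groups(on_storage, [])
```

In this model, before_archival contains only FG_3, while after_archival
contains all three file groups, so readers would see the pre-clustering and
post-clustering copies of the same data.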