sivabalan narayanan created HUDI-4216:
-----------------------------------------

             Summary: Add support for infinite retention of data files with 
archival enabled 
                 Key: HUDI-4216
                 URL: https://issues.apache.org/jira/browse/HUDI-4216
             Project: Apache Hudi
          Issue Type: Improvement
          Components: archiving
            Reporter: sivabalan narayanan


Hudi can support infinite retention of data files with archival enabled. This 
is a good fit for users who want to query a Hudi table as of any point in 
the past. 

 

How to achieve: 

- Disable cleaner completely. 

- Enable archival as usual. 

- Enable the metadata table so that file listing scales well. 

Users can then run time travel queries against Hudi with "as.of.instant" set 
to any instant in the past. 
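
A minimal sketch of such a time travel read from PySpark, using Hudi's 
"as.of.instant" read option. The helper name read_as_of, the table path, and 
the instant value are hypothetical and only illustrate the shape of the call; 
a SparkSession with the Hudi bundle on the classpath is assumed.

```python
def read_as_of(spark, base_path, instant):
    """Return a DataFrame of the Hudi table at base_path as of `instant`.

    `spark` is assumed to be a SparkSession with the Hudi bundle available.
    `instant` is a Hudi instant time string, for example "20220608120000".
    """
    return (
        spark.read.format("hudi")
        .option("as.of.instant", instant)  # time travel read option
        .load(base_path)
    )


# Hypothetical usage; the path and the instant are placeholders:
# df = read_as_of(spark, "/tmp/hudi/trips", "20220608120000")
# df.show()
```

With the cleaner disabled, older file slices are never deleted, so reads like 
this have data files to resolve against for any past instant.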

 

With this, users can retain all data for a year or more and still query any 
snapshot in the past. This comes with additional storage cost, but if users 
are willing to bear it, we should be able to support them. 

 

Disabling the cleaner: 

  option("hoodie.clean.automatic","false").
  option("hoodie.clean.async","true").
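
As a rough sketch, the full set of writer options for this setup could be 
collected in one place. The two clean options are copied from this ticket. 
"hoodie.archive.automatic" and "hoodie.metadata.enable" are my assumption of 
how to express "enable archival" and "enable metadata table", and may be 
defaults already depending on the Hudi version. The names 
INFINITE_RETENTION_OPTIONS and with_options are hypothetical.

```python
# Writer options for infinite retention, as a plain dict of strings.
INFINITE_RETENTION_OPTIONS = {
    # From the ticket: disable the cleaner.
    "hoodie.clean.automatic": "false",
    "hoodie.clean.async": "true",
    # Assumption: keep archival of the timeline on, as usual.
    "hoodie.archive.automatic": "true",
    # Assumption: enable the metadata table so file listing scales.
    "hoodie.metadata.enable": "true",
}


def with_options(writer, options):
    """Apply each key/value pair through writer.option(...) and return the result."""
    for key, value in options.items():
        writer = writer.option(key, value)
    return writer


# Hypothetical usage with a Spark DataFrameWriter. A real write also needs
# the table name, record key, and similar settings; base_path is a placeholder:
# with_options(df.write.format("hudi"), INFINITE_RETENTION_OPTIONS) \
#     .mode("append").save(base_path)
```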

 

Things to fix:

Replaced file groups can become active file groups again once the archiver 
removes the corresponding replace commit from the active timeline. For 
example, if clustering replaced FG_1 and FG_2, HoodieTableFileSystemView 
loads all file groups and then filters out the replaced ones. FG_1 and FG_2 
are deduced as replaced only if the view finds a replace commit covering 
them in the active timeline. 

In the regular flow, the cleaner deletes those file groups, so the timeline 
files no longer matter after that. Here, since the cleaner is completely 
disabled, the replaced file groups stay on storage after their replace commit 
is archived, so we need to fix this. 
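
The problem can be illustrated with a toy model. This is not Hudi code: the 
ToyTable class, its fields, the instant "003", and the extra file group FG_3 
(standing in for the clustering output) are all hypothetical. It only mimics 
the behavior described above, where replaced file groups are filtered out 
using replace commits found in the active timeline.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a table and its timeline; not Hudi's actual classes."""
    file_groups: set             # every file group still present on storage
    active_replace_commits: dict  # instant -> set of file groups it replaced
    archived_replace_commits: dict = field(default_factory=dict)

    def archive(self, instant):
        """Move a replace commit from the active timeline to the archive."""
        replaced = self.active_replace_commits.pop(instant)
        self.archived_replace_commits[instant] = replaced

    def visible_file_groups(self):
        """Filter out file groups replaced by commits in the active timeline only."""
        replaced = set()
        for groups in self.active_replace_commits.values():
            replaced |= groups
        return self.file_groups - replaced


# Clustering at instant "003" replaced FG_1 and FG_2 with FG_3.
# The cleaner is disabled, so FG_1 and FG_2 are never deleted from storage.
table = ToyTable(
    file_groups={"FG_1", "FG_2", "FG_3"},
    active_replace_commits={"003": {"FG_1", "FG_2"}},
)

before = sorted(table.visible_file_groups())  # ['FG_3']
table.archive("003")
after = sorted(table.visible_file_groups())   # ['FG_1', 'FG_2', 'FG_3']
print(before, after)
```

In the model, FG_1 and FG_2 reappear as soon as the replace commit leaves the 
active timeline, which is the behavior that needs to be fixed.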

--
This message was sent by Atlassian Jira
(v8.20.7#820007)