[ https://issues.apache.org/jira/browse/HUDI-80?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-80:
-------------------------------
    Status: Patch Available  (was: In Progress)

> Incrementalize cleaning based on timeline metadata
> --------------------------------------------------
>
>                 Key: HUDI-80
>                 URL: https://issues.apache.org/jira/browse/HUDI-80
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Performance, Write Client
>            Reporter: Vinoth Chandar
>            Assignee: Balaji Varadarajan
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.1
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, cleaning lists all partitions once and then picks the file groups
> to clean from DFS. This is partly because cleaning also supports retaining the
> last X versions of a file group (in addition to the default mode of retaining
> the last X commits). Listing every partition can be expensive in some cases. See
> [https://github.com/apache/incubator-hudi/issues/613] for a reported issue.
>
> This task tracks work to
> * Determine if we can get rid of the last-X-versions cleaning mode
> * Implement cleaning based on file metadata in the Hudi timeline itself
> * Resulting RPC calls to DFS would be O(number of file groups
> cleaned)/O(number of partitions touched in last X commits)
>
> HUDI-1 implements a timeline service for writing that promotes caching of
> file system metadata. This can be implemented on top of that.
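
The idea in the task list can be illustrated with a minimal sketch. It is not Hudi's implementation or API: `Commit`, `partitions_touched`, and the sample partition paths are hypothetical names chosen for illustration, and it assumes each commit's timeline metadata records the partitions that commit wrote to. Under that assumption, the cleaner only needs to visit the partitions touched by the last X commits rather than listing every partition on DFS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for one commit's metadata on the timeline."""
    instant: str           # sortable commit time, e.g. "003"
    partitions: frozenset  # partitions this commit wrote to


def partitions_touched(timeline, last_x):
    """Return the sorted partitions written by the most recent `last_x` commits.

    Only these partitions are candidates for cleaning, so the work scales
    with the partitions touched rather than with all partitions in the table.
    """
    if last_x <= 0:
        return []
    recent = sorted(timeline, key=lambda c: c.instant)[-last_x:]
    touched = set()
    for commit in recent:
        touched |= commit.partitions
    return sorted(touched)


# Illustrative data only.
timeline = [
    Commit("001", frozenset({"2019/01/01", "2019/01/02"})),
    Commit("002", frozenset({"2019/01/02"})),
    Commit("003", frozenset({"2019/01/03"})),
]
candidates = partitions_touched(timeline, last_x=2)
```

Here `candidates` contains only the partitions written by commits "002" and "003"; partition "2019/01/01" is never listed because no recent commit touched it.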