[ https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971747#comment-16971747 ]
Balaji Varadarajan commented on HUDI-309: ----------------------------------------- [https://github.com/apache/incubator-hudi/blob/23b303e4b17c5f7b603900ee5b0d2e6718118014/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java#L860] {code:java} if (!table.getActiveTimeline().getCleanerTimeline().empty()) { logger.info("Cleaning up older rollback meta files"); // Cleanup of older cleaner meta files // TODO - make the commit archival generic and archive rollback metadata FSUtils.deleteOlderRollbackMetaFiles(fs, table.getMetaClient().getMetaPath(), table.getActiveTimeline().getRollbackTimeline().getInstants()); } {code} As part of PR-942, the above code is removed as it is handled elsewhere. Just noting that we need to ensure cleaner commits are also handled correctly for archiving > General Redesign of Archived Timeline for efficient scan and management > ----------------------------------------------------------------------- > > Key: HUDI-309 > URL: https://issues.apache.org/jira/browse/HUDI-309 > Project: Apache Hudi (incubating) > Issue Type: New Feature > Components: Common Core > Reporter: Balaji Varadarajan > Assignee: Vinoth Chandar > Priority: Major > Fix For: 0.5.1 > > Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived > Timeline Notes by Vinoth 2.jpg > > > As designed by Vinoth: > Goals > # Archived Metadata should be scannable in the same way as data > # Provides more safety by always serving committed data independent of > timeframe when the corresponding commit action was tried. Currently, we > implicitly assume a data file to be valid if its commit time is older than > the earliest time in the active timeline. While this works ok, any inherent > bugs in rollback could inadvertently expose a possibly duplicate file when > its commit timestamp becomes older than that of any commits in the timeline. > # We had to deal with lot of corner cases because of the way we treat a > "commit" as special after it gets archived. Examples also include Savepoint > handling logic by cleaner. > # Small Files : For Cloud stores, archiving simply moves fils from one > directory to another causing the archive folder to grow. We need a way to > efficiently compact these files and at the same time be friendly to scans > Design: > The basic file-group abstraction for managing file versions for data files > can be extended to managing archived commit metadata. The idea is to use an > optimal format (like HFile) for storing compacted version of <commitTime, > Metadata> pairs. Every archiving run will read <commitTime, Metadata> pairs > from active timeline and append to indexable log files. We will run periodic > minor compactions to merge multiple log files to a compacted HFile storing > metadata for a time-range. It should be also noted that we will partition by > the action types (commit/clean). This design would allow for the archived > timeline to be queryable for determining whether a timeline is valid or not. -- This message was sent by Atlassian Jira (v8.3.4#803005)