[
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16971747#comment-16971747
]
Balaji Varadarajan commented on HUDI-309:
-----------------------------------------
[https://github.com/apache/incubator-hudi/blob/23b303e4b17c5f7b603900ee5b0d2e6718118014/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java#L860]
{code:java}
{code:java}
if (!table.getActiveTimeline().getCleanerTimeline().empty()) {
  logger.info("Cleaning up older rollback meta files");
  // Cleanup of older cleaner meta files
  // TODO - make the commit archival generic and archive rollback metadata
  FSUtils.deleteOlderRollbackMetaFiles(fs,
      table.getMetaClient().getMetaPath(),
      table.getActiveTimeline().getRollbackTimeline().getInstants());
}
{code}
As part of PR-942, the above code was removed because the cleanup is handled
elsewhere. Just noting that we need to ensure cleaner commits are also handled
correctly during archiving.
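For reference, the archived-timeline design described in this issue (each archiving run appends <commitTime, Metadata> pairs to per-action log files, with periodic minor compactions merging them into a sorted, range-scannable store) could be sketched roughly as below. This is a minimal in-memory illustration, not the Hudi implementation: all class and member names are hypothetical, plain maps stand in for log files and HFiles, and the compaction threshold is an assumed knob.

```java
import java.util.*;

// Hedged sketch of the proposed archived-timeline layout: metadata is
// partitioned by action type (commit/clean), buffered in append-only
// "log" lists, and minor-compacted into a sorted store for range scans.
public class ArchivedTimelineSketch {
    // Hypothetical stand-ins: per-action log entries and compacted stores.
    private final Map<String, List<Map.Entry<String, String>>> logsByAction = new HashMap<>();
    private final Map<String, TreeMap<String, String>> compactedByAction = new HashMap<>();
    private static final int COMPACTION_THRESHOLD = 3; // assumed knob

    // An archiving run appends one <commitTime, metadata> pair, partitioned
    // by action type, as in the design notes.
    public void archive(String action, String commitTime, String metadata) {
        logsByAction.computeIfAbsent(action, a -> new ArrayList<>())
                    .add(Map.entry(commitTime, metadata));
        if (logsByAction.get(action).size() >= COMPACTION_THRESHOLD) {
            compact(action);
        }
    }

    // Minor compaction: merge the buffered log entries into the sorted
    // store, keeping the archived timeline scannable by commit time.
    private void compact(String action) {
        TreeMap<String, String> target =
            compactedByAction.computeIfAbsent(action, a -> new TreeMap<>());
        for (Map.Entry<String, String> e : logsByAction.remove(action)) {
            target.put(e.getKey(), e.getValue());
        }
    }

    // The archived timeline stays queryable: was this commit completed?
    public boolean isCommitted(String action, String commitTime) {
        List<Map.Entry<String, String>> pending = logsByAction.get(action);
        if (pending != null
                && pending.stream().anyMatch(e -> e.getKey().equals(commitTime))) {
            return true;
        }
        TreeMap<String, String> compacted = compactedByAction.get(action);
        return compacted != null && compacted.containsKey(commitTime);
    }
}
```

The point of the sketch is the query at the end: instead of implicitly trusting any file older than the active timeline, validity is answered by a lookup against the archived metadata itself.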
> General Redesign of Archived Timeline for efficient scan and management
> -----------------------------------------------------------------------
>
> Key: HUDI-309
> URL: https://issues.apache.org/jira/browse/HUDI-309
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Components: Common Core
> Reporter: Balaji Varadarajan
> Assignee: Vinoth Chandar
> Priority: Major
> Fix For: 0.5.1
>
> Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived
> Timeline Notes by Vinoth 2.jpg
>
>
> As designed by Vinoth:
> Goals
> # Archived Metadata should be scannable in the same way as data
> # Provides more safety by always serving committed data independent of
> timeframe when the corresponding commit action was tried. Currently, we
> implicitly assume a data file to be valid if its commit time is older than
> the earliest time in the active timeline. While this works ok, any inherent
> bugs in rollback could inadvertently expose a possibly duplicate file when
> its commit timestamp becomes older than that of any commits in the timeline.
> # We had to deal with a lot of corner cases because of the way we treat a
> "commit" as special after it gets archived. Examples include the
> savepoint-handling logic in the cleaner.
> # Small files: for cloud stores, archiving simply moves files from one
> directory to another, causing the archive folder to grow. We need a way to
> efficiently compact these files while remaining friendly to scans.
> Design:
> The basic file-group abstraction used to manage versions of data files can
> be extended to manage archived commit metadata. The idea is to use an
> optimal format (like HFile) for storing compacted versions of <commitTime,
> Metadata> pairs. Every archiving run will read <commitTime, Metadata> pairs
> from the active timeline and append them to indexable log files. We will
> run periodic minor compactions to merge multiple log files into a compacted
> HFile storing metadata for a time range. Note also that we will partition
> by action type (commit/clean). This design would allow the archived
> timeline to be queried to determine whether a given commit is valid or not.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)