Balaji Varadarajan created HUDI-309:
---------------------------------------
Summary: General Redesign of Archived Timeline for efficient scan
and management
Key: HUDI-309
URL: https://issues.apache.org/jira/browse/HUDI-309
Project: Apache Hudi (incubating)
Issue Type: New Feature
Components: Common Core
Reporter: Balaji Varadarajan
As designed by Vinoth:
Goals
# Archived Metadata should be scannable in the same way as data
# Provides more safety by always serving committed data, independent of
when the corresponding commit action was attempted. Currently, we
implicitly assume a data file is valid if its commit time is older than the
earliest time in the active timeline. While this works in practice, any bug in
rollback could inadvertently expose a duplicate file once its commit
timestamp becomes older than that of every commit in the active timeline.
# We have had to deal with a lot of corner cases because of the way we treat a
"commit" as special after it gets archived; one example is the savepoint
handling logic in the cleaner.
# Small files: on cloud stores, archiving simply moves files from one
directory to another, causing the archive folder to accumulate many small
files. We need a way to compact these files efficiently while remaining
friendly to scans.
Design:
The basic file-group abstraction used to manage versions of data files can
be extended to manage archived commit metadata. The idea is to use a
scan-optimized format (such as HFile) to store a compacted version of
<commitTime, Metadata> pairs. Every archiving run reads <commitTime, Metadata>
pairs from the active timeline and appends them to indexable log files.
Periodic minor compactions then merge multiple log files into a compacted
HFile storing metadata for a time range. Note that we will also partition by
action type (commit/clean). This design makes the archived timeline
queryable, so readers can determine whether a given commit is valid.
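The append/compact/lookup flow above can be sketched in a few lines of Java. This is only an illustrative in-memory model, not Hudi code: the class and method names are hypothetical, a sorted TreeMap stands in for the HFile, and each action type gets its own "file group" of log files.

```java
import java.util.*;

// Hypothetical sketch of the proposed archived-timeline layout: archiving
// appends <commitTime, metadata> pairs to per-action-type log files, and a
// minor compaction merges those logs into one sorted, scannable file
// (a TreeMap standing in for an HFile). All names are illustrative.
public class ArchivedTimelineSketch {
    // Un-compacted log files, one list per action type (commit, clean, ...).
    private final Map<String, List<TreeMap<String, String>>> logFiles = new HashMap<>();
    // Compacted, time-sorted view per action type (HFile stand-in).
    private final Map<String, TreeMap<String, String>> compacted = new HashMap<>();

    // An archiving run appends a batch of <commitTime, metadata> pairs
    // read from the active timeline as a new log file.
    public void archive(String actionType, Map<String, String> batch) {
        logFiles.computeIfAbsent(actionType, k -> new ArrayList<>())
                .add(new TreeMap<>(batch));
    }

    // Minor compaction: merge all log files for an action type into the
    // compacted file, keeping keys sorted by commit time for range scans.
    public void compact(String actionType) {
        TreeMap<String, String> target =
                compacted.computeIfAbsent(actionType, k -> new TreeMap<>());
        for (TreeMap<String, String> log :
                logFiles.getOrDefault(actionType, List.of())) {
            target.putAll(log);
        }
        logFiles.remove(actionType);
    }

    // Point lookup: lets a reader check whether a given commit exists,
    // instead of assuming "older than the active timeline" means committed.
    public Optional<String> lookup(String actionType, String commitTime) {
        TreeMap<String, String> c = compacted.get(actionType);
        if (c != null && c.containsKey(commitTime)) {
            return Optional.of(c.get(commitTime));
        }
        for (TreeMap<String, String> log :
                logFiles.getOrDefault(actionType, List.of())) {
            if (log.containsKey(commitTime)) {
                return Optional.of(log.get(commitTime));
            }
        }
        return Optional.empty();
    }

    // Range scan over the compacted metadata for a commit-time interval.
    public SortedMap<String, String> scan(String actionType, String from, String to) {
        return compacted.getOrDefault(actionType, new TreeMap<>())
                        .subMap(from, true, to, true);
    }
}
```

The key property the design relies on is visible here: because the compacted file is sorted by commit time and partitioned by action type, both point lookups and time-range scans stay cheap even as the archive grows.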
--
This message was sent by Atlassian Jira
(v8.3.4#803005)