[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984047#comment-16984047
 ] 

Raymond Xu commented on HUDI-309:
---------------------------------

[~nicholasjiang] I think the design in "Description" is meant for treating 
<commitTime, metadata> pair as an event (like a message consumed from Kafka) 
and then append it to some metadata log files, which will then be compacted 
into HFile for future query purpose. I think for the same active timeline, 
commitTime is the natural identifier for commit metadata. ([~vinoth] [~vbalaji] 
please correct me if I misunderstand it)

So far my concern with this design is the action-type partitioning: since the 
action types are fixed to a small number, the files under each type partition 
will keep growing. Eventually, if too many files accumulate under a particular 
partition (say "commit/"), would that cause issues when scanning?

How about, as [~nicholasjiang] also points out, partitioning by "commitTime" 
(converted to "yyyy/MM/dd" or "yyyy/MM", or made configurable)? That would put 
an upper bound on the number of files per partition.
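For illustration, a minimal Java sketch of deriving such a time-based partition path from a commit time (the class name and the configurable-format parameter are hypothetical, not Hudi APIs; it assumes the yyyyMMddHHmmss instant-time format):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class ArchivePartitioner {

    // Assumes Hudi instant times are formatted as yyyyMMddHHmmss.
    private static final DateTimeFormatter COMMIT_FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Derive a partition path such as "2019/11/28"; the granularity
    // ("yyyy/MM/dd" vs "yyyy/MM") is the configurable knob discussed above.
    public static String partitionFor(String commitTime, String partitionFormat) {
        LocalDateTime ts = LocalDateTime.parse(commitTime, COMMIT_FMT);
        return ts.format(DateTimeFormatter.ofPattern(partitionFormat));
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("20191128093000", "yyyy/MM/dd")); // 2019/11/28
        System.out.println(partitionFor("20191128093000", "yyyy/MM"));    // 2019/11
    }
}
```

With daily granularity, the file count per partition is bounded by the number of commits in a day, rather than by the table's whole history.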

> General Redesign of Archived Timeline for efficient scan and management
> -----------------------------------------------------------------------
>
>                 Key: HUDI-309
>                 URL: https://issues.apache.org/jira/browse/HUDI-309
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: Common Core
>            Reporter: Balaji Varadarajan
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.5.1
>
>         Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived 
> Timeline Notes by Vinoth 2.jpg
>
>
> As designed by Vinoth:
> Goals
>  # Archived Metadata should be scannable in the same way as data
>  # Provides more safety by always serving committed data independent of the 
> timeframe in which the corresponding commit action was attempted. Currently, we 
> implicitly assume a data file to be valid if its commit time is older than 
> the earliest time in the active timeline. While this works okay, any inherent 
> bugs in rollback could inadvertently expose a possibly duplicate file once 
> its commit timestamp becomes older than that of any commit in the timeline.
>  # We had to deal with a lot of corner cases because of the way we treat a 
> "commit" as special after it gets archived. Examples include the Savepoint 
> handling logic in the cleaner.
>  # Small Files: For cloud stores, archiving simply moves files from one 
> directory to another, causing the archive folder to grow. We need a way to 
> efficiently compact these files while remaining friendly to scans.
> Design:
>  The basic file-group abstraction for managing file versions of data files 
> can be extended to managing archived commit metadata. The idea is to use an 
> optimal format (like HFile) for storing compacted versions of <commitTime, 
> Metadata> pairs. Every archiving run will read <commitTime, Metadata> pairs 
> from the active timeline and append them to indexable log files. We will run 
> periodic minor compactions to merge multiple log files into a compacted HFile 
> storing metadata for a time range. It should also be noted that we will 
> partition by action type (commit/clean). This design would allow the archived 
> timeline to be queried to determine whether a timeline is valid or not.
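The log-append-then-compact flow described in the Design section can be sketched in Java as follows (all class and method names here are illustrative, not Hudi's actual APIs; an in-memory TreeMap stands in for the sorted, compacted HFile):

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

public class ArchivedTimelineSketch {

    // Append-only "log files", partitioned by action type (commit/clean/...);
    // each entry is a <commitTime, metadata> pair, like an event off a queue.
    private final Map<String, List<Map.Entry<String, String>>> logsByAction = new HashMap<>();

    // Compacted store per action type, sorted by commitTime
    // (a stand-in for the HFile covering a time range).
    private final Map<String, TreeMap<String, String>> compactedByAction = new HashMap<>();

    // Archiving run: append a <commitTime, metadata> pair to the action's log.
    public void append(String actionType, String commitTime, String metadata) {
        logsByAction.computeIfAbsent(actionType, k -> new ArrayList<>())
                    .add(new AbstractMap.SimpleEntry<>(commitTime, metadata));
    }

    // Minor compaction: fold the pending log entries into the sorted store.
    public void compact(String actionType) {
        TreeMap<String, String> store =
                compactedByAction.computeIfAbsent(actionType, k -> new TreeMap<>());
        for (Map.Entry<String, String> e :
                logsByAction.getOrDefault(actionType, List.of())) {
            store.put(e.getKey(), e.getValue()); // commitTime is the unique key
        }
        logsByAction.remove(actionType);
    }

    // Point lookup by commitTime, e.g. to check whether an instant ever committed.
    public Optional<String> lookup(String actionType, String commitTime) {
        TreeMap<String, String> store = compactedByAction.get(actionType);
        return store == null ? Optional.empty()
                             : Optional.ofNullable(store.get(commitTime));
    }
}
```

Keeping the compacted store sorted by commitTime is what makes the archived timeline scannable like data, and the lookup path is what lets validity of a commit be answered from the archive rather than inferred from the active timeline's earliest instant.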



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
