[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2023-09-07 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762702#comment-17762702
 ] 

Vinoth Chandar commented on HUDI-309:
-

I think we can close this, given PR 9209 has landed?

> General Redesign of Archived Timeline for efficient scan and management
> ---
>
> Key: HUDI-309
> URL: https://issues.apache.org/jira/browse/HUDI-309
> Project: Apache Hudi
>  Issue Type: New Feature
>  Components: Common Core
>Reporter: Balaji Varadarajan
>Assignee: Danny Chen
>Priority: Major
> Fix For: 1.0.0
>
> Attachments: Archive TImeline Notes by Vinoth 1.jpg, Archived 
> Timeline Notes by Vinoth 2.jpg
>
>
> As designed by Vinoth:
> Goals
>  # Archived metadata should be scannable in the same way as data.
>  # Provides more safety by always serving committed data, independent of 
> the timeframe in which the corresponding commit action was attempted. 
> Currently, we implicitly assume a data file to be valid if its commit time 
> is older than the earliest time in the active timeline. While this works 
> ok, any inherent bugs in rollback could inadvertently expose a possibly 
> duplicate file once its commit timestamp becomes older than that of any 
> commit in the timeline.
>  # We had to deal with a lot of corner cases because of the way we treat a 
> "commit" as special after it gets archived. Examples include the 
> savepoint-handling logic in the cleaner.
>  # Small files: for cloud stores, archiving simply moves files from one 
> directory to another, causing the archive folder to grow. We need a way to 
> compact these files efficiently while at the same time staying friendly to 
> scans.
> Design:
>  The basic file-group abstraction for managing file versions of data files 
> can be extended to managing archived commit metadata. The idea is to use an 
> optimal format (like HFile) for storing a compacted version of <CommitTime, 
> Metadata> pairs. Every archiving run will read <CommitTime, Metadata> pairs 
> from the active timeline and append them to indexable log files. We will 
> run periodic minor compactions to merge multiple log files into a compacted 
> HFile storing metadata for a time range. It should also be noted that we 
> will partition by action type (commit/clean). This design would allow the 
> archived timeline to be queried to determine whether an instant is valid or 
> not.
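
The log-append-plus-compaction scheme described above can be sketched in miniature. This is a hypothetical illustration only (class and method names are invented, not Hudi APIs): archived instants are stored as <CommitTime, Metadata> pairs partitioned by action type; archiving runs append batches to "log files", and a minor compaction merges them into one sorted base map standing in for a compacted HFile.

```java
import java.util.*;

// Toy model of the archived-timeline design: per-action-type log batches,
// merged by minor compaction into a sorted base map (an HFile stand-in)
// that supports point lookups and time-range scans. Names are illustrative.
class ArchivedTimelineSketch {
    // action type -> appended log batches of <CommitTime, Metadata> pairs
    private final Map<String, List<SortedMap<String, String>>> logs = new HashMap<>();
    // action type -> compacted base; sorted, so time-range scans are cheap
    private final Map<String, SortedMap<String, String>> base = new HashMap<>();

    // One archiving run: append a batch read from the active timeline.
    void archive(String actionType, SortedMap<String, String> batch) {
        logs.computeIfAbsent(actionType, k -> new ArrayList<>()).add(new TreeMap<>(batch));
    }

    // Minor compaction: merge all pending log batches into the base map.
    void compact(String actionType) {
        SortedMap<String, String> merged = base.computeIfAbsent(actionType, k -> new TreeMap<>());
        for (SortedMap<String, String> batch : logs.getOrDefault(actionType, List.of())) {
            merged.putAll(batch); // later batches win on duplicate commit times
        }
        logs.remove(actionType);
    }

    // Point lookup: is this instant's metadata known to the archive?
    Optional<String> metadataFor(String actionType, String commitTime) {
        for (SortedMap<String, String> batch : logs.getOrDefault(actionType, List.of())) {
            if (batch.containsKey(commitTime)) return Optional.of(batch.get(commitTime));
        }
        SortedMap<String, String> b = base.get(actionType);
        return (b != null && b.containsKey(commitTime))
                ? Optional.of(b.get(commitTime)) : Optional.empty();
    }

    // Range scan over the compacted base, [from, to) by commit time.
    SortedMap<String, String> scan(String actionType, String from, String to) {
        return base.getOrDefault(actionType, new TreeMap<>()).subMap(from, to);
    }
}
```

Partitioning by action type keeps commit and clean metadata in separate file groups, so a query such as "was instant 002 a completed commit?" never has to scan clean records.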



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2023-08-16 Thread Danny Chen (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17755352#comment-17755352
 ] 

Danny Chen commented on HUDI-309:
-

Related PR: https://github.com/apache/hudi/pull/9209



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2020-04-25 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17092430#comment-17092430
 ] 

Vinoth Chandar commented on HUDI-309:
-

Closing the loop here: https://issues.apache.org/jira/browse/HUDI-839 tracks 
this.

Untagging this from the 0.6.0 release.



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2020-03-31 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071842#comment-17071842
 ] 

Vinoth Chandar commented on HUDI-309:
-

Folks, I think we would agree this is a large effort with potential overlap 
with RFC-15. I was thinking about a way to make progress here on this 
specific problem and unblock other projects along the way.

Specific problem: during write operations, we cache the input using Spark 
caching and compute a workload profile for purposes of file sizing etc. We 
also persist this information in the inflight commit/deltacommit file, for 
doing rollbacks. That is, if the write fails midway leaving a .inflight 
commit/deltacommit, then upon the next write we will read the workload 
profile written into the commit/deltacommit and attempt to delete leftover 
files or log rollback blocks into log files to nullify the partial writes we 
might have written. Note that we will not read base or log files that are 
inflight in the active timeline, by checking whether the instant was 
inflight. But if we don't perform any rollback action and enough time 
passes, this instant will be archived, and that's where the trouble is. Once 
an instant goes into the archived timeline today, there is no way to check 
its individual state (inflight vs completed), and this is what the JIRA was 
trying to handle in a generic way, so that the memory caching is not relied 
on in this functionally critical way.

Thinking back, I think we can shelve this JIRA as a longer-term effort and 
use an alternate approach to solve the specific problem above. During each 
write (from the create and merge handles; code is in HoodieTable.java) we 
already write out marker files under .hoodie that correspond 1-1 with each 
file being created or merged today. In case of a partial write, this marker 
file might be left behind (we need to ensure in code that we commit first 
and then delete markers), and we can directly use this to perform the 
rollback. (Note that we need to handle backwards compatibility with existing 
timelines, and also support downgrades to the old ways.)
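
The marker-file idea above can be sketched as a toy; this is a hypothetical illustration (the `.markers` layout, file names, and methods are invented, not Hudi's actual `.hoodie` conventions): a marker is created before each data file, a successful commit deletes markers last, and recovery deletes exactly the files whose markers were left behind.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

// Toy marker-based rollback: markers correspond 1-1 with written data
// files; leftover markers identify the partial files to nullify.
class MarkerRollbackSketch {
    static Path newTempTable() {
        try { return Files.createTempDirectory("marker-sketch"); }
        catch (IOException e) { throw new UncheckedIOException(e); }
    }

    static void writeWithMarker(Path table, String fileName, byte[] bytes) {
        try {
            Path markers = Files.createDirectories(table.resolve(".markers"));
            Files.createFile(markers.resolve(fileName + ".marker")); // marker first
            Files.write(table.resolve(fileName), bytes);             // then the data file
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    static void commit(Path table) {
        // Commit succeeded: delete the markers last, so a crash before this
        // point still leaves evidence of every file written.
        try {
            List<Path> ms;
            try (Stream<Path> s = Files.list(table.resolve(".markers"))) {
                ms = s.collect(Collectors.toList());
            }
            for (Path m : ms) Files.delete(m);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // On the next write, leftover markers tell us which files were partial.
    static List<String> rollback(Path table) {
        List<String> rolledBack = new ArrayList<>();
        Path markers = table.resolve(".markers");
        if (!Files.isDirectory(markers)) return rolledBack;
        try {
            List<Path> ms;
            try (Stream<Path> s = Files.list(markers)) {
                ms = s.collect(Collectors.toList());
            }
            for (Path m : ms) {
                String dataFile = m.getFileName().toString().replace(".marker", "");
                Files.deleteIfExists(table.resolve(dataFile)); // nullify the partial write
                Files.delete(m);
                rolledBack.add(dataFile);
            }
        } catch (IOException e) { throw new UncheckedIOException(e); }
        return rolledBack;
    }
}
```

The key ordering property is that a marker always exists before its data file and outlives it until commit, so recovery needs no workload profile from the inflight commit file at all.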

Let me know if this makes sense in a general way. We can file a separate 
JIRA and get working on it.

[~xleesf] [~vbalaji] [~vinoyang] [~nagarwal] 




[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2020-01-21 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17020347#comment-17020347
 ] 

Raymond Xu commented on HUDI-309:
-

Made a design diagram for ease of discussion



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-09 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991749#comment-16991749
 ] 

Raymond Xu commented on HUDI-309:
-

[~vbalaji] Thanks for the clarification. Yes, January would be great for me 
as well. (Vacation too :))



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-09 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991746#comment-16991746
 ] 

Balaji Varadarajan commented on HUDI-309:
-

[~nicholasjiang] : Absolutely. 



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-09 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991682#comment-16991682
 ] 

Nicholas Jiang commented on HUDI-309:
-

[~vbalaji] Yeah, I would like to collaborate with you on this redesign, and 
will wait for you so we can work on the refactor together. Could we discuss 
further in the Hudi Slack group?



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-09 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991654#comment-16991654
 ] 

Balaji Varadarajan commented on HUDI-309:
-

[~nicholasjiang] [~rxu]: Sorry for the late reply. Yes, we would need 
multi-level partitioning (actionType, commitTime) to make the storage 
structure efficient for timeline-query use cases.

I would be more than happy to collaborate with you in discussing this 
further and coming up with the final complete design. Would it work for you 
if we revived this discussion in January? I was planning to work on this 
once I finish up the Bootstrap proposal for Hudi and am back from vacation :)



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-08 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16991142#comment-16991142
 ] 

Vinoth Chandar commented on HUDI-309:
-

[~vbalaji] ^^ 



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-07 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990474#comment-16990474
 ] 

Nicholas Jiang commented on HUDI-309:
-

[~vinoth] Thanks very much. And [~vbalaji], what's your plan for this 
refactor? I'm interested in cooperating with you to complete it.



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-06 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990037#comment-16990037
 ] 

Vinoth Chandar commented on HUDI-309:
-

[~vbalaji] is actually driving the plan (maybe you should own the JIRA?). 
Can you please respond to Nicholas?



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-02 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986500#comment-16986500
 ] 

Nicholas Jiang commented on HUDI-309:
-

Thanks to [~vinoth] and [~rxu] for the explanation. At present I am more 
concerned about what this plan looks like, and whether I can participate in the 
development of subtasks. I have much interest in this module refactor.



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-12-02 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16986261#comment-16986261
 ] 

Vinoth Chandar commented on HUDI-309:
-

> the same active timeline, commitTime is the natural identifier for commit 
> metadata.
commitTime + actionType + state would be the identifier.

> the files under each type partition will keep growing. Eventually if too many 
> files accumulated under a particular partition (say "commit/"), would that 
> cause issue on scanning?
Since we would also compact these, as you mention, it won't actually grow 
unbounded.

 

The rationale for partitioning by action type comes from how we access these in 
code (the active timeline will remain as-is): we typically obtain a specific 
timeline of actions: cleanTimeline, commitTimeline, deltaCommitTimeline, etc. 
We can sub-partition by commitTime if it helps speed things up and seems like a 
scalable approach. We could incorporate both. Without the partitioning by 
action, we would have to scan the entire range of the timeline for each action.

All that said, what we have here is a strawman proposal. We will have to do an 
RFC and flesh these out more.
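The access pattern being described can be sketched minimally, assuming archived instants are just (action, commitTime) pairs; the class and method names are hypothetical, not Hudi's API. With per-action partitions, a commitTimeline-style call reads one bucket, while a flat layout has to filter everything:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch (not Hudi code): per-action partitions vs. a flat scan. */
class TimelineAccessSketch {
  static final class Instant {
    final String action, commitTime;
    Instant(String action, String commitTime) {
      this.action = action;
      this.commitTime = commitTime;
    }
  }

  // Flat layout: obtaining one action's timeline must filter the whole archive.
  static List<Instant> byFullScan(List<Instant> all, String action) {
    List<Instant> out = new ArrayList<>();
    for (Instant i : all) {
      if (i.action.equals(action)) out.add(i);
    }
    return out;
  }

  // Partitioned layout: group once; a getCommitsTimeline()-style call then
  // reads only its own bucket.
  static Map<String, List<Instant>> partitionByAction(List<Instant> all) {
    Map<String, List<Instant>> parts = new HashMap<>();
    for (Instant i : all) {
      parts.computeIfAbsent(i.action, k -> new ArrayList<>()).add(i);
    }
    return parts;
  }

  public static void main(String[] args) {
    List<Instant> all = Arrays.asList(
        new Instant("commit", "001"),
        new Instant("clean", "002"),
        new Instant("commit", "003"));
    System.out.println(byFullScan(all, "commit").size());            // 2
    System.out.println(partitionByAction(all).get("commit").size()); // 2, without touching "clean"
  }
}
```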



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-11-27 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16984047#comment-16984047
 ] 

Raymond Xu commented on HUDI-309:
-

[~nicholasjiang] I think the design in "Description" is meant for treating each 
<CommitTime, Metadata> pair as an event (like a message consumed from Kafka) 
and then appending it to some metadata log files, which will then be compacted 
into HFile for future query purposes. I think for the same active timeline, 
commitTime is the natural identifier for commit metadata. ([~vinoth] [~vbalaji] 
please correct me if I misunderstand it.)

So far my concern with this design is the action-type partitioning: as the 
action types are fixed to a small number, the files under each type partition 
will keep growing. Eventually, if too many files accumulate under a particular 
partition (say "commit/"), would that cause issues when scanning?

How about, as [~nicholasjiang] also points out, partitioning by "commitTime" 
(converted to "yyyy/MM/dd" or "yyyy/MM", or configurable)? That would set a 
certain upper bound on the number of files.
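The date-bucketing idea can be sketched as follows, assuming Hudi's usual yyyyMMddHHmmss instant-time strings; the helper class and the granularity flag are hypothetical, for illustration only:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

/** Illustrative sketch (not Hudi code): bucketing commit times into date partitions. */
class CommitTimePartitioner {
  // commitTime is assumed to be a yyyyMMddHHmmss string, e.g. "20191127103045".
  static String partitionFor(String commitTime, String granularity) {
    // The first 8 characters are the yyyyMMdd date portion.
    LocalDate d = LocalDate.parse(commitTime.substring(0, 8), DateTimeFormatter.BASIC_ISO_DATE);
    return granularity.equals("month")
        ? String.format("%04d/%02d", d.getYear(), d.getMonthValue())
        : String.format("%04d/%02d/%02d", d.getYear(), d.getMonthValue(), d.getDayOfMonth());
  }

  public static void main(String[] args) {
    System.out.println(partitionFor("20191127103045", "day"));   // 2019/11/27
    System.out.println(partitionFor("20191127103045", "month")); // 2019/11
  }
}
```

With a fixed clock and a bounded archiving cadence, each such bucket holds a bounded number of files, which is the upper bound the comment refers to.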



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-11-26 Thread Nicholas Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16983139#comment-16983139
 ] 

Nicholas Jiang commented on HUDI-309:
-

Hi [~vinoth], [~vbalaji], [~rxu]. In my opinion, partitioning by commitTime 
could avoid generating many small files, and an optimal format (like HFile) 
could store the compacted versions of <CommitTime, Metadata> pairs. Based on 
partitioning by commitTime, we could first route to the right partition when 
scanning the archived timeline, then find the metadata mapped to that partition.



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-11-18 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976589#comment-16976589
 ] 

Vinoth Chandar commented on HUDI-309:
-

[~rxu] We typically look up the timeline by its action type (commits, 
deltacommits), like the code above; the partitioning just matches that. One of 
the main goals in this effort is to be able to "look up" the state of any 
instant/action and make decisions.

A lot of things are still TBD, but this is the high-level idea atm.



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-11-15 Thread Raymond Xu (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16975525#comment-16975525
 ] 

Raymond Xu commented on HUDI-309:
-

[~vinoth]

Just trying to understand the new design... could you help clarify the 
advantage of partitioning by action types? Is it to avoid redundant scanning? 
Would it be better to partition by datetime for an even distribution?



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-11-11 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971747#comment-16971747
 ] 

Balaji Varadarajan commented on HUDI-309:
-

[https://github.com/apache/incubator-hudi/blob/23b303e4b17c5f7b603900ee5b0d2e6718118014/hudi-client/src/main/java/org/apache/hudi/HoodieWriteClient.java#L860]
{code:java}
if (!table.getActiveTimeline().getCleanerTimeline().empty()) {
  logger.info("Cleaning up older rollback meta files");
  // Cleanup of older cleaner meta files
  // TODO - make the commit archival generic and archive rollback metadata
  FSUtils.deleteOlderRollbackMetaFiles(fs, table.getMetaClient().getMetaPath(),
      table.getActiveTimeline().getRollbackTimeline().getInstants());
}
{code}
 
 

As part of PR-942, the above code was removed, as it is handled elsewhere. Just 
noting that we need to ensure cleaner commits are also handled correctly during 
archiving.



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-10-20 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955672#comment-16955672
 ] 

Vinoth Chandar commented on HUDI-309:
-

https://issues.apache.org/jira/browse/HUDI-261 for more as well



[jira] [Commented] (HUDI-309) General Redesign of Archived Timeline for efficient scan and management

2019-10-20 Thread Balaji Varadarajan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955669#comment-16955669
 ] 

Balaji Varadarajan commented on HUDI-309:
-

Look at HUDI-167 for more details.
