[ 
https://issues.apache.org/jira/browse/HUDI-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17509568#comment-17509568
 ] 

sivabalan narayanan commented on HUDI-2751:
-------------------------------------------

Synced up via direct chat w/ Danny. Here is the gist.

Streaming reads in both Spark and Flink watch for new timeline files and 
serve them to the caller. Even though we do filter records by commit time, 
this ticket is about an optimization wherein we can avoid that filtering where 
possible. 
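To make the filtering being discussed concrete, here is a minimal sketch (illustrative only, not Hudi's actual API): each record is assumed to carry its commit instant time, and an incremental reader drops records outside the requested range. The field name `commit_time` and the helper below are hypothetical.

```python
# Hypothetical sketch of commit-time filtering for an incremental read.
# A record is kept only if its commit instant falls in the consume range
# (exclusive start, inclusive end). Zero-padded instant strings compare
# correctly lexicographically.

def filter_by_commit_time(records, start_instant, end_instant):
    """Keep records committed within (start_instant, end_instant]."""
    return [r for r in records
            if start_instant < r["commit_time"] <= end_instant]

records = [
    {"key": "a", "commit_time": "099"},
    {"key": "b", "commit_time": "100"},
    {"key": "c", "commit_time": "102"},
]

# A read that starts after instant 099 drops the record committed at 099.
result = filter_by_commit_time(records, "099", "102")
print([r["key"] for r in result])
```

The optimization this ticket is after is skipping this per-record filter entirely when the timeline guarantees no overlap between consume ranges.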

But this spans multiple areas, not just MOR compaction: for example, COW 
merge, MOR compaction, and clustering as well. 

 

So, this needs holistic thought. Danny and I will tackle this for 0.12. 

 

 

 

> To avoid the duplicates for streaming read MOR table
> ----------------------------------------------------
>
>                 Key: HUDI-2751
>                 URL: https://issues.apache.org/jira/browse/HUDI-2751
>             Project: Apache Hudi
>          Issue Type: Task
>          Components: Common Core
>            Reporter: Danny Chen
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 0.11.0
>
>
> Imagine there are commits on the timeline:
> {noformat}
> ----- delta-99 ----- commit 100 (includes delta-99 data set) ----- delta-101 ----- delta-102 -----
>                first read ->|  second read ->
> ------ range 1 ------------|-------------------- range 2 --------------------|
> {noformat}
> Instants 99, 101, and 102 are successful non-compaction delta commits;
> instant 100 is a successful compaction instant.
> The first incremental read consumes up to instant 99, and the second read 
> consumes from instant 100 to instant 102; the second read would consume the 
> commit files of instant 100, whose data has already been consumed before.
> The duplicate reading happens when this condition triggers: a compaction 
> instant is scheduled and then completes within *one* consume range.
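
The timeline scenario above can be modeled with a small sketch (a hypothetical model, not Hudi's API): delta commits 099, 101, 102 and a compaction commit 100 that rewrites data already delivered through instant 099. A naive incremental read over range 2 re-delivers that data.

```python
# Hypothetical timeline model for the scenario in the description.
# The compaction at instant 100 includes the data from delta 099,
# which the first read already delivered.

timeline = [
    ("099", "delta"),
    ("100", "compaction"),  # rewrites data from delta 099
    ("101", "delta"),
    ("102", "delta"),
]

def incremental_read(start_exclusive, end_inclusive):
    """Naively return every instant in the consume range."""
    return [(t, kind) for t, kind in timeline
            if start_exclusive < t <= end_inclusive]

first = incremental_read("000", "099")   # range 1: up to delta 099
second = incremental_read("099", "102")  # range 2: instants 100..102

# The compaction instant falls inside the second consume range, so its
# files re-deliver data already consumed in range 1 unless compaction
# commits (or the records inside them) are filtered out.
duplicated = [i for i in second if i[1] == "compaction"]
print(duplicated)
```

This is exactly the trigger condition stated above: the duplicate appears only when a compaction instant is scheduled and completes within a single consume range.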



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
