[
https://issues.apache.org/jira/browse/HUDI-2751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17718151#comment-17718151
]
Danny Chen commented on HUDI-2751:
----------------------------------
> So, no records from the new base parquet file created from compaction will be
> served with incremental read
That's true. To optimize it further, these parquet files from compaction and
clustering could be skipped for incremental sources; we have already implemented
that for the Flink streaming reader by adding two options:
{code:java}
read.streaming.skip_compaction
read.streaming.skip_clustering{code}
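For reference, these options are set in the {{WITH}} clause of the Flink SQL table definition; the table name, schema, and path below are illustrative only:
{code:sql}
CREATE TABLE hudi_source (
  id BIGINT,
  name STRING,
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs:///tmp/hudi_table',           -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.streaming.skip_compaction' = 'true',   -- skip base files produced by compaction
  'read.streaming.skip_clustering' = 'true'    -- skip files produced by clustering
);
{code}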
I think we can close this out.
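Conceptually, the skip works by filtering out compaction/clustering instants when building the incremental read plan, since their data was already consumed as delta commits. A minimal self-contained sketch (the {{Instant}} type and action strings here are simplified stand-ins, not Hudi's actual timeline API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Simplified instant model: "deltacommit" for delta writes,
// "commit" for a completed compaction, "replacecommit" for clustering.
class Instant {
    final String timestamp;
    final String action;
    Instant(String timestamp, String action) {
        this.timestamp = timestamp;
        this.action = action;
    }
}

public class SkipCompactionSketch {
    // Build the incremental consume range, keeping only delta commits and
    // skipping instants whose data was already served in an earlier range.
    static List<String> incrementalInstants(List<Instant> timeline, String fromTs, String toTs) {
        return timeline.stream()
            .filter(i -> i.timestamp.compareTo(fromTs) >= 0 && i.timestamp.compareTo(toTs) <= 0)
            .filter(i -> i.action.equals("deltacommit"))
            .map(i -> i.timestamp)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Instant> timeline = Arrays.asList(
            new Instant("099", "deltacommit"),
            new Instant("100", "commit"),       // compaction of delta-99: already consumed
            new Instant("101", "deltacommit"),
            new Instant("102", "deltacommit"));
        // Second read from instant 100 to 102 skips the compaction commit.
        System.out.println(incrementalInstants(timeline, "100", "102")); // prints [101, 102]
    }
}
```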
> To avoid the duplicates for streaming read MOR table
> ----------------------------------------------------
>
> Key: HUDI-2751
> URL: https://issues.apache.org/jira/browse/HUDI-2751
> Project: Apache Hudi
> Issue Type: Improvement
> Components: Common Core
> Reporter: Danny Chen
> Assignee: sivabalan narayanan
> Priority: Critical
>
> Imagine there are commits on the timeline:
> {noformat}
> ----- delta-99 ----- commit-100 (includes delta-99 data set) ----- delta-101 ----- delta-102 -----
> |<------ first read: range 1 ------>|<------------- second read: range 2 ------------------->|
> {noformat}
> Instants 99, 101, 102 are successful non-compaction delta commits;
> instant 100 is a successful compaction instant.
> The first incremental read consumes up to instant 99, and the second read
> consumes from instant 100 to instant 102; the second read would consume the
> commit files of instant 100, whose data has already been consumed before.
> The duplicate reading happens when this condition triggers: a compaction
> instant is scheduled and then completes within *one* consume range.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)