This is an automated email from the ASF dual-hosted git repository.
codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a7c3d66c9d [DOCS] Update lingering read.streaming.start-commit config values and added disclaimer for `read.streaming.skip_compaction` (#6855)
a7c3d66c9d is described below
commit a7c3d66c9d35358dd834abae71793e74cefc29fb
Author: voonhous <[email protected]>
AuthorDate: Tue Nov 29 21:35:49 2022 +0800
[DOCS] Update lingering read.streaming.start-commit config values and added disclaimer for `read.streaming.skip_compaction` (#6855)
---
website/docs/flink-quick-start-guide.md | 2 +-
website/docs/querying_data.md | 27 +++++++++++++++++-----
.../version-0.11.0/flink-quick-start-guide.md | 2 +-
.../versioned_docs/version-0.11.0/querying_data.md | 2 +-
.../version-0.11.1/flink-quick-start-guide.md | 2 +-
.../versioned_docs/version-0.11.1/querying_data.md | 2 +-
.../version-0.12.0/flink-quick-start-guide.md | 2 +-
.../versioned_docs/version-0.12.0/querying_data.md | 2 +-
8 files changed, 28 insertions(+), 13 deletions(-)
diff --git a/website/docs/flink-quick-start-guide.md b/website/docs/flink-quick-start-guide.md
index 5108f3f893..378fd788c8 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -310,7 +310,7 @@ WITH (
select * from t1;
```
-This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+This will give all changes that happened after the `read.start-commit` commit. The unique thing about this
feature is that it now lets you author streaming pipelines on streaming or batch data source.
### Delete Data {#deletes}
diff --git a/website/docs/querying_data.md b/website/docs/querying_data.md
index 27551e5235..d3ba702a7d 100644
--- a/website/docs/querying_data.md
+++ b/website/docs/querying_data.md
@@ -125,7 +125,7 @@ in the filter. Filters push down is not supported yet (already on the roadmap).
For MERGE_ON_READ table, in order to query hudi table as a streaming, you need to add option `'read.streaming.enabled' = 'true'`,
when querying the table, a Flink streaming pipeline starts and never ends until the user cancel the job manually.
-You can specify the start commit with option `read.streaming.start-commit` and source monitoring interval with option
+You can specify the start commit with option `read.start-commit` and source monitoring interval with option
`read.streaming.check-interval`.
### Streaming Query
@@ -138,14 +138,29 @@ value as `earliest` if you want to consume all the history data set.
| ----------- | ------- | ------- | ------- |
| `read.streaming.enabled` | false | `false` | Specify `true` to read as streaming |
| `read.start-commit` | false | the latest commit | Start commit time in format 'yyyyMMddHHmmss', use `earliest` to consume from the start commit |
-| `read.streaming.skip_compaction` | false | `false` | Whether to skip compaction commits while reading, generally for two purposes: 1) Avoid consuming duplications from the compaction instants 2) When change log mode is enabled, to only consume change logs for right semantics. |
+| `read.streaming.skip_compaction` | false | `false` | Whether to skip compaction instants for streaming read, generally for two purposes: 1) Avoid consuming duplicates from compaction instants created by Hudi versions < 0.11.0 or when `hoodie.compaction.preserve.commit.metadata` is disabled 2) When change log mode is enabled, to only consume change logs for right semantics. |
| `clean.retain_commits` | false | `10` | The max number of commits to retain before cleaning, when change log mode is enabled, tweaks this option to adjust the change log live time. For example, the default strategy keeps 50 minutes of change logs if the checkpoint interval is set up as 5 minutes. |
:::note
-When option `read.streaming.skip_compaction` turns on and the streaming reader lags behind by commits of number
-`clean.retain_commits`, the data loss may occur. The compaction keeps the original instant time as the per-record metadata,
-the streaming reader would read and skip the whole base files if the log has been consumed. For efficiency, option `read.streaming.skip_compaction`
-is till suggested being `true`.
+When option `read.streaming.skip_compaction` is enabled and the streaming reader lags behind by more than
+`clean.retain_commits` commits, data loss may occur.
+
+The compaction table service action preserves the original commit time for each row. When iterating through the parquet files,
+the streaming reader checks whether the row's commit time falls within the specified instant range to
+skip over rows that have been read before.
+
+For efficiency, option `read.streaming.skip_compaction` can be enabled to skip reading the parquet files entirely.
+:::
+
+:::note
+`read.streaming.skip_compaction` should only be enabled if the MOR table is compacted by Hudi versions `< 0.11.0`.
+
+This is because the `hoodie.compaction.preserve.commit.metadata` feature was only introduced in Hudi versions `>= 0.11.0`.
+Older versions will overwrite the original commit time for each row with the compaction plan's instant time.
+
+This renders the Hudi-on-Flink stream reader's row-level instant-range checks ineffective.
+When the original instant time is overwritten with a newer instant time, the stream reader will not be able to
+differentiate rows that have already been read before from actual new rows.
:::
### Incremental Query
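
For reference, the streaming read options documented in this hunk can be combined in a Flink SQL DDL roughly as follows. This is a sketch, not part of the patch; the table name, schema, and path are illustrative, while the option keys are the ones the docs describe:

```sql
-- Minimal MERGE_ON_READ streaming source (table name, schema and path are made up).
CREATE TABLE t1 (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/t1',                -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',        -- start a continuous streaming read
  'read.start-commit' = 'earliest',         -- or 'yyyyMMddHHmmss' to start from a given commit
  'read.streaming.check-interval' = '4'     -- source monitoring interval, in seconds
);

select * from t1;
```

Note the use of `read.start-commit`, the key the docs are updated to use in place of the lingering `read.streaming.start-commit`.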
diff --git a/website/versioned_docs/version-0.11.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.11.0/flink-quick-start-guide.md
index e9869c7fce..14bab8ab60 100644
--- a/website/versioned_docs/version-0.11.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.11.0/flink-quick-start-guide.md
@@ -162,7 +162,7 @@ WITH (
select * from t1;
```
-This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+This will give all changes that happened after the `read.start-commit` commit. The unique thing about this
feature is that it now lets you author streaming pipelines on streaming or batch data source.
### Delete Data {#deletes}
diff --git a/website/versioned_docs/version-0.11.0/querying_data.md b/website/versioned_docs/version-0.11.0/querying_data.md
index e07d54ce43..b7cc285ab1 100644
--- a/website/versioned_docs/version-0.11.0/querying_data.md
+++ b/website/versioned_docs/version-0.11.0/querying_data.md
@@ -125,7 +125,7 @@ in the filter. Filters push down is not supported yet (already on the roadmap).
For MERGE_ON_READ table, in order to query hudi table as a streaming, you need to add option `'read.streaming.enabled' = 'true'`,
when querying the table, a Flink streaming pipeline starts and never ends until the user cancel the job manually.
-You can specify the start commit with option `read.streaming.start-commit` and source monitoring interval with option
+You can specify the start commit with option `read.start-commit` and source monitoring interval with option
`read.streaming.check-interval`.
## Hive
diff --git a/website/versioned_docs/version-0.11.1/flink-quick-start-guide.md b/website/versioned_docs/version-0.11.1/flink-quick-start-guide.md
index e9869c7fce..14bab8ab60 100644
--- a/website/versioned_docs/version-0.11.1/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.11.1/flink-quick-start-guide.md
@@ -162,7 +162,7 @@ WITH (
select * from t1;
```
-This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+This will give all changes that happened after the `read.start-commit` commit. The unique thing about this
feature is that it now lets you author streaming pipelines on streaming or batch data source.
### Delete Data {#deletes}
diff --git a/website/versioned_docs/version-0.11.1/querying_data.md b/website/versioned_docs/version-0.11.1/querying_data.md
index a7a8a6c7ab..38ab138b39 100644
--- a/website/versioned_docs/version-0.11.1/querying_data.md
+++ b/website/versioned_docs/version-0.11.1/querying_data.md
@@ -125,7 +125,7 @@ in the filter. Filters push down is not supported yet (already on the roadmap).
For MERGE_ON_READ table, in order to query hudi table as a streaming, you need to add option `'read.streaming.enabled' = 'true'`,
when querying the table, a Flink streaming pipeline starts and never ends until the user cancel the job manually.
-You can specify the start commit with option `read.streaming.start-commit` and source monitoring interval with option
+You can specify the start commit with option `read.start-commit` and source monitoring interval with option
`read.streaming.check-interval`.
## Hive
diff --git a/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
index d0dd5cdf10..254a532785 100644
--- a/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
@@ -310,7 +310,7 @@ WITH (
select * from t1;
```
-This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+This will give all changes that happened after the `read.start-commit` commit. The unique thing about this
feature is that it now lets you author streaming pipelines on streaming or batch data source.
### Delete Data {#deletes}
diff --git a/website/versioned_docs/version-0.12.0/querying_data.md b/website/versioned_docs/version-0.12.0/querying_data.md
index e2a4782f04..40bd594ca3 100644
--- a/website/versioned_docs/version-0.12.0/querying_data.md
+++ b/website/versioned_docs/version-0.12.0/querying_data.md
@@ -125,7 +125,7 @@ in the filter. Filters push down is not supported yet (already on the roadmap).
For MERGE_ON_READ table, in order to query hudi table as a streaming, you need to add option `'read.streaming.enabled' = 'true'`,
when querying the table, a Flink streaming pipeline starts and never ends until the user cancel the job manually.
-You can specify the start commit with option `read.streaming.start-commit` and source monitoring interval with option
+You can specify the start commit with option `read.start-commit` and source monitoring interval with option
`read.streaming.check-interval`.
### Streaming Query
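
Putting the new `read.streaming.skip_compaction` disclaimer into practice, a reader of a MOR table compacted by Hudi `< 0.11.0` (or with `hoodie.compaction.preserve.commit.metadata` disabled) might enable the skip as sketched below. Everything except the documented option keys is illustrative:

```sql
-- Streaming read that skips compaction instants entirely, avoiding duplicate
-- rows whose original commit time was overwritten by pre-0.11.0 compaction.
CREATE TABLE t1_skip_compaction (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/t1',                    -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.start-commit' = 'earliest',
  'read.streaming.skip_compaction' = 'true'     -- per the disclaimer: only for tables compacted by Hudi < 0.11.0
);
```

Per the note in the patch, keep such a reader within `clean.retain_commits` commits of the table head, or data loss may occur.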