This is an automated email from the ASF dual-hosted git repository.
codope pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new a7c3d66c9d [DOCS] Update lingering read.streaming.start-commit config values and added disclaimer for `read.streaming.skip_compaction` (#6855)
a7c3d66c9d is described below
commit a7c3d66c9d35358dd834abae71793e74cefc29fb
Author: voonhous <[email protected]>
AuthorDate: Tue Nov 29 21:35:49 2022 +0800
[DOCS] Update lingering read.streaming.start-commit config values and added disclaimer for `read.streaming.skip_compaction` (#6855)
---
website/docs/flink-quick-start-guide.md | 2 +-
website/docs/querying_data.md | 27 +++++++++++++++++-----
.../version-0.11.0/flink-quick-start-guide.md | 2 +-
.../versioned_docs/version-0.11.0/querying_data.md | 2 +-
.../version-0.11.1/flink-quick-start-guide.md | 2 +-
.../versioned_docs/version-0.11.1/querying_data.md | 2 +-
.../version-0.12.0/flink-quick-start-guide.md | 2 +-
.../versioned_docs/version-0.12.0/querying_data.md | 2 +-
8 files changed, 28 insertions(+), 13 deletions(-)
diff --git a/website/docs/flink-quick-start-guide.md b/website/docs/flink-quick-start-guide.md
index 5108f3f893..378fd788c8 100644
--- a/website/docs/flink-quick-start-guide.md
+++ b/website/docs/flink-quick-start-guide.md
@@ -310,7 +310,7 @@ WITH (
select * from t1;
```
-This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+This will give all changes that happened after the `read.start-commit` commit. The unique thing about this
feature is that it now lets you author streaming pipelines on streaming or batch data source.
### Delete Data {#deletes}
diff --git a/website/docs/querying_data.md b/website/docs/querying_data.md
index 27551e5235..d3ba702a7d 100644
--- a/website/docs/querying_data.md
+++ b/website/docs/querying_data.md
@@ -125,7 +125,7 @@ in the filter. Filters push down is not supported yet (already on the roadmap).
For MERGE_ON_READ table, in order to query hudi table as a streaming, you need to add option `'read.streaming.enabled' = 'true'`,
when querying the table, a Flink streaming pipeline starts and never ends until the user cancel the job manually.
-You can specify the start commit with option `read.streaming.start-commit` and source monitoring interval with option
+You can specify the start commit with option `read.start-commit` and source monitoring interval with option
`read.streaming.check-interval`.
### Streaming Query
@@ -138,14 +138,29 @@ value as `earliest` if you want to consume all the history data set.
| ----------- | ------- | ------- | ------- |
| `read.streaming.enabled` | false | `false` | Specify `true` to read as streaming |
| `read.start-commit` | false | the latest commit | Start commit time in format 'yyyyMMddHHmmss', use `earliest` to consume from the start commit |
-| `read.streaming.skip_compaction` | false | `false` | Whether to skip compaction commits while reading, generally for two purposes: 1) Avoid consuming duplications from the compaction instants 2) When change log mode is enabled, to only consume change logs for right semantics. |
+| `read.streaming.skip_compaction` | false | `false` | Whether to skip compaction instants for streaming read, generally for two purposes: 1) Avoid consuming duplicates from compaction instants created by Hudi versions < 0.11.0 or when `hoodie.compaction.preserve.commit.metadata` is disabled 2) When change log mode is enabled, to only consume change logs for right semantics. |
| `clean.retain_commits` | false | `10` | The max number of commits to retain before cleaning, when change log mode is enabled, tweaks this option to adjust the change log live time. For example, the default strategy keeps 50 minutes of change logs if the checkpoint interval is set up as 5 minutes. |
:::note
-When option `read.streaming.skip_compaction` turns on and the streaming reader lags behind by commits of number
-`clean.retain_commits`, the data loss may occur. The compaction keeps the original instant time as the per-record metadata,
-the streaming reader would read and skip the whole base files if the log has been consumed. For efficiency, option `read.streaming.skip_compaction`
-is till suggested being `true`.
+When option `read.streaming.skip_compaction` is enabled and the streaming reader lags behind by more than
+`clean.retain_commits` commits, data loss may occur.
+
+The compaction table service action preserves the original commit time for each row. When iterating through the parquet files,
+the streaming reader checks whether the row's commit time falls within the specified instant range to
+skip over rows that have been read before.
+
+For efficiency, option `read.streaming.skip_compaction` can be enabled to skip reading the parquet files entirely.
+:::
+
+:::note
+`read.streaming.skip_compaction` should only be enabled if the MOR table is compacted by Hudi versions `< 0.11.0`.
+
+This is because the `hoodie.compaction.preserve.commit.metadata` feature was only introduced in Hudi versions `>= 0.11.0`.
+Older versions will overwrite the original commit time for each row with the compaction plan's instant time.
+
+This renders the Hudi-on-Flink stream reader's row-level instant-range checks ineffective.
+When the original instant time is overwritten with a newer instant time, the stream reader will not be able to
+differentiate rows that have already been read before from actual new rows.
:::
### Incremental Query
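
For reference, the streaming read options documented in this hunk can be combined in a Flink SQL DDL roughly as follows. This is a sketch, not part of the patch; the table name, schema, and path are illustrative, while the option keys are the ones the docs describe:

```sql
-- Minimal MERGE_ON_READ streaming source (table name, schema and path are made up).
CREATE TABLE t1 (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/t1',                -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',        -- start a continuous streaming read
  'read.start-commit' = 'earliest',         -- or 'yyyyMMddHHmmss' to start from a given commit
  'read.streaming.check-interval' = '4'     -- source monitoring interval, in seconds
);

select * from t1;
```

Note the use of `read.start-commit`, the key the docs are updated to use in place of the lingering `read.streaming.start-commit`.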
diff --git a/website/versioned_docs/version-0.11.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.11.0/flink-quick-start-guide.md
index e9869c7fce..14bab8ab60 100644
--- a/website/versioned_docs/version-0.11.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.11.0/flink-quick-start-guide.md
@@ -162,7 +162,7 @@ WITH (
select * from t1;
```
-This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+This will give all changes that happened after the `read.start-commit` commit. The unique thing about this
feature is that it now lets you author streaming pipelines on streaming or batch data source.
### Delete Data {#deletes}
diff --git a/website/versioned_docs/version-0.11.0/querying_data.md b/website/versioned_docs/version-0.11.0/querying_data.md
index e07d54ce43..b7cc285ab1 100644
--- a/website/versioned_docs/version-0.11.0/querying_data.md
+++ b/website/versioned_docs/version-0.11.0/querying_data.md
@@ -125,7 +125,7 @@ in the filter. Filters push down is not supported yet (already on the roadmap).
For MERGE_ON_READ table, in order to query hudi table as a streaming, you need to add option `'read.streaming.enabled' = 'true'`,
when querying the table, a Flink streaming pipeline starts and never ends until the user cancel the job manually.
-You can specify the start commit with option `read.streaming.start-commit` and source monitoring interval with option
+You can specify the start commit with option `read.start-commit` and source monitoring interval with option
`read.streaming.check-interval`.
## Hive
diff --git a/website/versioned_docs/version-0.11.1/flink-quick-start-guide.md b/website/versioned_docs/version-0.11.1/flink-quick-start-guide.md
index e9869c7fce..14bab8ab60 100644
--- a/website/versioned_docs/version-0.11.1/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.11.1/flink-quick-start-guide.md
@@ -162,7 +162,7 @@ WITH (
select * from t1;
```
-This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+This will give all changes that happened after the `read.start-commit` commit. The unique thing about this
feature is that it now lets you author streaming pipelines on streaming or batch data source.
### Delete Data {#deletes}
diff --git a/website/versioned_docs/version-0.11.1/querying_data.md b/website/versioned_docs/version-0.11.1/querying_data.md
index a7a8a6c7ab..38ab138b39 100644
--- a/website/versioned_docs/version-0.11.1/querying_data.md
+++ b/website/versioned_docs/version-0.11.1/querying_data.md
@@ -125,7 +125,7 @@ in the filter. Filters push down is not supported yet (already on the roadmap).
For MERGE_ON_READ table, in order to query hudi table as a streaming, you need to add option `'read.streaming.enabled' = 'true'`,
when querying the table, a Flink streaming pipeline starts and never ends until the user cancel the job manually.
-You can specify the start commit with option `read.streaming.start-commit` and source monitoring interval with option
+You can specify the start commit with option `read.start-commit` and source monitoring interval with option
`read.streaming.check-interval`.
## Hive
diff --git a/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md b/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
index d0dd5cdf10..254a532785 100644
--- a/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
+++ b/website/versioned_docs/version-0.12.0/flink-quick-start-guide.md
@@ -310,7 +310,7 @@ WITH (
select * from t1;
```
-This will give all changes that happened after the `read.streaming.start-commit` commit. The unique thing about this
+This will give all changes that happened after the `read.start-commit` commit. The unique thing about this
feature is that it now lets you author streaming pipelines on streaming or batch data source.
### Delete Data {#deletes}
diff --git a/website/versioned_docs/version-0.12.0/querying_data.md b/website/versioned_docs/version-0.12.0/querying_data.md
index e2a4782f04..40bd594ca3 100644
--- a/website/versioned_docs/version-0.12.0/querying_data.md
+++ b/website/versioned_docs/version-0.12.0/querying_data.md
@@ -125,7 +125,7 @@ in the filter. Filters push down is not supported yet (already on the roadmap).
For MERGE_ON_READ table, in order to query hudi table as a streaming, you need to add option `'read.streaming.enabled' = 'true'`,
when querying the table, a Flink streaming pipeline starts and never ends until the user cancel the job manually.
-You can specify the start commit with option `read.streaming.start-commit` and source monitoring interval with option
+You can specify the start commit with option `read.start-commit` and source monitoring interval with option
`read.streaming.check-interval`.
### Streaming Query
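
Putting the new `read.streaming.skip_compaction` disclaimer into practice, a reader of a MOR table compacted by Hudi `< 0.11.0` (or with `hoodie.compaction.preserve.commit.metadata` disabled) might enable the skip as sketched below. Everything except the documented option keys is illustrative:

```sql
-- Streaming read that skips compaction instants entirely, avoiding duplicate
-- rows whose original commit time was overwritten by pre-0.11.0 compaction.
CREATE TABLE t1_skip_compaction (
  uuid VARCHAR(20) PRIMARY KEY NOT ENFORCED,
  name VARCHAR(10),
  ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/t1',                    -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  'read.streaming.enabled' = 'true',
  'read.start-commit' = 'earliest',
  'read.streaming.skip_compaction' = 'true'     -- per the disclaimer: only for tables compacted by Hudi < 0.11.0
);
```

Per the note in the patch, keep such a reader within `clean.retain_commits` commits of the table head, or data loss may occur.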