This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 964ba2329b [HUDI-4063] Update the site doc for flink since release 0.11 (#5538)
964ba2329b is described below

commit 964ba2329b6c96902941254c4196f193cf543d02
Author: Danny Chan <[email protected]>
AuthorDate: Mon May 9 17:47:16 2022 +0800

    [HUDI-4063] Update the site doc for flink since release 0.11 (#5538)
---
 website/docs/compaction.md                                    |  4 +++-
 website/docs/hoodie_deltastreamer.md                          | 10 +++++-----
 website/versioned_docs/version-0.11.0/compaction.md           |  4 +++-
 website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md | 10 +++++-----
 4 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index fe679f4ac9..9d73e31bd5 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -135,4 +135,6 @@ Offline compaction needs to submit the Flink task on the command line. The progr
 | `--path` | `true` | `--` | The path where the target table is stored on Hudi |
 | `--compaction-max-memory` | `false` | `100` | The index map size of log data during compaction, 100 MB by default. If you have enough memory, you can increase this parameter |
 | `--schedule` | `false` | `false` | Whether to schedule a compaction plan. Turning on this parameter while the write process is still writing has a risk of losing data, so ensure that there are no write tasks currently writing data to this table when this parameter is turned on |
-| `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
\ No newline at end of file
+| `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
+| `--service` | `false` | `false` | Whether to start a monitoring service that checks for and schedules new compaction tasks at the configured interval. |
+| `--min-compaction-interval-seconds` | `false` | `600(s)` | The checking interval for service mode, by default 10 minutes. |
\ No newline at end of file
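For readers of the table above, a minimal sketch of invoking the offline compactor in service mode could look like the following; the bundle jar name and table path are placeholders, and `HoodieFlinkCompactor` (the class documented earlier in compaction.md) may accept flags slightly differently depending on the release:

```bash
# Submit the offline compactor in service mode: it keeps running,
# checks for new compaction plans every --min-compaction-interval-seconds,
# and schedules/executes them. Jar name and table path are illustrative.
./bin/flink run -c org.apache.hudi.sink.compact.HoodieFlinkCompactor \
  lib/hudi-flink-bundle.jar \
  --path hdfs://nameservice/path/to/table \
  --service \
  --min-compaction-interval-seconds 600
```

Without `--service`, the compactor runs a single compaction round and exits, which suits cron-style scheduling instead.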
diff --git a/website/docs/hoodie_deltastreamer.md b/website/docs/hoodie_deltastreamer.md
index 6f2c80d5cf..2efa2aa416 100644
--- a/website/docs/hoodie_deltastreamer.md
+++ b/website/docs/hoodie_deltastreamer.md
@@ -369,8 +369,6 @@ We recommend two ways for syncing CDC data into Hudi:
 
 :::note
 - If the upstream data cannot guarantee the order, you need to specify option `write.precombine.field` explicitly;
-- The MOR table can not handle DELETEs in event time sequence now, thus causing data loss. You better switch on the changelog mode through
-  option `changelog.enabled`.
 :::
 
 ### Bulk Insert
@@ -401,8 +399,8 @@ will rollover to the new file handle. Finally, `the number of files` >= [`write.
 |  -----------  | -------  | ------- | ------- |
 | `write.operation` | `true` | `upsert` | Set as `bulk_insert` to enable this function  |
 | `write.tasks`  |  `false`  | `4` | The parallelism of `bulk_insert`, `the number of files` >= [`write.bucket_assign.tasks`](/docs/configurations#writebucket_assigntasks) |
-| `write.bulk_insert.shuffle_by_partition` | `false` | `true` | Whether to shuffle data according to the partition field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
-| `write.bulk_insert.sort_by_partition` | `false`  | `true` | Whether to sort data according to the partition field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
+| `write.bulk_insert.shuffle_input` | `false` | `true` | Whether to shuffle data according to the input field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
+| `write.bulk_insert.sort_input` | `false`  | `true` | Whether to sort data according to the input field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
 | `write.sort.memory` | `false` | `128` | Available managed memory of the sort operator, `128` MB by default |
 
 ### Index Bootstrap
@@ -495,7 +493,9 @@ value as `earliest` if you want to consume all the history data set.
 
 :::note
 When option `read.streaming.skip_compaction` turns on and the streaming reader lags behind by commits of number
-`clean.retain_commits`, the data loss may occur.
+`clean.retain_commits`, data loss may occur. The compaction keeps the original instant time as the per-record metadata,
+so the streaming reader would read and then skip the whole base file if its logs have already been consumed. For efficiency,
+option `read.streaming.skip_compaction` is still suggested to be `true`.
 :::
 
 ### Incremental Query
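As a reading aid for the renamed bulk-insert options in the diff above, a minimal Flink SQL sketch is shown here; the table schema, name, and path are hypothetical, and only the option keys are taken from the docs:

```sql
-- Hypothetical Hudi sink table; only the option names come from the docs above.
CREATE TABLE hudi_sink (
  id BIGINT,
  name STRING,
  `partition` STRING
) PARTITIONED BY (`partition`) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://nameservice/path/to/table',
  'write.operation' = 'bulk_insert',
  -- renamed in 0.11; formerly write.bulk_insert.shuffle_by_partition
  'write.bulk_insert.shuffle_input' = 'true',
  -- renamed in 0.11; formerly write.bulk_insert.sort_by_partition
  'write.bulk_insert.sort_input' = 'true'
);
```

Pipelines upgrading to 0.11 that set the old `*_by_partition` keys would need to switch to the new names.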
diff --git a/website/versioned_docs/version-0.11.0/compaction.md b/website/versioned_docs/version-0.11.0/compaction.md
index fe679f4ac9..9d73e31bd5 100644
--- a/website/versioned_docs/version-0.11.0/compaction.md
+++ b/website/versioned_docs/version-0.11.0/compaction.md
@@ -135,4 +135,6 @@ Offline compaction needs to submit the Flink task on the command line. The progr
 | `--path` | `true` | `--` | The path where the target table is stored on Hudi |
 | `--compaction-max-memory` | `false` | `100` | The index map size of log data during compaction, 100 MB by default. If you have enough memory, you can increase this parameter |
 | `--schedule` | `false` | `false` | Whether to schedule a compaction plan. Turning on this parameter while the write process is still writing has a risk of losing data, so ensure that there are no write tasks currently writing data to this table when this parameter is turned on |
-| `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
\ No newline at end of file
+| `--seq` | `false` | `LIFO` | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
+| `--service` | `false` | `false` | Whether to start a monitoring service that checks for and schedules new compaction tasks at the configured interval. |
+| `--min-compaction-interval-seconds` | `false` | `600(s)` | The checking interval for service mode, by default 10 minutes. |
\ No newline at end of file
diff --git a/website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md b/website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md
index 6f2c80d5cf..2efa2aa416 100644
--- a/website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md
+++ b/website/versioned_docs/version-0.11.0/hoodie_deltastreamer.md
@@ -369,8 +369,6 @@ We recommend two ways for syncing CDC data into Hudi:
 
 :::note
 - If the upstream data cannot guarantee the order, you need to specify option `write.precombine.field` explicitly;
-- The MOR table can not handle DELETEs in event time sequence now, thus causing data loss. You better switch on the changelog mode through
-  option `changelog.enabled`.
 :::
 
 ### Bulk Insert
@@ -401,8 +399,8 @@ will rollover to the new file handle. Finally, `the number of files` >= [`write.
 |  -----------  | -------  | ------- | ------- |
 | `write.operation` | `true` | `upsert` | Set as `bulk_insert` to enable this function  |
 | `write.tasks`  |  `false`  | `4` | The parallelism of `bulk_insert`, `the number of files` >= [`write.bucket_assign.tasks`](/docs/configurations#writebucket_assigntasks) |
-| `write.bulk_insert.shuffle_by_partition` | `false` | `true` | Whether to shuffle data according to the partition field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
-| `write.bulk_insert.sort_by_partition` | `false`  | `true` | Whether to sort data according to the partition field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
+| `write.bulk_insert.shuffle_input` | `false` | `true` | Whether to shuffle data according to the input field before writing. Enabling this option will reduce the number of small files, but there may be a risk of data skew |
+| `write.bulk_insert.sort_input` | `false`  | `true` | Whether to sort data according to the input field before writing. Enabling this option will reduce the number of small files when a write task writes multiple partitions |
 | `write.sort.memory` | `false` | `128` | Available managed memory of the sort operator, `128` MB by default |
 
 ### Index Bootstrap
@@ -495,7 +493,9 @@ value as `earliest` if you want to consume all the history data set.
 
 :::note
 When option `read.streaming.skip_compaction` turns on and the streaming reader lags behind by commits of number
-`clean.retain_commits`, the data loss may occur.
+`clean.retain_commits`, data loss may occur. The compaction keeps the original instant time as the per-record metadata,
+so the streaming reader would read and then skip the whole base file if its logs have already been consumed. For efficiency,
+option `read.streaming.skip_compaction` is still suggested to be `true`.
 :::
 
 ### Incremental Query
