This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 3abb2f10634d docs: update compaction, clustering, and deployment docs (#17886)
3abb2f10634d is described below
commit 3abb2f10634dc3fa6a48d0e7b04b65f3ed5a0bf5
Author: Shiyan Xu <[email protected]>
AuthorDate: Thu Jan 15 00:46:32 2026 -0600
docs: update compaction, clustering, and deployment docs (#17886)
---
website/docs/clustering.md | 38 ++++++++++++++++++
website/docs/compaction.md | 20 ++++++++++
website/docs/deployment.md | 44 ++++++++++++---------
.../assets/images/upgrade-to-1.0/upgrade1.0-1.png | Bin 0 -> 69638 bytes
.../assets/images/upgrade-to-1.0/upgrade1.0-2.png | Bin 0 -> 68195 bytes
.../assets/images/upgrade-to-1.0/upgrade1.0-3.png | Bin 0 -> 66175 bytes
.../assets/images/upgrade-to-1.0/upgrade1.0-4.png | Bin 0 -> 67226 bytes
website/versioned_docs/version-1.1.1/deployment.md | 44 ++++++++++++---------
8 files changed, 108 insertions(+), 38 deletions(-)
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 7c29da609dd3..5ec16c6bc3f7 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -238,6 +238,19 @@ The appropriate mode can be specified using `-mode` or `-m` option. There are th
2. `execute`: Execute a clustering plan at a particular instant. If no instant-time is specified, HoodieClusteringJob will execute for the earliest instant on the Hudi timeline.
3. `scheduleAndExecute`: Make a clustering plan first and execute that plan immediately.
+#### Available Options
+
+In addition to the basic mode options, HoodieClusteringJob supports the following retry and timeout options (effective in `scheduleAndExecute` mode):
+
+| Option Name | Short Flag | Default | Description |
+|---|---|---|---|
+| `--retry-last-failed-job` | `-rc` | `false` | When set to true, checks, rolls back, and executes the last failed clustering plan instead of planning a new clustering job directly. This is useful for recovering from previous failures. |
+| `--job-max-processing-time-ms` | `-jt` | `0` | Maximum processing time in milliseconds before considering a clustering job as failed. If this time is exceeded and the job is still unfinished, Hudi will consider the job as failed and relaunch it (when used with `--retry-last-failed-job`). A value of 0 or negative disables the timeout check. |
+
+:::note
+These retry options are only effective when using `--mode scheduleAndExecute`. The `--retry-last-failed-job` option requires `--job-max-processing-time-ms` to be set to a positive value to detect stale inflight instants.
+:::
+
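As an illustration, a `scheduleAndExecute` run with retry enabled might be submitted as follows. The jar path, properties file, table location, and memory setting below are placeholders, not prescriptions:

```bash
# Hypothetical paths and names -- adjust for your deployment.
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  /path/to/hudi-utilities-bundle.jar \
  --props /path/to/clusteringjob.properties \
  --mode scheduleAndExecute \
  --base-path /path/to/hudi_table \
  --table-name hudi_table \
  --spark-memory 1g \
  --retry-last-failed-job \
  --job-max-processing-time-ms 3600000
```

With these flags, a previously inflight clustering instant older than an hour is rolled back and re-executed instead of a new plan being scheduled.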
Note that to run this job while the original writer is still running, please enable multi-writing:
```properties
@@ -342,6 +355,31 @@ def structuredStreamingWithClustering(): Unit = {
}
```
+## Flink Offline Clustering
+
+Offline clustering for Flink needs to be submitted as a Flink job on the command line. The program entry point in `hudi-flink-bundle.jar` is `org.apache.hudi.sink.clustering.HoodieFlinkClusteringJob`.
+
+```bash
+# Command line
+./bin/flink run -c org.apache.hudi.sink.clustering.HoodieFlinkClusteringJob lib/hudi-flink-bundle.jar --path hdfs://xxx:9000/table
+```
+
+### Options
+
+| Option Name | Default | Description |
+|---|---|---|
+| `--path` | n/a **(Required)** | The path where the target Hudi table is stored |
+| `--schedule` | `false` (Optional) | Whether to schedule a clustering plan. When the write process is still writing, turning on this parameter risks losing data; ensure that no write tasks are currently writing data to this table before enabling it |
+| `--service` | `false` (Optional) | Whether to start a monitoring service that checks and schedules new clustering tasks at the configured interval |
+| `--min-clustering-interval-seconds` | `600(s)` (Optional) | The checking interval for service mode, by default 10 minutes |
+| `--retry` | `0` (Optional) | Number of retries for the clustering operation. Only effective in single-run mode (not service mode). Default is 0 (no retry) |
+| `--retry-last-failed-job` | `false` (Optional) | Check and retry the last failed clustering job if the inflight instant exceeds the max processing time. Only effective in single-run mode. Requires `--job-max-processing-time-ms` to be set to a positive value |
+| `--job-max-processing-time-ms` | `0` (Optional) | Maximum processing time in milliseconds before considering a clustering job as failed. Used with `--retry-last-failed-job`. Default 0 means no timeout check |
+
+:::note
+The retry options (`--retry`, `--retry-last-failed-job`, `--job-max-processing-time-ms`) are only effective in single-run mode, not in service mode. Service mode has implicit retry semantics via its continuous monitoring loop. A warning will be logged if `--retry-last-failed-job` is enabled but `--job-max-processing-time-ms` is not set to a positive value.
+:::
+
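For example, a single-run invocation with the retry options enabled might look as follows. The table path and timeout value are placeholders, and the exact flag syntax should be verified against your Hudi version:

```bash
# Hypothetical path and timeout; single-run mode, i.e. no --service flag.
./bin/flink run -c org.apache.hudi.sink.clustering.HoodieFlinkClusteringJob \
  lib/hudi-flink-bundle.jar \
  --path hdfs://xxx:9000/table \
  --retry 2 \
  --retry-last-failed-job \
  --job-max-processing-time-ms 3600000
```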
## Java Client
Clustering is also supported via Java client. Plan strategy `org.apache.hudi.client.clustering.plan.strategy.JavaSizeBasedClusteringPlanStrategy`
diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index 55aae1f4697f..89b9214f0bd9 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -220,6 +220,19 @@ spark-submit --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.2,or
Note, the `instant-time` parameter is now optional for the Hudi Compactor Utility. If using the utility without `--instant-time`, the spark-submit will execute the earliest scheduled compaction on the Hudi timeline.
+##### Available Options
+
+The HoodieCompactor utility supports the following retry and timeout options (effective in `scheduleAndExecute` mode):
+
+| Option Name | Short Flag | Default | Description |
+|---|---|---|---|
+| `--retry-last-failed-job` | `-rc` | `false` | When set to true, checks, rolls back, and executes the last failed compaction plan instead of planning a new compaction job directly. This is useful for recovering from previous failures. |
+| `--job-max-processing-time-ms` | `-jt` | `0` | Maximum processing time in milliseconds before considering a compaction job as failed. If this time is exceeded and the job is still unfinished, Hudi will consider the job as failed and relaunch it (when used with `--retry-last-failed-job`). A value of 0 or negative disables the timeout check. |
+
+:::note
+These retry options are only effective when using `--mode scheduleAndExecute`. The `--retry-last-failed-job` option requires `--job-max-processing-time-ms` to be set to a positive value to detect stale inflight instants.
+:::
+
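As an illustration, a `scheduleAndExecute` compaction run with retry enabled might be submitted like this. The jar path, schema file, and table location are placeholders:

```bash
# Hypothetical paths and names -- adjust for your deployment.
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  /path/to/hudi-utilities-bundle.jar \
  --base-path /path/to/hudi_table \
  --table-name hudi_table \
  --schema-file /path/to/table.avsc \
  --mode scheduleAndExecute \
  --retry-last-failed-job \
  --job-max-processing-time-ms 3600000
```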
#### Hudi CLI
Hudi CLI is yet another way to execute specific compactions asynchronously. Here is an example and you can read more in the [deployment guide](cli.md#compactions)
@@ -251,6 +264,13 @@ Offline compaction needs to submit the Flink task on the command line. The progr
| `--seq` | `LIFO` (Optional) | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
| `--service` | `false` (Optional) | Whether to start a monitoring service that checks and schedules new compaction tasks at the configured interval. |
| `--min-compaction-interval-seconds` | `600(s)` (Optional) | The checking interval for service mode, by default 10 minutes. |
+| `--retry` | `0` (Optional) | Number of retries for the compaction operation. Only effective in single-run mode (not service mode). Default is 0 (no retry). |
+| `--retry-last-failed-job` | `false` (Optional) | Check and retry the last failed compaction job if the inflight instant exceeds the max processing time. Only effective in single-run mode. Requires `--job-max-processing-time-ms` to be set to a positive value. |
+| `--job-max-processing-time-ms` | `0` (Optional) | Maximum processing time in milliseconds before considering a compaction job as failed. Used with `--retry-last-failed-job`. Default 0 means no timeout check. |
+
+:::note
+The retry options (`--retry`, `--retry-last-failed-job`, `--job-max-processing-time-ms`) are only effective in single-run mode, not in service mode. Service mode has implicit retry semantics via its continuous monitoring loop. A warning will be logged if `--retry-last-failed-job` is enabled but `--job-max-processing-time-ms` is not set to a positive value.
+:::
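The timeout semantics can be sketched in a few lines of shell (this is not Hudi code; the timestamps are invented for illustration). An inflight instant is considered stale only when the configured timeout is positive and the elapsed processing time exceeds it:

```shell
# Sketch of the staleness check behind --job-max-processing-time-ms.
# All timestamps are hypothetical epoch milliseconds.
job_max_ms=3600000              # --job-max-processing-time-ms (1 hour)
instant_start_ms=1700000000000  # when the inflight instant was created
now_ms=1700003700000            # "current" time, 3700000 ms later

elapsed_ms=$(( now_ms - instant_start_ms ))
if [ "$job_max_ms" -gt 0 ] && [ "$elapsed_ms" -gt "$job_max_ms" ]; then
  stale=true    # candidate for rollback and re-execution
else
  stale=false   # a timeout of 0 or less disables the check entirely
fi
echo "stale=$stale"
```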
## Related Resources
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 9fa36d8181b4..51f92e40d407 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -6,19 +6,19 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
-This section provides all the help you need to deploy and operate Hudi tables
at scale.
+This section provides all the help you need to deploy and operate Hudi tables
at scale.
Specifically, we will cover the following aspects.
- - [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
- - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
- - [Downgrading Versions](#downgrading) : Reverting back to an older version of Hudi
- - [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
-
+- [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
+- [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
+- [Downgrading Versions](#downgrading) : Reverting back to an older version of Hudi
+- [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
+
## Deploying
All in all, Hudi deploys with no long running servers or additional infrastructure cost to your data lake. In fact, Hudi pioneered this model of building a transactional distributed storage layer using existing infrastructure and it's heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (Hudi Streamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview).
-Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.
A typical Hudi data ingestion can be achieved in 2 modes. In a single run mode, Hudi ingestion reads the next batch of data, ingests it to the Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.
@@ -26,18 +26,18 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### Hudi Streamer
-[Hudi Streamer](hoodie_streaming_ingestion.md#hudi-streamer) is the standalone
utility to incrementally pull upstream changes
+[Hudi Streamer](hoodie_streaming_ingestion.md#hudi-streamer) is the standalone
utility to incrementally pull upstream changes
from varied sources such as DFS, Kafka and DB Changelogs and ingest them to
hudi tables. It runs as a spark application in two modes.
To use Hudi Streamer in Spark, the `hudi-utilities-slim-bundle` and Hudi Spark
bundle are required, by adding
`--packages
org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1`
to the `spark-submit` command.
- - **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion
round which includes incrementally pulling events from upstream sources and
ingesting them to hudi table. Background operations like cleaning old file
versions and archiving hoodie timeline are automatically executed as part of
the run. For Merge-On-Read tables, Compaction is also run inline as part of
ingestion unless disabled by passing the flag "--disable-compaction". By
default, Compaction is run inline for eve [...]
+- **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion round
which includes incrementally pulling events from upstream sources and ingesting
them to hudi table. Background operations like cleaning old file versions and
archiving hoodie timeline are automatically executed as part of the run. For
Merge-On-Read tables, Compaction is also run inline as part of ingestion unless
disabled by passing the flag "--disable-compaction". By default, Compaction is
run inline for ever [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
-```java
-[hoodie]$ spark-submit \
+```shell
+spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
--master yarn \
--deploy-mode cluster \
@@ -81,12 +81,12 @@ Here is an example invocation for reading from kafka topic in a single-run mode
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```
- - **Continuous Mode** : Here, Hudi Streamer runs an infinite loop with each
round performing one ingestion round as described in **Run Once Mode**. The
frequency of data ingestion can be controlled by the configuration
"--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in
asynchronous fashion concurrently with ingestion unless disabled by passing the
flag "--disable-compaction". Every ingestion run triggers a compaction request
asynchronously and this frequency [...]
+- **Continuous Mode** : Here, Hudi Streamer runs an infinite loop with each
round performing one ingestion round as described in **Run Once Mode**. The
frequency of data ingestion can be controlled by the configuration
"--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in
asynchronous fashion concurrently with ingestion unless disabled by passing the
flag "--disable-compaction". Every ingestion run triggers a compaction request
asynchronously and this frequency c [...]
Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
-```java
-[hoodie]$ spark-submit \
+```shell
+spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
--master yarn \
--deploy-mode cluster \
@@ -133,7 +133,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Batch Writes](writing_data.md#spark-datasource-api), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
+As described in [Batch Writes](writing_data.md#spark-datasource-api), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
Here is an example invocation using spark datasource
@@ -148,13 +148,13 @@ inputDF.write()
.mode(SaveMode.Append)
.save(basePath);
```
-
-## Upgrading
-New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
+## Upgrading
+
+New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
At the end of the day, Hudi is a storage system and with that comes a lot of responsibilities, which we take seriously.
-As general guidelines,
+As general guidelines,
- We strive to keep all changes backwards compatible (i.e. new code can read old data/timeline files) and when we cannot, we will provide upgrade/downgrade tools via the CLI
- We cannot always guarantee forward compatibility (i.e. old code being able to read data/timeline files written by a greater version). This is generally the norm, since no new features can be built otherwise.
@@ -175,10 +175,14 @@ following steps:
0.x readers will continue to work; writers can also be readers and will continue to read both tv=6.
a. Set `hoodie.write.auto.upgrade` to false.
b. Set `hoodie.metadata.enable` to false.
+
3. Upgrade table services to 1.x with tv=6, and resume operations.
+
4. Upgrade all remaining readers to 1.x, with tv=6.
+
5. Redeploy writers with tv=8; table services and readers will adapt/pick up tv=8 on the fly.
6. Once all readers and writers are in 1.x, we are good to enable any new features, including metadata, with 1.x tables.
+
During the upgrade, metadata table will not be updated and it will be behind the data table. It is important to note that metadata table will be updated only when the writer is upgraded to tv=8. So, even the readers should keep metadata
@@ -198,6 +202,7 @@ CLI to downgrade a table from a higher version to lower version. Let's consider
0.12.0, upgrade it to 0.13.0 and then downgrade it via Hudi CLI.
Launch spark shell with Hudi 0.11.0 version.
+
```shell
spark-shell \
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
@@ -207,6 +212,7 @@ spark-shell \
```
Create a hudi table by using the scala script below.
+
```scala
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
diff --git a/website/static/assets/images/upgrade-to-1.0/upgrade1.0-1.png b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-1.png
new file mode 100644
index 000000000000..d9516b2138ea
Binary files /dev/null and b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-1.png differ
diff --git a/website/static/assets/images/upgrade-to-1.0/upgrade1.0-2.png b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-2.png
new file mode 100644
index 000000000000..a619adaa6812
Binary files /dev/null and b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-2.png differ
diff --git a/website/static/assets/images/upgrade-to-1.0/upgrade1.0-3.png b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-3.png
new file mode 100644
index 000000000000..6f52f37543c2
Binary files /dev/null and b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-3.png differ
diff --git a/website/static/assets/images/upgrade-to-1.0/upgrade1.0-4.png b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-4.png
new file mode 100644
index 000000000000..481c294db547
Binary files /dev/null and b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-4.png differ
diff --git a/website/versioned_docs/version-1.1.1/deployment.md b/website/versioned_docs/version-1.1.1/deployment.md
index 9fa36d8181b4..51f92e40d407 100644
--- a/website/versioned_docs/version-1.1.1/deployment.md
+++ b/website/versioned_docs/version-1.1.1/deployment.md
@@ -6,19 +6,19 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
-This section provides all the help you need to deploy and operate Hudi tables
at scale.
+This section provides all the help you need to deploy and operate Hudi tables
at scale.
Specifically, we will cover the following aspects.
- - [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
- - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
- - [Downgrading Versions](#downgrading) : Reverting back to an older version of Hudi
- - [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
-
+- [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
+- [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
+- [Downgrading Versions](#downgrading) : Reverting back to an older version of Hudi
+- [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
+
## Deploying
All in all, Hudi deploys with no long running servers or additional infrastructure cost to your data lake. In fact, Hudi pioneered this model of building a transactional distributed storage layer using existing infrastructure and it's heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (Hudi Streamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview).
-Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.
A typical Hudi data ingestion can be achieved in 2 modes. In a single run mode, Hudi ingestion reads the next batch of data, ingests it to the Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.
@@ -26,18 +26,18 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### Hudi Streamer
-[Hudi Streamer](hoodie_streaming_ingestion.md#hudi-streamer) is the standalone
utility to incrementally pull upstream changes
+[Hudi Streamer](hoodie_streaming_ingestion.md#hudi-streamer) is the standalone
utility to incrementally pull upstream changes
from varied sources such as DFS, Kafka and DB Changelogs and ingest them to
hudi tables. It runs as a spark application in two modes.
To use Hudi Streamer in Spark, the `hudi-utilities-slim-bundle` and Hudi Spark
bundle are required, by adding
`--packages
org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1`
to the `spark-submit` command.
- - **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion
round which includes incrementally pulling events from upstream sources and
ingesting them to hudi table. Background operations like cleaning old file
versions and archiving hoodie timeline are automatically executed as part of
the run. For Merge-On-Read tables, Compaction is also run inline as part of
ingestion unless disabled by passing the flag "--disable-compaction". By
default, Compaction is run inline for eve [...]
+- **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion round
which includes incrementally pulling events from upstream sources and ingesting
them to hudi table. Background operations like cleaning old file versions and
archiving hoodie timeline are automatically executed as part of the run. For
Merge-On-Read tables, Compaction is also run inline as part of ingestion unless
disabled by passing the flag "--disable-compaction". By default, Compaction is
run inline for ever [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
-```java
-[hoodie]$ spark-submit \
+```shell
+spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
--master yarn \
--deploy-mode cluster \
@@ -81,12 +81,12 @@ Here is an example invocation for reading from kafka topic in a single-run mode
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```
- - **Continuous Mode** : Here, Hudi Streamer runs an infinite loop with each
round performing one ingestion round as described in **Run Once Mode**. The
frequency of data ingestion can be controlled by the configuration
"--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in
asynchronous fashion concurrently with ingestion unless disabled by passing the
flag "--disable-compaction". Every ingestion run triggers a compaction request
asynchronously and this frequency [...]
+- **Continuous Mode** : Here, Hudi Streamer runs an infinite loop with each
round performing one ingestion round as described in **Run Once Mode**. The
frequency of data ingestion can be controlled by the configuration
"--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in
asynchronous fashion concurrently with ingestion unless disabled by passing the
flag "--disable-compaction". Every ingestion run triggers a compaction request
asynchronously and this frequency c [...]
Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
-```java
-[hoodie]$ spark-submit \
+```shell
+spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
--master yarn \
--deploy-mode cluster \
@@ -133,7 +133,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Batch Writes](writing_data.md#spark-datasource-api), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
+As described in [Batch Writes](writing_data.md#spark-datasource-api), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
Here is an example invocation using spark datasource
@@ -148,13 +148,13 @@ inputDF.write()
.mode(SaveMode.Append)
.save(basePath);
```
-
-## Upgrading
-New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
+## Upgrading
+
+New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
At the end of the day, Hudi is a storage system and with that comes a lot of responsibilities, which we take seriously.
-As general guidelines,
+As general guidelines,
- We strive to keep all changes backwards compatible (i.e. new code can read old data/timeline files) and when we cannot, we will provide upgrade/downgrade tools via the CLI
- We cannot always guarantee forward compatibility (i.e. old code being able to read data/timeline files written by a greater version). This is generally the norm, since no new features can be built otherwise.
@@ -175,10 +175,14 @@ following steps:
0.x readers will continue to work; writers can also be readers and will continue to read both tv=6.
a. Set `hoodie.write.auto.upgrade` to false.
b. Set `hoodie.metadata.enable` to false.
+
3. Upgrade table services to 1.x with tv=6, and resume operations.
+
4. Upgrade all remaining readers to 1.x, with tv=6.
+
5. Redeploy writers with tv=8; table services and readers will adapt/pick up tv=8 on the fly.
6. Once all readers and writers are in 1.x, we are good to enable any new features, including metadata, with 1.x tables.
+
During the upgrade, metadata table will not be updated and it will be behind the data table. It is important to note that metadata table will be updated only when the writer is upgraded to tv=8. So, even the readers should keep metadata
@@ -198,6 +202,7 @@ CLI to downgrade a table from a higher version to lower version. Let's consider
0.12.0, upgrade it to 0.13.0 and then downgrade it via Hudi CLI.
Launch spark shell with Hudi 0.11.0 version.
+
```shell
spark-shell \
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
@@ -207,6 +212,7 @@ spark-shell \
```
Create a hudi table by using the scala script below.
+
```scala
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._