This is an automated email from the ASF dual-hosted git repository.
xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 3abb2f10634d docs: update compaction, clustering, and deployment docs (#17886)
3abb2f10634d is described below
commit 3abb2f10634dc3fa6a48d0e7b04b65f3ed5a0bf5
Author: Shiyan Xu <[email protected]>
AuthorDate: Thu Jan 15 00:46:32 2026 -0600
docs: update compaction, clustering, and deployment docs (#17886)
---
website/docs/clustering.md | 38 ++++++++++++++++++
website/docs/compaction.md | 20 ++++++++++
website/docs/deployment.md | 44 ++++++++++++---------
.../assets/images/upgrade-to-1.0/upgrade1.0-1.png | Bin 0 -> 69638 bytes
.../assets/images/upgrade-to-1.0/upgrade1.0-2.png | Bin 0 -> 68195 bytes
.../assets/images/upgrade-to-1.0/upgrade1.0-3.png | Bin 0 -> 66175 bytes
.../assets/images/upgrade-to-1.0/upgrade1.0-4.png | Bin 0 -> 67226 bytes
website/versioned_docs/version-1.1.1/deployment.md | 44 ++++++++++++---------
8 files changed, 108 insertions(+), 38 deletions(-)
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index 7c29da609dd3..5ec16c6bc3f7 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -238,6 +238,19 @@ The appropriate mode can be specified using `-mode` or `-m` option. There are th
2. `execute`: Execute a clustering plan at a particular instant. If no instant-time is specified, HoodieClusteringJob will execute for the earliest instant on the Hudi timeline.
3. `scheduleAndExecute`: Make a clustering plan first and execute that plan immediately.
+#### Available Options
+
+In addition to the basic mode options, HoodieClusteringJob supports the following retry and timeout options (effective in `scheduleAndExecute` mode):
+
+| Option Name | Short Flag | Default | Description |
+|---|---|---|---|
+| `--retry-last-failed-job` | `-rc` | `false` | When set to true, checks, rolls back, and executes the last failed clustering plan instead of planning a new clustering job directly. This is useful for recovering from previous failures. |
+| `--job-max-processing-time-ms` | `-jt` | `0` | Maximum processing time in milliseconds before considering a clustering job as failed. If this time is exceeded and the job is still unfinished, Hudi will consider the job as failed and relaunch it (when used with `--retry-last-failed-job`). A value of 0 or negative disables the timeout check. |
+
+:::note
+These retry options are only effective when using `--mode scheduleAndExecute`. The `--retry-last-failed-job` option requires `--job-max-processing-time-ms` to be set to a positive value to detect stale inflight instants.
+:::
+
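As an illustration, a `scheduleAndExecute` run with retry enabled might be submitted as follows. The jar path, properties file, table location, and memory setting below are placeholders, not prescriptions:

```bash
# Hypothetical paths and names -- adjust for your deployment.
spark-submit \
  --class org.apache.hudi.utilities.HoodieClusteringJob \
  /path/to/hudi-utilities-bundle.jar \
  --props /path/to/clusteringjob.properties \
  --mode scheduleAndExecute \
  --base-path /path/to/hudi_table \
  --table-name hudi_table \
  --spark-memory 1g \
  --retry-last-failed-job \
  --job-max-processing-time-ms 3600000
```

With these flags, a previously inflight clustering instant older than an hour is rolled back and re-executed instead of a new plan being scheduled.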
Note that to run this job while the original writer is still running, please enable multi-writing:
```properties
@@ -342,6 +355,31 @@ def structuredStreamingWithClustering(): Unit = {
}
```
+## Flink Offline Clustering
+
+Offline clustering for Flink needs to be submitted as a Flink job on the command line. The program entry point in `hudi-flink-bundle.jar` is `org.apache.hudi.sink.clustering.HoodieFlinkClusteringJob`.
+
+```bash
+# Command line
+./bin/flink run -c org.apache.hudi.sink.clustering.HoodieFlinkClusteringJob lib/hudi-flink-bundle.jar --path hdfs://xxx:9000/table
+```
+
+### Options
+
+| Option Name | Default | Description |
+|---|---|---|
+| `--path` | n/a **(Required)** | The path where the target Hudi table is stored |
+| `--schedule` | `false` (Optional) | Whether to schedule a clustering plan. When the write process is still writing, turning on this parameter risks losing data; ensure that no write tasks are currently writing data to this table before enabling it |
+| `--service` | `false` (Optional) | Whether to start a monitoring service that checks and schedules new clustering tasks at the configured interval |
+| `--min-clustering-interval-seconds` | `600(s)` (Optional) | The checking interval for service mode, by default 10 minutes |
+| `--retry` | `0` (Optional) | Number of retries for the clustering operation. Only effective in single-run mode (not service mode). Default is 0 (no retry) |
+| `--retry-last-failed-job` | `false` (Optional) | Check and retry the last failed clustering job if the inflight instant exceeds the max processing time. Only effective in single-run mode. Requires `--job-max-processing-time-ms` to be set to a positive value |
+| `--job-max-processing-time-ms` | `0` (Optional) | Maximum processing time in milliseconds before considering a clustering job as failed. Used with `--retry-last-failed-job`. Default 0 means no timeout check |
+
+:::note
+The retry options (`--retry`, `--retry-last-failed-job`, `--job-max-processing-time-ms`) are only effective in single-run mode, not in service mode. Service mode has implicit retry semantics via its continuous monitoring loop. A warning will be logged if `--retry-last-failed-job` is enabled but `--job-max-processing-time-ms` is not set to a positive value.
+:::
+
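For example, a single-run invocation with the retry options enabled might look as follows. The table path and timeout value are placeholders, and the exact flag syntax should be verified against your Hudi version:

```bash
# Hypothetical path and timeout; single-run mode, i.e. no --service flag.
./bin/flink run -c org.apache.hudi.sink.clustering.HoodieFlinkClusteringJob \
  lib/hudi-flink-bundle.jar \
  --path hdfs://xxx:9000/table \
  --retry 2 \
  --retry-last-failed-job \
  --job-max-processing-time-ms 3600000
```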
## Java Client
Clustering is also supported via Java client. Plan strategy `org.apache.hudi.client.clustering.plan.strategy.JavaSizeBasedClusteringPlanStrategy`
diff --git a/website/docs/compaction.md b/website/docs/compaction.md
index 55aae1f4697f..89b9214f0bd9 100644
--- a/website/docs/compaction.md
+++ b/website/docs/compaction.md
@@ -220,6 +220,19 @@ spark-submit --packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.2,or
Note, the `instant-time` parameter is now optional for the Hudi Compactor Utility. If using the utility without `--instant-time`, the spark-submit will execute the earliest scheduled compaction on the Hudi timeline.
+##### Available Options
+
+The HoodieCompactor utility supports the following retry and timeout options (effective in `scheduleAndExecute` mode):
+
+| Option Name | Short Flag | Default | Description |
+|---|---|---|---|
+| `--retry-last-failed-job` | `-rc` | `false` | When set to true, checks, rolls back, and executes the last failed compaction plan instead of planning a new compaction job directly. This is useful for recovering from previous failures. |
+| `--job-max-processing-time-ms` | `-jt` | `0` | Maximum processing time in milliseconds before considering a compaction job as failed. If this time is exceeded and the job is still unfinished, Hudi will consider the job as failed and relaunch it (when used with `--retry-last-failed-job`). A value of 0 or negative disables the timeout check. |
+
+:::note
+These retry options are only effective when using `--mode scheduleAndExecute`. The `--retry-last-failed-job` option requires `--job-max-processing-time-ms` to be set to a positive value to detect stale inflight instants.
+:::
+
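As an illustration, a `scheduleAndExecute` compaction run with retry enabled might be submitted like this. The jar path, schema file, and table location are placeholders:

```bash
# Hypothetical paths and names -- adjust for your deployment.
spark-submit \
  --class org.apache.hudi.utilities.HoodieCompactor \
  /path/to/hudi-utilities-bundle.jar \
  --base-path /path/to/hudi_table \
  --table-name hudi_table \
  --schema-file /path/to/table.avsc \
  --mode scheduleAndExecute \
  --retry-last-failed-job \
  --job-max-processing-time-ms 3600000
```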
#### Hudi CLI
Hudi CLI is yet another way to execute specific compactions asynchronously. Here is an example and you can read more in the [deployment guide](cli.md#compactions)
@@ -251,6 +264,13 @@ Offline compaction needs to submit the Flink task on the command line. The progr
| `--seq` | `LIFO` (Optional) | The order in which compaction tasks are executed. Executing from the latest compaction plan by default. `LIFO`: executing from the latest plan. `FIFO`: executing from the oldest plan. |
| `--service` | `false` (Optional) | Whether to start a monitoring service that checks and schedules new compaction tasks at the configured interval. |
| `--min-compaction-interval-seconds` | `600(s)` (Optional) | The checking interval for service mode, by default 10 minutes. |
+| `--retry` | `0` (Optional) | Number of retries for the compaction operation. Only effective in single-run mode (not service mode). Default is 0 (no retry). |
+| `--retry-last-failed-job` | `false` (Optional) | Check and retry the last failed compaction job if the inflight instant exceeds the max processing time. Only effective in single-run mode. Requires `--job-max-processing-time-ms` to be set to a positive value. |
+| `--job-max-processing-time-ms` | `0` (Optional) | Maximum processing time in milliseconds before considering a compaction job as failed. Used with `--retry-last-failed-job`. Default 0 means no timeout check. |
+
+:::note
+The retry options (`--retry`, `--retry-last-failed-job`, `--job-max-processing-time-ms`) are only effective in single-run mode, not in service mode. Service mode has implicit retry semantics via its continuous monitoring loop. A warning will be logged if `--retry-last-failed-job` is enabled but `--job-max-processing-time-ms` is not set to a positive value.
+:::
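The timeout semantics can be sketched in a few lines of shell (this is not Hudi code; the timestamps are invented for illustration). An inflight instant is considered stale only when the configured timeout is positive and the elapsed processing time exceeds it:

```shell
# Sketch of the staleness check behind --job-max-processing-time-ms.
# All timestamps are hypothetical epoch milliseconds.
job_max_ms=3600000              # --job-max-processing-time-ms (1 hour)
instant_start_ms=1700000000000  # when the inflight instant was created
now_ms=1700003700000            # "current" time, 3700000 ms later

elapsed_ms=$(( now_ms - instant_start_ms ))
if [ "$job_max_ms" -gt 0 ] && [ "$elapsed_ms" -gt "$job_max_ms" ]; then
  stale=true    # candidate for rollback and re-execution
else
  stale=false   # a timeout of 0 or less disables the check entirely
fi
echo "stale=$stale"
```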
## Related Resources
diff --git a/website/docs/deployment.md b/website/docs/deployment.md
index 9fa36d8181b4..51f92e40d407 100644
--- a/website/docs/deployment.md
+++ b/website/docs/deployment.md
@@ -6,19 +6,19 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
-This section provides all the help you need to deploy and operate Hudi tables
at scale.
+This section provides all the help you need to deploy and operate Hudi tables
at scale.
Specifically, we will cover the following aspects.
- - [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
- - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
- - [Downgrading Versions](#downgrading) : Reverting back to an older version of Hudi
- - [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
-
+- [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
+- [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
+- [Downgrading Versions](#downgrading) : Reverting back to an older version of Hudi
+- [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
+
## Deploying
All in all, Hudi deploys with no long running servers or additional infrastructure cost to your data lake. In fact, Hudi pioneered this model of building a transactional distributed storage layer using existing infrastructure and it's heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (Hudi Streamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview).
-Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.
A typical Hudi data ingestion can be achieved in 2 modes. In a single run mode, Hudi ingestion reads the next batch of data, ingests it to the Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.
@@ -26,18 +26,18 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### Hudi Streamer
-[Hudi Streamer](hoodie_streaming_ingestion.md#hudi-streamer) is the standalone
utility to incrementally pull upstream changes
+[Hudi Streamer](hoodie_streaming_ingestion.md#hudi-streamer) is the standalone
utility to incrementally pull upstream changes
from varied sources such as DFS, Kafka and DB Changelogs and ingest them to
hudi tables. It runs as a spark application in two modes.
To use Hudi Streamer in Spark, the `hudi-utilities-slim-bundle` and Hudi Spark
bundle are required, by adding
`--packages
org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1`
to the `spark-submit` command.
- - **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion
round which includes incrementally pulling events from upstream sources and
ingesting them to hudi table. Background operations like cleaning old file
versions and archiving hoodie timeline are automatically executed as part of
the run. For Merge-On-Read tables, Compaction is also run inline as part of
ingestion unless disabled by passing the flag "--disable-compaction". By
default, Compaction is run inline for eve [...]
+- **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion round
which includes incrementally pulling events from upstream sources and ingesting
them to hudi table. Background operations like cleaning old file versions and
archiving hoodie timeline are automatically executed as part of the run. For
Merge-On-Read tables, Compaction is also run inline as part of ingestion unless
disabled by passing the flag "--disable-compaction". By default, Compaction is
run inline for ever [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
-```java
-[hoodie]$ spark-submit \
+```shell
+spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
--master yarn \
--deploy-mode cluster \
@@ -81,12 +81,12 @@ Here is an example invocation for reading from kafka topic in a single-run mode
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```
- - **Continuous Mode** : Here, Hudi Streamer runs an infinite loop with each
round performing one ingestion round as described in **Run Once Mode**. The
frequency of data ingestion can be controlled by the configuration
"--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in
asynchronous fashion concurrently with ingestion unless disabled by passing the
flag "--disable-compaction". Every ingestion run triggers a compaction request
asynchronously and this frequency [...]
+- **Continuous Mode** : Here, Hudi Streamer runs an infinite loop with each
round performing one ingestion round as described in **Run Once Mode**. The
frequency of data ingestion can be controlled by the configuration
"--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in
asynchronous fashion concurrently with ingestion unless disabled by passing the
flag "--disable-compaction". Every ingestion run triggers a compaction request
asynchronously and this frequency c [...]
Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
-```java
-[hoodie]$ spark-submit \
+```shell
+spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
--master yarn \
--deploy-mode cluster \
@@ -133,7 +133,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Batch Writes](writing_data.md#spark-datasource-api), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
+As described in [Batch Writes](writing_data.md#spark-datasource-api), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
Here is an example invocation using spark datasource
@@ -148,13 +148,13 @@ inputDF.write()
.mode(SaveMode.Append)
.save(basePath);
```
-
-## Upgrading
-New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
+## Upgrading
+
+New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
At the end of the day, Hudi is a storage system and with that comes a lot of responsibilities, which we take seriously.
-As general guidelines,
+As general guidelines,
- We strive to keep all changes backwards compatible (i.e. new code can read old data/timeline files) and when we cannot, we will provide upgrade/downgrade tools via the CLI
- We cannot always guarantee forward compatibility (i.e. old code being able to read data/timeline files written by a greater version). This is generally the norm, since no new features can be built otherwise.
@@ -175,10 +175,14 @@ following steps:
0.x readers will continue to work; writers can also be readers and will continue to read both tv=6.
a. Set `hoodie.write.auto.upgrade` to false.
b. Set `hoodie.metadata.enable` to false.
+
3. Upgrade table services to 1.x with tv=6, and resume operations.
+
4. Upgrade all remaining readers to 1.x, with tv=6.
+
5. Redeploy writers with tv=8; table services and readers will adapt/pick up tv=8 on the fly.
6. Once all readers and writers are in 1.x, we are good to enable any new features, including metadata, with 1.x tables.
+
During the upgrade, metadata table will not be updated and it will be behind the data table. It is important to note that metadata table will be updated only when the writer is upgraded to tv=8. So, even the readers should keep metadata
@@ -198,6 +202,7 @@ CLI to downgrade a table from a higher version to lower version. Let's consider
0.12.0, upgrade it to 0.13.0 and then downgrade it via Hudi CLI.
Launch spark shell with Hudi 0.11.0 version.
+
```shell
spark-shell \
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
@@ -207,6 +212,7 @@ spark-shell \
```
Create a hudi table by using the scala script below.
+
```scala
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
diff --git a/website/static/assets/images/upgrade-to-1.0/upgrade1.0-1.png b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-1.png
new file mode 100644
index 000000000000..d9516b2138ea
Binary files /dev/null and b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-1.png differ
diff --git a/website/static/assets/images/upgrade-to-1.0/upgrade1.0-2.png b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-2.png
new file mode 100644
index 000000000000..a619adaa6812
Binary files /dev/null and b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-2.png differ
diff --git a/website/static/assets/images/upgrade-to-1.0/upgrade1.0-3.png b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-3.png
new file mode 100644
index 000000000000..6f52f37543c2
Binary files /dev/null and b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-3.png differ
diff --git a/website/static/assets/images/upgrade-to-1.0/upgrade1.0-4.png b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-4.png
new file mode 100644
index 000000000000..481c294db547
Binary files /dev/null and b/website/static/assets/images/upgrade-to-1.0/upgrade1.0-4.png differ
diff --git a/website/versioned_docs/version-1.1.1/deployment.md b/website/versioned_docs/version-1.1.1/deployment.md
index 9fa36d8181b4..51f92e40d407 100644
--- a/website/versioned_docs/version-1.1.1/deployment.md
+++ b/website/versioned_docs/version-1.1.1/deployment.md
@@ -6,19 +6,19 @@ toc: true
last_modified_at: 2019-12-30T15:59:57-04:00
---
-This section provides all the help you need to deploy and operate Hudi tables
at scale.
+This section provides all the help you need to deploy and operate Hudi tables
at scale.
Specifically, we will cover the following aspects.
- - [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
- - [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
- - [Downgrading Versions](#downgrading) : Reverting back to an older version of Hudi
- - [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
-
+- [Deployment Model](#deploying) : How various Hudi components are deployed and managed.
+- [Upgrading Versions](#upgrading) : Picking up new releases of Hudi, guidelines and general best-practices.
+- [Downgrading Versions](#downgrading) : Reverting back to an older version of Hudi
+- [Migrating to Hudi](#migrating) : How to migrate your existing tables to Apache Hudi.
+
## Deploying
All in all, Hudi deploys with no long running servers or additional infrastructure cost to your data lake. In fact, Hudi pioneered this model of building a transactional distributed storage layer using existing infrastructure and it's heartening to see other systems adopting similar approaches as well. Hudi writing is done via Spark jobs (Hudi Streamer or custom Spark datasource jobs), deployed per standard Apache Spark [recommendations](https://spark.apache.org/docs/latest/cluster-overview).
-Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.
+Querying Hudi tables happens via libraries installed into Apache Hive, Apache
Spark or PrestoDB and hence no additional infrastructure is necessary.
A typical Hudi data ingestion can be achieved in 2 modes. In a single run mode, Hudi ingestion reads the next batch of data, ingests it to the Hudi table and exits. In continuous mode, Hudi ingestion runs as a long-running service executing ingestion in a loop.
@@ -26,18 +26,18 @@ With Merge_On_Read Table, Hudi ingestion needs to also take care of compacting d
### Hudi Streamer
-[Hudi Streamer](hoodie_streaming_ingestion.md#hudi-streamer) is the standalone
utility to incrementally pull upstream changes
+[Hudi Streamer](hoodie_streaming_ingestion.md#hudi-streamer) is the standalone
utility to incrementally pull upstream changes
from varied sources such as DFS, Kafka and DB Changelogs and ingest them to
hudi tables. It runs as a spark application in two modes.
To use Hudi Streamer in Spark, the `hudi-utilities-slim-bundle` and Hudi Spark
bundle are required, by adding
`--packages
org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1`
to the `spark-submit` command.
- - **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion
round which includes incrementally pulling events from upstream sources and
ingesting them to hudi table. Background operations like cleaning old file
versions and archiving hoodie timeline are automatically executed as part of
the run. For Merge-On-Read tables, Compaction is also run inline as part of
ingestion unless disabled by passing the flag "--disable-compaction". By
default, Compaction is run inline for eve [...]
+- **Run Once Mode** : In this mode, Hudi Streamer performs one ingestion round
which includes incrementally pulling events from upstream sources and ingesting
them to hudi table. Background operations like cleaning old file versions and
archiving hoodie timeline are automatically executed as part of the run. For
Merge-On-Read tables, Compaction is also run inline as part of ingestion unless
disabled by passing the flag "--disable-compaction". By default, Compaction is
run inline for ever [...]
Here is an example invocation for reading from kafka topic in a single-run mode and writing to Merge On Read table type in a yarn cluster.
-```java
-[hoodie]$ spark-submit \
+```shell
+spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
--master yarn \
--deploy-mode cluster \
@@ -81,12 +81,12 @@ Here is an example invocation for reading from kafka topic in a single-run mode
--schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider
```
- - **Continuous Mode** : Here, Hudi Streamer runs an infinite loop with each
round performing one ingestion round as described in **Run Once Mode**. The
frequency of data ingestion can be controlled by the configuration
"--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in
asynchronous fashion concurrently with ingestion unless disabled by passing the
flag "--disable-compaction". Every ingestion run triggers a compaction request
asynchronously and this frequency [...]
+- **Continuous Mode** : Here, Hudi Streamer runs an infinite loop with each
round performing one ingestion round as described in **Run Once Mode**. The
frequency of data ingestion can be controlled by the configuration
"--min-sync-interval-seconds". For Merge-On-Read tables, Compaction is run in
asynchronous fashion concurrently with ingestion unless disabled by passing the
flag "--disable-compaction". Every ingestion run triggers a compaction request
asynchronously and this frequency c [...]
Here is an example invocation for reading from kafka topic in a continuous mode and writing to Merge On Read table type in a yarn cluster.
-```java
-[hoodie]$ spark-submit \
+```shell
+spark-submit \
--packages org.apache.hudi:hudi-utilities-slim-bundle_2.12:1.0.1,org.apache.hudi:hudi-spark3.5-bundle_2.12:1.0.1 \
--master yarn \
--deploy-mode cluster \
@@ -133,7 +133,7 @@ Here is an example invocation for reading from kafka topic in a continuous mode
### Spark Datasource Writer Jobs
-As described in [Batch Writes](writing_data.md#spark-datasource-api), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
+As described in [Batch Writes](writing_data.md#spark-datasource-api), you can
use spark datasource to ingest to hudi table. This mechanism allows you to
ingest any spark dataframe in Hudi format. Hudi Spark DataSource also supports
spark streaming to ingest a streaming source to Hudi table. For Merge On Read
table types, inline compaction is turned on by default which runs after every
ingestion run. The compaction frequency can be changed by setting the property
"hoodie.compact.inline.ma [...]
Here is an example invocation using spark datasource
@@ -148,13 +148,13 @@ inputDF.write()
.mode(SaveMode.Append)
.save(basePath);
```
-
-## Upgrading
-New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
+## Upgrading
+
+New Hudi releases are listed on the [releases page](/releases/download), with detailed notes which list all the changes, with highlights in each release.
At the end of the day, Hudi is a storage system and with that comes a lot of responsibilities, which we take seriously.
-As general guidelines,
+As general guidelines,
- We strive to keep all changes backwards compatible (i.e. new code can read old data/timeline files) and when we cannot, we will provide upgrade/downgrade tools via the CLI
- We cannot always guarantee forward compatibility (i.e. old code being able to read data/timeline files written by a greater version). This is generally the norm, since no new features can be built otherwise.
@@ -175,10 +175,14 @@ following steps:
0.x readers will continue to work; writers can also be readers and will continue to read both tv=6.
a. Set `hoodie.write.auto.upgrade` to false.
b. Set `hoodie.metadata.enable` to false.
+
3. Upgrade table services to 1.x with tv=6, and resume operations.
+
4. Upgrade all remaining readers to 1.x, with tv=6.
+
5. Redeploy writers with tv=8; table services and readers will adapt/pick up tv=8 on the fly.
6. Once all readers and writers are in 1.x, we are good to enable any new features, including metadata, with 1.x tables.
+
During the upgrade, metadata table will not be updated and it will be behind the data table. It is important to note that metadata table will be updated only when the writer is upgraded to tv=8. So, even the readers should keep metadata
@@ -198,6 +202,7 @@ CLI to downgrade a table from a higher version to lower version. Let's consider
0.12.0, upgrade it to 0.13.0 and then downgrade it via Hudi CLI.
Launch spark shell with Hudi 0.11.0 version.
+
```shell
spark-shell \
--packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0 \
@@ -207,6 +212,7 @@ spark-shell \
```
Create a hudi table by using the scala script below.
+
```scala
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._