This is an automated email from the ASF dual-hosted git repository.
sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 19734b58a62 [DOCS] Updated inline and async process with more details
(#10664)
19734b58a62 is described below
commit 19734b58a6230930fea2ad1bb748d5d6f6bb24c7
Author: nadine farah <[email protected]>
AuthorDate: Fri Mar 8 10:16:59 2024 -0800
[DOCS] Updated inline and async process with more details (#10664)
---
website/docs/clustering.md | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)
diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index f61c61a4476..149b690ff3b 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -170,9 +170,14 @@ for inline or async clustering are shown below with code
samples.
## Inline clustering
-Inline clustering happens synchronously with the regular ingestion writer,
which means the next round of ingestion
-cannot proceed until the clustering is complete. Inline clustering can be
setup easily using spark dataframe options.
-See sample below
+Inline clustering happens synchronously with the regular ingestion writer, as
part of the data ingestion pipeline. This means the next round of ingestion
cannot proceed until the clustering is complete. With inline clustering, Hudi
will schedule and plan clustering operations after each commit is completed and
execute the clustering plans after they are created. This is the simplest
deployment model to run because it’s easier to manage than running different
asynchronous Spark jobs. This mode [...]
+
+For this deployment mode, please set `hoodie.clustering.inline` to `true`.
+
+To control how often clustering is triggered, also set
`hoodie.clustering.inline.max.commits`.
+
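As an illustrative sketch (the option values here are assumptions, not Hudi defaults), the two settings above can be collected as Spark DataFrame write options:

```scala
// Sketch: inline clustering settings as a key-value options map.
// The value "4" for max.commits is an illustrative choice, not a default.
val inlineClusteringOpts = Map(
  "hoodie.clustering.inline" -> "true",         // cluster synchronously after commits
  "hoodie.clustering.inline.max.commits" -> "4" // trigger clustering every 4 commits
)
// These would be passed along with the other write options, e.g.
// df.write.format("hudi").options(inlineClusteringOpts)...
```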
+Inline clustering can be set up easily using Spark DataFrame options.
+See the sample below:
```scala
import org.apache.hudi.QuickstartUtils._
@@ -202,7 +207,12 @@ df.write.format("org.apache.hudi").
## Async Clustering
-Async clustering runs the clustering table service in the background without
blocking the regular ingestions writers.
+Async clustering runs the clustering table service in the background without
blocking the regular ingestion writers. There are three different ways to
deploy an asynchronous clustering process:
+
+- **Asynchronous execution within the same process**: In this deployment mode,
Hudi will schedule and plan the clustering operations after each commit is
completed as part of the ingestion pipeline. Separately, Hudi spins up another
thread within the same job and executes the clustering table service. This is
supported by Spark Streaming, Flink, and DeltaStreamer in continuous mode. For
this deployment mode, please enable `hoodie.clustering.async.enabled` and
`hoodie.clustering.async.max. [...]
+- **Asynchronous scheduling and execution by a separate process**: In this
deployment mode, the application will write data to a Hudi table as part of the
ingestion pipeline. A separate clustering job will schedule, plan and execute
the clustering operation. By running a different job for the clustering
operation, it rebalances how Hudi uses compute resources: fewer compute
resources are needed for the ingestion, which makes ingestion latency stable,
and an independent set of compute res [...]
+- **Scheduling inline and executing async**: In this deployment mode, the
application ingests data and schedules the clustering in one job; in another,
the application executes the clustering plan. The supported writers (see below)
won’t be blocked from ingesting data. If the metadata table is not enabled, a
lock provider is not needed. However, if the metadata table is enabled, please
ensure all jobs have the lock providers configured for concurrency control. All
writers support this deploy [...]
+
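To make the first mode concrete, here is a minimal sketch of the async clustering flag expressed as a write option (only the flag named above is shown; the surrounding streaming pipeline setup and any further tuning options are omitted):

```scala
// Sketch: enabling asynchronous clustering within the same process.
// Clustering then executes on a separate thread of the same ingestion job.
val asyncClusteringOpts = Map(
  "hoodie.clustering.async.enabled" -> "true"
)
// Applied alongside the other write options of the streaming writer, e.g.
// df.writeStream.format("hudi").options(asyncClusteringOpts)...
```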
Hudi supports
[multi-writers](https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing),
which provide
snapshot isolation between multiple table services, thus allowing writers to
continue with ingestion while clustering
runs in the background.