This is an automated email from the ASF dual-hosted git repository.

sivabalan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 19734b58a62 [DOCS] Updated inline and async process with more details 
(#10664)
19734b58a62 is described below

commit 19734b58a6230930fea2ad1bb748d5d6f6bb24c7
Author: nadine farah <[email protected]>
AuthorDate: Fri Mar 8 10:16:59 2024 -0800

    [DOCS] Updated inline and async process with more details (#10664)
---
 website/docs/clustering.md | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/website/docs/clustering.md b/website/docs/clustering.md
index f61c61a4476..149b690ff3b 100644
--- a/website/docs/clustering.md
+++ b/website/docs/clustering.md
@@ -170,9 +170,14 @@ for inline or async clustering are shown below with code 
samples.
 
 ## Inline clustering
 
-Inline clustering happens synchronously with the regular ingestion writer, 
which means the next round of ingestion
-cannot proceed until the clustering is complete. Inline clustering can be 
setup easily using spark dataframe options.
-See sample below
+Inline clustering happens synchronously with the regular ingestion writer, as part of the data ingestion pipeline. This means the next round of ingestion cannot proceed until the clustering is complete. With inline clustering, Hudi will schedule and plan clustering operations after each commit is completed, and execute the clustering plan right after it is created. This is the simplest deployment model to run because it is easier to manage than running separate asynchronous Spark jobs. This mode  [...]
+
+For this deployment mode, enable inline clustering by setting `hoodie.clustering.inline` to `true`.
+
+To control how often clustering is triggered, also set `hoodie.clustering.inline.max.commits`.
+
+Inline clustering can be set up easily using Spark DataFrame options.
+See the sample below:
 
 ```scala
 import org.apache.hudi.QuickstartUtils._
@@ -202,7 +207,12 @@ df.write.format("org.apache.hudi").
 
 ## Async Clustering
 
-Async clustering runs the clustering table service in the background without 
blocking the regular ingestions writers.
+Async clustering runs the clustering table service in the background without blocking the regular ingestion writers. There are three different ways to deploy an asynchronous clustering process:
+
- **Asynchronous execution within the same process**: In this deployment mode, Hudi will schedule and plan the clustering operations after each commit is completed, as part of the ingestion pipeline. Separately, Hudi spins up another thread within the same job and executes the clustering table service. This is supported by Spark Streaming, Flink, and DeltaStreamer in continuous mode. For this deployment mode, please enable `hoodie.clustering.async.enabled` and `hoodie.clustering.async.max. [...]
- **Asynchronous scheduling and execution by a separate process**: In this deployment mode, the application writes data to a Hudi table as part of the ingestion pipeline, while a separate clustering job schedules, plans, and executes the clustering operation. Running a different job for the clustering operation rebalances how Hudi uses compute resources: fewer compute resources are needed for ingestion, which keeps ingestion latency stable, and an independent set of compute res [...]
- **Scheduling inline and executing async**: In this deployment mode, the application ingests data and schedules the clustering in one job, while another job executes the clustering plan. The supported writers (see below) won’t be blocked from ingesting data. If the metadata table is disabled, a lock provider is not needed; however, if the metadata table is enabled, please ensure all jobs have lock providers configured for concurrency control. All writers support this deploy [...]
+
 Hudi supports 
[multi-writers](https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing)
 which provides
 snapshot isolation between multiple table services, thus allowing writers to 
continue with ingestion while clustering
 runs in the background.
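When clustering runs as a separate job alongside an ingestion writer, both processes need concurrency control configured on the writer. The sketch below shows the relevant lock-provider settings, assuming the ZooKeeper-based lock provider that ships with Hudi; the ZooKeeper address, port, and paths are illustrative placeholders:

```scala
// Sketch: optimistic concurrency control with a ZooKeeper-based lock provider.
// The ZK host, port, lock key, and base path are placeholders.
df.write.format("org.apache.hudi").
  option("hoodie.write.concurrency.mode", "optimistic_concurrency_control").
  option("hoodie.cleaner.policy.failed.writes", "LAZY").
  option("hoodie.write.lock.provider",
    "org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider").
  option("hoodie.write.lock.zookeeper.url", "zk-host").
  option("hoodie.write.lock.zookeeper.port", "2181").
  option("hoodie.write.lock.zookeeper.lock_key", tableName).
  option("hoodie.write.lock.zookeeper.base_path", "/hudi/locks").
  mode(Append).
  save(basePath)
```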
