(hudi) branch asf-site updated: docs: update lock provider info (#14325)

xushiyan Sun, 23 Nov 2025 15:05:20 -0800

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 1fb7fbf1ffb2 docs: update lock provider info (#14325)
1fb7fbf1ffb2 is described below

commit 1fb7fbf1ffb2e35583010ebb4b6be146d3481604
Author: Shiyan Xu <[email protected]>
AuthorDate: Sun Nov 23 17:05:06 2025 -0600

    docs: update lock provider info (#14325)
---
 AGENTS.md                           |   2 +-
 website/docs/concurrency_control.md | 254 ++++++++++++++++++------------------
 website/docs/record_merger.md       |   2 +-
 3 files changed, 132 insertions(+), 126 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index e8d5285fcafb..f2ce3d08d736 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -65,4 +65,4 @@ Your post is now ready for review.
 
 The `website/docs` directory contains documentation pages for the Hudi 
project, which are versioned against releases.
 
-When updating a markdown file in the hudi website repo, if the file contains a 
`last_modified_at` field in its front matter, update it to the current date 
when modifying the file.
+When updating a markdown file in the hudi website repo, if the file contains a 
`last_modified_at` field in its front matter, update it to the current 
timestamp in ISO 8601 format (YYYY-MM-DDTHH:MM:SS) when modifying the file.
diff --git a/website/docs/concurrency_control.md 
b/website/docs/concurrency_control.md
index 6a610c572967..511516b5df0e 100644
--- a/website/docs/concurrency_control.md
+++ b/website/docs/concurrency_control.md
@@ -1,131 +1,145 @@
 ---
 title: "Concurrency Control"
-summary: In this page, we will discuss how to perform concurrent writes to 
Hudi Tables.
+summary: On this page, we discuss how to perform concurrent writes to Hudi 
tables.
 toc: true
 toc_min_heading_level: 2
 toc_max_heading_level: 4
-last_modified_at: 2021-03-19T15:59:57-04:00
+last_modified_at: 2025-11-23T14:20:00
 ---
-Concurrency control defines how different writers/readers/table services 
coordinate access to a Hudi table. Hudi ensures atomic writes, by way of 
publishing commits atomically to the timeline, 
-stamped with an instant time that denotes the time at which the action is 
deemed to have occurred. Unlike general purpose file version control, Hudi 
draws clear distinction between 
-writer processes that issue [write operations](write_operations) and table 
services that (re)write data/metadata to optimize/perform bookkeeping and 
-readers (that execute queries and read data). 
 
-Hudi provides
+Concurrency control defines how different writers, readers, and table services 
coordinate access to a Hudi table. Hudi ensures atomic writes by publishing 
commits atomically to the timeline, stamped with an instant time that denotes 
when the action is deemed to have occurred. Unlike general-purpose file version 
control, Hudi draws a clear distinction between writer processes that issue 
[write operations](write_operations), table services that (re)write 
data/metadata to optimize or perfor [...]
+
+Hudi provides:
+
 * **Snapshot isolation** between all three types of processes, meaning they 
all operate on a consistent snapshot of the table.
-* **Optimistic concurrency control (OCC)** between writers to provide standard 
relational database semantics.
-* **Multiversion Concurrency Control (MVCC)** based concurrency control 
between writers and table-services and between different table services.
-* **Non-blocking Concurrency Control (NBCC)** between writers, to provide 
streaming semantics and avoiding live-locks/starvation between writers.
+* **Optimistic Concurrency Control (OCC)** between writers to provide standard 
relational database semantics.
+* **Multiversion Concurrency Control (MVCC)** between writers and table 
services and between different table services.
+* **Non-Blocking Concurrency Control (NBCC)** between writers to provide 
streaming semantics and avoid live-locks/starvation between writers.
 
-In this section, we will discuss the different concurrency controls supported 
by Hudi and how they are leveraged to provide flexible deployment models for 
single and multiple writer scenarios. 
-We’ll also describe ways to ingest data into a Hudi Table from multiple 
writers using different writers, like  Hudi Streamer, Hudi datasource, Spark 
Structured Streaming and Spark SQL.
+In this section, we discuss the different concurrency controls supported by 
Hudi and how they enable flexible deployment models for single- and 
multi-writer scenarios. We also describe ways to ingest data into a Hudi table 
from multiple writers using components like Hudi Streamer, the Hudi DataSource, 
Spark Structured Streaming, and Spark SQL.
 
 :::note
-If there is only one process performing writing AND async/inline table 
services on the table, you can
-avoid the overhead of a distributed lock requirement by configuring the in 
process lock provider.
+If there is only one process performing writing AND async/inline table 
services on the table, you can avoid the overhead of a distributed lock 
requirement by configuring the in-process lock provider.
 
 ```properties
 
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider
 ```
+
 :::
 
-## Distributed Locking 
-A pre-requisite for distributed co-ordination in Hudi, like many other 
distributed database systems is a distributed lock provider, that different 
processes can use to plan, schedule and 
-execute actions on the Hudi timeline in a concurrent fashion. Locks are also 
used to [generate 
TrueTime](https://hudi.apache.org/docs/timeline#truetime-generation), as 
discussed before.
+## Distributed Locking
 
-External locking is typically used in conjunction with optimistic concurrency 
control 
-because it provides a way to prevent conflicts that might occur when two or 
more transactions (commits in our case) attempt to modify the same resource 
concurrently. 
-When a transaction attempts to modify a resource that is currently locked by 
another transaction, it must wait until the lock is released before proceeding.
+A prerequisite for distributed coordination in Hudi, like many other 
distributed systems, is a distributed lock provider that processes can use to 
plan, schedule, and execute actions on the Hudi timeline concurrently. Locks 
are also used to [generate TrueTime](timeline#truetime-generation).
 
-In case of multi-writing in Hudi, the locks are acquired on the Hudi table for 
a very short duration during specific phases (such as just before committing 
the writes or before scheduling table services) instead of locking for the 
entire span of time. This approach allows multiple writers to work on the same 
table simultaneously, increasing concurrency and avoids conflicts.
+External locking is typically used with optimistic concurrency control because 
it prevents conflicts that occur when two or more commits attempt to modify the 
same resource concurrently. When a transaction attempts to modify a locked 
resource, it must wait until the lock is released.
 
-There are 4 different lock providers that require different configurations to 
be set. 1 lock provider requires no configurations. Please refer to 
comprehensive locking configs 
[here](https://hudi.apache.org/docs/next/configurations#LOCK).
+In multi-writing scenarios, locks are acquired on the Hudi table for very 
short periods during specific phases (such as just before committing writes or 
before scheduling table services) instead of for the entire job duration. This 
approach allows multiple writers to work on the same table simultaneously, 
increasing concurrency and avoiding conflicts.
 
-#### Storage with conditional writes based
-```
+Hudi provides multiple lock providers requiring different configurations. 
Refer to the [locking configs](configurations#LOCK) for the complete list.
+
+### Storage-Based Lock Provider
+
+The storage-based lock provider eliminates the need for external lock 
infrastructure by managing concurrency directly using the `.hoodie/` directory 
in your table's storage layer. This removes dependencies on DynamoDB, 
ZooKeeper, or Hive Metastore, reducing operational complexity and cost.
+
+It uses conditional writes on a single lock file so only one writer holds the 
lock at a time, with heartbeat-based renewal and automatic expiration for fault 
tolerance. Basic setup requires only setting the provider class name, though 
optional parameters allow tuning lock validity and renewal intervals.
+
+Add the corresponding cloud bundle to your classpath:
+
+* For S3: `hudi-aws-bundle`
+* For GCS: `hudi-gcp-bundle`
+
+Set this configuration:
+
+```properties
 
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.StorageBasedLockProvider
 ```
-No configs are required for this lock provider, however as of 1.0.2 this lock 
provider is only supported for S3.
-All writers compete to acquire a lock using the file 
`.hoodie/.locks/table_lock.json`.
 
-#### Zookeeper based
-```
+Supported for S3 and GCS (additional systems planned). This cloud-native 
design works directly with storage features, simplifying large-scale cloud 
operations.
+
+Optional tuning configurations:
+
+| Config Name                                     | Default        | 
Description                                                                     
                                                                                
                                                     |
+|-------------------------------------------------|----------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| hoodie.write.lock.storage.validity.timeout.secs | 300 (Optional) | Validity 
period (seconds) for each new lock. The provider renews its lock until the 
lease extends or timeout occurs.<br /><br />`Config Param: 
STORAGE_BASED_LOCK_VALIDITY_TIMEOUT_SECS`<br />`Since Version: 1.0.2` |
+| hoodie.write.lock.storage.renew.interval.secs   | 30 (Optional)  | Interval 
(seconds) between renewal attempts.<br /><br />`Config Param: 
STORAGE_BASED_LOCK_RENEW_INTERVAL_SECS`<br />`Since Version: 1.0.2`             
                                                              |
+
+### Zookeeper-Based Lock Provider
+
+```properties
 
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
 ```
-Following are the basic configs required to setup this lock provider:
 
-| Config Name| Default| Description                                            
                                                                                
                                                                    |
-| ---------------------------------------------------------------------------- 
| ------------------------ 
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| hoodie.write.lock.zookeeper.base_path   | N/A **(Required)**   | The base 
path on Zookeeper under which to create lock related ZNodes. This should be 
same for all concurrent writers to the same table<br /><br />`Config Param: 
ZK_BASE_PATH`<br />`Since Version: 0.8.0` |
-| hoodie.write.lock.zookeeper.port        | N/A **(Required)**   | Zookeeper 
port to connect to.<br /><br />`Config Param: ZK_PORT`<br />`Since Version: 
0.8.0`                                                                          
                                     |
-| hoodie.write.lock.zookeeper.url         | N/A **(Required)**   | Zookeeper 
URL to connect to.<br /><br />`Config Param: ZK_CONNECT_URL`<br />`Since 
Version: 0.8.0`                                                                 
                                        |
+Basic configuration:
 
-#### HiveMetastore based
+| Config Name                           | Default            | Description     
                                                                                
                                                                                
   |
+|---------------------------------------|--------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| hoodie.write.lock.zookeeper.base_path | N/A **(Required)** | Base path on 
ZooKeeper under which to create lock-related ZNodes. Must be the same for all 
concurrent writers.<br /><br />`Config Param: ZK_BASE_PATH`<br />`Since 
Version: 0.8.0` |
+| hoodie.write.lock.zookeeper.port      | N/A **(Required)** | ZooKeeper 
port.<br /><br />`Config Param: ZK_PORT`<br />`Since Version: 0.8.0`            
                                                                                
         |
+| hoodie.write.lock.zookeeper.url       | N/A **(Required)** | ZooKeeper 
connection URL.<br /><br />`Config Param: ZK_CONNECT_URL`<br />`Since Version: 
0.8.0`                                                                          
          |
 
-```
+### HiveMetastore-Based Lock Provider
+
+```properties
 
hoodie.write.lock.provider=org.apache.hudi.hive.transaction.lock.HiveMetastoreBasedLockProvider
 ```
-Following are the basic configs required to setup this lock provider:
 
-| Config Name| Default| Description                                            
                                                                                
                                                                    |
-| ----------------------------------------------------------------------- | 
------------------------ 
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| hoodie.write.lock.hivemetastore.database                     | N/A 
**(Required)**                                                                  
               | For Hive based lock provider, the Hive database to acquire 
lock against<br /><br />`Config Param: HIVE_DATABASE_NAME`<br />`Since Version: 
0.8.0`                                                                          
                                                          |
-| hoodie.write.lock.hivemetastore.table                         | N/A 
**(Required)**                                                                  
               | For Hive based lock provider, the Hive table to acquire lock 
against<br /><br />`Config Param: HIVE_TABLE_NAME`<br />`Since Version: 0.8.0`  
                                                                                
                                                        |
+Basic configuration:
+
+| Config Name                              | Default            | Description  
                                                                                
                    |
+|------------------------------------------|--------------------|------------------------------------------------------------------------------------------------------------------|
+| hoodie.write.lock.hivemetastore.database | N/A **(Required)** | Hive 
database to acquire lock against.<br /><br />`Config Param: 
HIVE_DATABASE_NAME`<br />`Since Version: 0.8.0` |
+| hoodie.write.lock.hivemetastore.table    | N/A **(Required)** | Hive table 
to acquire lock against.<br /><br />`Config Param: HIVE_TABLE_NAME`<br />`Since 
Version: 0.8.0`       |
 
-`The HiveMetastore URI's are picked up from the hadoop configuration file 
loaded during runtime.`
+Hive Metastore URIs are picked up from the Hadoop configuration at runtime.
 
 :::note
-Zookeeper is required for HMS lock provider. Users can set the zookeeper 
configs for Hive using Hudi
+ZooKeeper is required for the HMS lock provider. Set ZooKeeper configs via 
Hudi:
 
 ```properties
 hoodie.write.lock.zookeeper.url
 hoodie.write.lock.zookeeper.port
 ```
-or via Hive config for Hive Metastore (HMS)
+
+Or via Hive configuration:
 
 ```properties
 hive.zookeeper.quorum
 hive.zookeeper.client.port
 ```
+
 :::
 
-#### Amazon DynamoDB based
-```
+### DynamoDB-Based Lock Provider
+
+```properties
 
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
 ```
-Amazon DynamoDB based lock provides a simple way to support multi writing 
across different clusters.  You can refer to the
-[DynamoDB based Locks 
Configurations](https://hudi.apache.org/docs/configurations#DynamoDB-based-Locks-Configurations)
-section for the details of each related configuration knob. Following are the 
basic configs required to setup this lock provider:
 
-| Config Name| Default| Description                                            
                                                                                
                                                                    |
-| ----------------------------------------------------------------------- | 
------------------------ 
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| hoodie.write.lock.dynamodb.endpoint_url| N/A **(Required)**         | For 
DynamoDB based lock provider, the url endpoint used for Amazon DynamoDB 
service. Useful for development with a local dynamodb instance.<br /><br 
/>`Config Param: DYNAMODB_ENDPOINT_URL`<br />`Since Version: 0.10.1`|
+The Amazon DynamoDB–based lock provider supports multi-writing across 
clusters. See [DynamoDB-Based Locks 
Configurations](configurations#DynamoDB-based-Locks-Configurations) for all 
options. Basic configuration:
 
-For advanced configs refer 
[here](https://hudi.apache.org/docs/next/configurations#DynamoDB-based-Locks-Configurations)
+| Config Name                             | Default            | Description   
                                                                                
                                      |
+|-----------------------------------------|--------------------|-------------------------------------------------------------------------------------------------------------------------------------|
+| hoodie.write.lock.dynamodb.endpoint_url | N/A **(Required)** | Endpoint URL 
for DynamoDB (useful for local testing).<br /><br />`Config Param: 
DYNAMODB_ENDPOINT_URL`<br />`Since Version: 0.10.1` |
 
+Further configurations: [DynamoDB-Based Locks 
Configurations](configurations#DynamoDB-based-Locks-Configurations)
 
-When using the DynamoDB-based lock provider, the name of the DynamoDB table 
acting as the lock table for Hudi is
-specified by the config `hoodie.write.lock.dynamodb.table`. This DynamoDB 
table is automatically created by Hudi, so you
-don't have to create the table yourself. If you want to use an existing 
DynamoDB table, make sure that an attribute with
-the name `key` is present in the table.  The `key` attribute should be the 
partition key of the DynamoDB table. The
-config `hoodie.write.lock.dynamodb.partition_key` specifies the value to put 
for the `key` attribute (not the attribute
-name), which is used for the lock on the same table. By default, 
`hoodie.write.lock.dynamodb.partition_key` is set to
-the table name, so that multiple writers writing to the same table share the 
same lock. If you customize the name, make
-sure it's the same across multiple writers.
+Table creation: Hudi auto-creates the DynamoDB table specified by 
`hoodie.write.lock.dynamodb.table`. If using an existing table, ensure a `key` 
attribute (as partition key) exists. `hoodie.write.lock.dynamodb.partition_key` 
is the value written for the partition key (default: table name), ensuring 
multiple writers share the same lock.
 
-Also, to set up the credentials for accessing AWS resources, customers can 
pass the following props to Hudi jobs:
-```
+Credential props (if not using the default provider chain):
+
+```properties
 hoodie.aws.access.key
 hoodie.aws.secret.key
 hoodie.aws.session.token
 ```
-If not configured, Hudi falls back to use 
[DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html).
 
-IAM policy for your service instance will need to add the following 
permissions:
+If unset, Hudi falls back to the 
[DefaultAWSCredentialsProviderChain](https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html).
+
+Required IAM permissions:
 
 ```json
 {
-  "Sid":"DynamoDBLocksTable",
+  "Sid": "DynamoDBLocksTable",
   "Effect": "Allow",
   "Action": [
     "dynamodb:CreateTable",
@@ -139,36 +153,38 @@ IAM policy for your service instance will need to add the 
following permissions:
   "Resource": 
"arn:${Partition}:dynamodb:${Region}:${Account}:table/${TableName}"
 }
 ```
-- `TableName` : same as `hoodie.write.lock.dynamodb.partition_key`
-- `Region`: same as `hoodie.write.lock.dynamodb.region`
+
+* `TableName` : same as `hoodie.write.lock.dynamodb.partition_key`
+* `Region`: same as `hoodie.write.lock.dynamodb.region`
 
 AWS SDK dependencies are not bundled with Hudi from v0.10.x and will need to 
be added to your classpath.
 Add the following Maven packages (check the latest versions at time of 
install):
-```
+
+```text
 com.amazonaws:dynamodb-lock-client
 com.amazonaws:aws-java-sdk-dynamodb
 com.amazonaws:aws-java-sdk-core
 ```
 
-#### FileSystem based (not for production use)
+### FileSystem based lock provider
 
-FileSystem based lock provider supports multiple writers cross different 
jobs/applications based on atomic create/delete operations of the underlying 
filesystem.
+FileSystem based lock provider supports multiple writers across different 
jobs/applications based on atomic create/delete operations of the underlying 
filesystem.
 
-```
+```properties
 
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.FileSystemBasedLockProvider
 ```
 
 When using the FileSystem based lock provider, by default, the lock file will 
store into `hoodie.base.path`+`/.hoodie/lock`. You may use a custom folder to 
store the lock file by specifying `hoodie.write.lock.filesystem.path`.
 
-In case the lock cannot release during job crash, you can set 
`hoodie.write.lock.filesystem.expire` (lock will never expire by default) to a 
desired expire time in minutes. You may also delete lock file manually in such 
situation.
-:::note
-FileSystem based lock provider is not supported with cloud storage like S3 or 
GCS.
-:::
+In case the lock cannot be released during job crash, you can set 
`hoodie.write.lock.filesystem.expire` (lock will never expire by default) to a 
desired expire time in minutes. You may also delete the lock file manually in 
such situation.
 
+:::warning
+FileSystem based lock provider is not for production use, and it is not 
supported with cloud storage like S3 or GCS.
+:::
 
 ## Simple Single writer + table services
 
-Data lakehouse pipelines tend to be predominantly single writer, with the most 
common need for distributed co-ordination on a table coming from table 
management. For e.g. a Apache Flink 
+Data lakehouse pipelines tend to be predominantly single writer, with the most 
common need for distributed co-ordination on a table coming from table 
management. For e.g. a Apache Flink
 job producing fast writes into a table, requiring regular file-size management 
or cleaning. Hudi's storage engine and platform tools provide a lot of support 
for such common scenarios.
 
 ### Inline table services
@@ -181,64 +197,57 @@ A single writer with all table services such as cleaning, 
clustering, compaction
 
 ### Async table services
 
-Hudi provides the option of running the table services in an async fashion, 
where most of the heavy lifting (e.g actually rewriting the columnar data by 
compaction service) is done asynchronously. In this model, the async deployment 
eliminates any repeated wasteful retries and optimizes the table using 
clustering techniques while a single writer consumes the writes to the table 
without having to be blocked by such table services. This model avoids the need 
for taking an [external lock](# [...]
+Hudi provides the option of running the table services in an async fashion, 
where most of the heavy lifting (e.g actually rewriting the columnar data by 
compaction service) is done asynchronously. In this model, the async deployment 
eliminates any repeated wasteful retries and optimizes the table using 
clustering techniques while a single writer consumes the writes to the table 
without having to be blocked by such table services. This model avoids the need 
for taking a [lock provider](#d [...]
 
-A single writer along with async table services runs in the same process. For 
example, you can have a  Hudi Streamer in continuous mode write to a MOR table 
using async compaction; you can use Spark Streaming (where 
[compaction](https://hudi.apache.org/docs/compaction) is async by default), and 
you can use Flink streaming or your own job setup and enable async table 
services inside the same writer.
+A single writer along with async table services runs in the same process. For 
example, you can have a  Hudi Streamer in continuous mode write to a MOR table 
using async compaction; you can use Spark Streaming (where 
[compaction](compaction) is async by default), and you can use Flink streaming 
or your own job setup and enable async table services inside the same writer.
 
 Hudi leverages **MVCC** in this model to support running any number of table 
service jobs concurrently, without any concurrency conflict.  This is made 
possible by ensuring Hudi 's ingestion writer and async table services 
coordinate among themselves to ensure no conflicts and avoid race conditions. 
The same single write guarantees described in Model A above can be achieved in 
this model as well.
 With this model users don't need to spin up different spark jobs and manage 
the orchestration among it. For larger deployments, this model can ease the 
operational burden significantly while getting the table services running 
without blocking the writers.
 
 **Single Writer Guarantees**
 
-In this model, the following are the guarantees on [write 
operations](https://hudi.apache.org/docs/write_operations/) to expect:
+In this model, the following are the guarantees on [write 
operations](write_operations) to expect:
 
-- *UPSERT Guarantee*: The target table will NEVER show duplicates.
-- *INSERT Guarantee*: The target table wilL NEVER have duplicates if dedup: 
[`hoodie.datasource.write.insert.drop.duplicates`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates)
 & 
[`hoodie.combine.before.insert`](https://hudi.apache.org/docs/configurations/#hoodiecombinebeforeinsert),
 is enabled.
-- *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if 
dedup: 
[`hoodie.datasource.write.insert.drop.duplicates`](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates)
 & 
[`hoodie.combine.before.insert`](https://hudi.apache.org/docs/configurations/#hoodiecombinebeforeinsert),
 is enabled.
-- *INCREMENTAL QUERY Guarantee*: Data consumption and checkpoints are NEVER 
out of order.
+* *UPSERT Guarantee*: The target table will NEVER show duplicates.
+* *INSERT Guarantee*: The target table will NEVER have duplicates if dedup: 
[`hoodie.datasource.write.insert.drop.duplicates`](configurations#hoodiedatasourcewriteinsertdropduplicates)
 & [`hoodie.combine.before.insert`](configurations/#hoodiecombinebeforeinsert), 
is enabled.
+* *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if 
dedup: 
[`hoodie.datasource.write.insert.drop.duplicates`](configurations#hoodiedatasourcewriteinsertdropduplicates)
 & [`hoodie.combine.before.insert`](configurations/#hoodiecombinebeforeinsert), 
is enabled.
+* *INCREMENTAL QUERY Guarantee*: Data consumption and checkpoints are NEVER 
out of order.
 
 ## Full-on Multi-writer + Async table services
 
-Hudi has introduced a new concurrency mode `NON_BLOCKING_CONCURRENCY_CONTROL`, 
where unlike OCC, multiple writers can
-operate on the table with non-blocking conflict resolution. The writers can 
write into the same file group with the
-conflicts resolved automatically by the query reader and the compactor. The 
new concurrency mode is currently
-available for preview in version 1.0.1-beta only. You can read more about it 
under section [Model C: Multi-writer](#model-c-multi-writer).
+Hudi supports concurrency mode `NON_BLOCKING_CONCURRENCY_CONTROL`, where 
unlike OCC, multiple writers can
+operate on the table with non-blocking conflict resolution. The writers can 
write into the same file group with the conflicts resolved automatically by the 
query reader and the compactor. You can read more about it under [this 
section](#non-blocking-concurrency-control).
 
-It is not always possible to serialize all write operations to a table (such 
as UPSERT, INSERT or DELETE) into the same write process and therefore, 
multi-writing capability may be required. 
-In multi-writing, disparate distributed processes run in parallel or 
overlapping time windows to write to the same table. In such cases, an external 
locking mechanism is a must to safely 
-coordinate concurrent accesses. Here are few different scenarios that would 
all fall under multi-writing:
+It is not always possible to serialize all write operations to a table (such 
as UPSERT, INSERT or DELETE) into the same write process and therefore, 
multi-writing capability may be required.
+In multi-writing, disparate distributed processes run in parallel or 
overlapping time windows to write to the same table. In such cases, an external 
locking mechanism is a must to safely
+coordinate concurrent accesses. Here are a few different scenarios that would 
all fall under multi-writing:
 
-- Multiple ingestion writers to the same table:For instance, two Spark 
Datasource writers working on different sets of partitions form a source kafka 
topic.
-- Multiple ingestion writers to the same table, including one writer with 
async table services: For example, a Hudi Streamer with async compaction for 
regular ingestion & a Spark Datasource writer for backfilling.
-- A single ingestion writer and a separate compaction (HoodieCompactor) or 
clustering (HoodieClusteringJob) job apart from the ingestion writer: This is 
considered as multi-writing as they are not running in the same process.
+* Multiple ingestion writers to the same table: For instance, two Spark 
Datasource writers working on different sets of partitions form a source kafka 
topic.
+* Multiple ingestion writers to the same table, including one writer with 
async table services: For example, a Hudi Streamer with async compaction for 
regular ingestion & a Spark Datasource writer for backfilling.
+* A single ingestion writer and a separate compaction (HoodieCompactor) or 
clustering (HoodieClusteringJob) job apart from the ingestion writer: This is 
considered as multi-writing as they are not running in the same process.
 
 Hudi's concurrency model intelligently differentiates actual writing to the 
table from table services that manage or optimize the table. Hudi offers 
similar **optimistic concurrency control across multiple writers**, but **table 
services can still execute completely lock-free and async** as long as they run 
in the same process as one of the writers.
 For multi-writing, Hudi leverages file level optimistic concurrency 
control(OCC). For example, when two writers write to non overlapping files, 
both writes are allowed to succeed. However, when the writes from different 
writers overlap (touch the same set of files), only one of them will succeed. 
Please note that this feature is currently experimental and requires external 
lock providers to acquire locks briefly at critical sections during the write. 
More on lock providers below.
 
-#### Multi Writer Guarantees
+### Multi-Writer Guarantees
 
 With multiple writers using OCC, these are the write guarantees to expect:
 
-- *UPSERT Guarantee*: The target table will NEVER show duplicates.
-- *INSERT Guarantee*: The target table MIGHT have duplicates even if dedup is 
enabled.
-- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if 
dedup is enabled.
-- *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER out 
of order. If there are inflight commits 
-  (due to multi-writing), incremental queries will not expose the completed 
commits following the inflight commits. 
+* *UPSERT Guarantee*: The target table will NEVER show duplicates.
+* *INSERT Guarantee*: The target table MIGHT have duplicates even if dedup is 
enabled.
+* *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if 
dedup is enabled.
+* *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER out 
of order. If there are inflight commits (due to multi-writing), incremental 
queries will not expose the completed commits following the inflight commits.
 
-## Non-Blocking Concurrency Control  
+## Non-Blocking Concurrency Control
 
-`NON_BLOCKING_CONCURRENCY_CONTROL`, offers the same set of guarantees as 
mentioned in the case of OCC but without
+The `NON_BLOCKING_CONCURRENCY_CONTROL` offers the same set of guarantees as 
mentioned in the case of OCC but without
 explicit locks for serializing the writes. Lock is only needed for writing the 
commit metadata to the Hudi timeline. The
 completion time for the commits reflects the serialization order and file 
slicing is done based on completion time.
 Multiple writers can operate on the table with non-blocking conflict 
resolution. The writers can write into the same
-file group with the conflicts resolved automatically by the query reader and 
the compactor. The new concurrency mode is
-currently available for preview in version 1.0.1-beta only with the caveat 
that conflict resolution is not supported yet
-between clustering and ingestion. It works for compaction and ingestion, and 
we can see an example of that with Flink
-writers [here](sql_dml#non-blocking-concurrency-control-experimental).
+file group with the conflicts resolved automatically by the query reader and 
the compactor. It works for compaction and ingestion, and we can see an example 
of that with [Flink 
writers](sql_dml#non-blocking-concurrency-control-experimental).
 
 :::note
-`NON_BLOCKING_CONCURRENCY_CONTROL` between ingestion writer and table service 
writer is not yet supported for clustering.
-Please use `OPTIMISTIC_CONCURRENCY_CONTROL` for clustering.
+`NON_BLOCKING_CONCURRENCY_CONTROL` between ingestion writer and table service 
writer is not yet supported for clustering. Please use 
`OPTIMISTIC_CONCURRENCY_CONTROL` for clustering.
 :::
 
 ## Early conflict Detection
@@ -252,12 +261,11 @@ By default, this feature is turned off. To try this out, 
a user needs to set `ho
 Early conflict Detection in OCC is an **EXPERIMENTAL** feature
 :::
 
-
 ## Enabling Multi Writing
 
 The following properties are needed to be set appropriately to turn on 
optimistic concurrency control to achieve multi writing.
 
-```
+```properties
 hoodie.write.concurrency.mode=optimistic_concurrency_control
 hoodie.write.lock.provider=<lock-provider-classname>
 hoodie.cleaner.policy.failed.writes=LAZY
@@ -269,7 +277,6 @@ hoodie.cleaner.policy.failed.writes=LAZY
 | hoodie.write.lock.provider          | 
org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider (Optional) | 
Lock provider class name, user can provide their own implementation of 
LockProvider which should be subclass of 
org.apache.hudi.common.lock.LockProvider<br /><br />`Config Param: 
LOCK_PROVIDER_CLASS_NAME`<br />`Since Version: 0.8.0`                           
                                                                                
                                  [...]
 | hoodie.cleaner.policy.failed.writes | EAGER (Optional)                       
                                       | 
org.apache.hudi.common.model.HoodieFailedWritesCleaningPolicy: Policy that 
controls how to clean up failed writes. Hudi will delete any files written by 
failed writes to re-claim space.     EAGER(default): Clean failed writes inline 
after every write operation.     LAZY: Clean failed writes lazily after 
heartbeat timeout when the cleaning service runs. This policy is re [...]
 
-
 ### Multi Writing via Hudi Streamer
 
 The Hudi Streamer (part of hudi-utilities-slim-bundle) provides ways to ingest 
from different sources such as DFS or Kafka, with the following capabilities.
@@ -280,13 +287,13 @@ A Hudi Streamer job can then be triggered as follows:
 
 ```java
 [hoodie]$ spark-submit \
-  --jars 
"packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.0.1.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.0.1.jar"
 \
+  --jars 
"packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle_2.12-1.1.0.jar,packaging/hudi-spark-bundle/target/hudi-spark3.5-bundle_2.12-1.1.0.jar"
 \
   --class org.apache.hudi.utilities.streamer.HoodieStreamer `ls 
packaging/hudi-utilities-slim-bundle/target/hudi-utilities-slim-bundle-*.jar` \
   --props 
file://${PWD}/hudi-utilities/src/test/resources/streamer-config/kafka-source.properties
 \
   --schemaprovider-class 
org.apache.hudi.utilities.schema.SchemaRegistryProvider \
   --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
   --source-ordering-field impresssiontime \
-  --target-base-path file:\/\/\/tmp/hudi-streamer-op \ 
+  --target-base-path file:///tmp/hudi-streamer-op \ 
   --target-table tableName \
   --op BULK_INSERT
 ```
@@ -317,42 +324,41 @@ inputDF.write.format("hudi")
 
 Remove the following settings that were used to enable multi-writer or 
override with default values.
 
-```
+```properties
 hoodie.write.concurrency.mode=single_writer
 hoodie.cleaner.policy.failed.writes=EAGER
 ```
 
 ## OCC Best Practices
 
-Concurrent Writing to Hudi tables requires acquiring a lock with one of the 
lock providers mentioned above. Due to several reasons you might want to 
configure retries to allow your application to acquire the lock. 
+Concurrent Writing to Hudi tables requires acquiring a lock with one of the 
lock providers mentioned above. Due to several reasons you might want to 
configure retries to allow your application to acquire the lock.
+
 1. Network connectivity or excessive load on servers increasing time for lock 
acquisition resulting in timeouts
 2. Running a large number of concurrent jobs that are writing to the same hudi 
table can result in contention during lock acquisition can cause timeouts
 3. In some scenarios of conflict resolution, Hudi commit operations might take 
upto 10's of seconds while the lock is being held. This can result in timeouts 
for other jobs waiting to acquire a lock.
 
-Set the correct native lock provider client retries. 
-:::note 
+Set the correct native lock provider client retries.:::note
 Please note that sometimes these settings are set on the server once and all 
clients inherit the same configs. Please check your settings before enabling 
optimistic concurrency.
 :::
-   
-```
+
+```properties
 hoodie.write.lock.wait_time_ms
 hoodie.write.lock.num_retries
 ```
 
-Set the correct hudi client retries for Zookeeper & HiveMetastore. This is 
useful in cases when native client retry settings cannot be changed. Please 
note that these retries will happen in addition to any native client retries 
that you may have set. 
+Set the correct hudi client retries for Zookeeper & HiveMetastore. This is 
useful in cases when native client retry settings cannot be changed. Please 
note that these retries will happen in addition to any native client retries 
that you may have set.
 
-```
+```properties
 hoodie.write.lock.client.wait_time_ms
 hoodie.write.lock.client.num_retries
 ```
 
 *Setting the right values for these depends on a case by case basis; some 
defaults have been provided for general cases.*
 
-
 ## Caveats
 
-If you are using the `WriteClient` API, please note that multiple writes to 
the table need to be initiated from 2 different instances of the write client. 
-It is **NOT** recommended to use the same instance of the write client to 
perform multi writing. 
+If you are using the `WriteClient` API, please note that multiple writes to 
the table need to be initiated from 2 different instances of the write client.
+It is **NOT** recommended to use the same instance of the write client to 
perform multi writing.
 
 ## Related Resources
 
diff --git a/website/docs/record_merger.md b/website/docs/record_merger.md
index e90633d60639..948055b2ff27 100644
--- a/website/docs/record_merger.md
+++ b/website/docs/record_merger.md
@@ -27,7 +27,7 @@ There are three merge modes: `COMMIT_TIME_ORDERING`, 
`EVENT_TIME_ORDERING`, and
 
 You can explicitly configure the merge mode using the write config 
`hoodie.write.record.merge.mode`. When you create or write to a table with this 
config, it will be persisted to the table's configuration file 
(`hoodie.properties`) as `hoodie.record.merge.mode`. Once persisted, all 
subsequent reads and writes will use this merge mode unless explicitly 
overridden in the write config.
 
-**Merge modes are built-in, ready-to-use record mergers.** For most use cases, 
you can choose `COMMIT_TIME_ORDERING` or `EVENT_TIME_ORDERING` by setting the 
appropriate ordering field(s) without any additional configuration or 
implementation. These modes provide standard merging behaviors that cover the 
majority of scenarios. For advanced use cases requiring custom merge logic, you 
can use the `CUSTOM` merge mode and implement the `HoodieRecordMerger` 
interface.
+Merge mode is the way to choose and configure the record merger for your 
tables. For most use cases, you can choose `COMMIT_TIME_ORDERING` or 
`EVENT_TIME_ORDERING` by setting the appropriate ordering field(s) without any 
additional configuration or implementation. These modes provide standard 
merging behaviors that cover the majority of scenarios. For advanced use cases 
requiring custom merge logic, you can use the `CUSTOM` merge mode and implement 
the `HoodieRecordMerger` interface.
 
 :::note
 The merge mode should not be altered once a table is created to avoid 
inconsistent behavior due to compaction producing different merge results when 
switching between modes.

(hudi) branch asf-site updated: docs: update lock provider info (#14325)

Reply via email to