bhasudha commented on code in PR #9372:
URL: https://github.com/apache/hudi/pull/9372#discussion_r1307260385


##########
website/docs/concurrency_control.md:
##########
@@ -2,105 +2,126 @@
 title: "Concurrency Control"
 summary: In this page, we will discuss how to perform concurrent writes to 
Hudi Tables.
 toc: true
+toc_min_heading_level: 2
+toc_max_heading_level: 4
 last_modified_at: 2021-03-19T15:59:57-04:00
 ---
+Concurrency control defines how different writers and readers coordinate access to the table. Hudi ensures atomic writes by publishing commits atomically to the timeline, stamped with an instant time that denotes the time at which the action is deemed to have occurred. Unlike general-purpose file version control, Hudi draws a clear distinction between writer processes (that issue users' upserts/deletes), table services (that write data/metadata to optimize or perform bookkeeping) and readers (that execute queries and read data). Hudi provides snapshot isolation between all three types of processes, meaning they all operate on a consistent snapshot of the table. Hudi provides optimistic concurrency control (OCC) between writers, while providing lock-free, non-blocking MVCC-based concurrency control between writers and table services, and between different table services.
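
For the OCC-between-writers case, each writer must opt in and point at an external lock provider. A minimal sketch of the relevant writer configs, assuming a reachable ZooKeeper ensemble at `zk-host:2181` and illustrative `lock_key`/`base_path` values (not defaults):

```properties
# enable optimistic concurrency control between writers
hoodie.write.concurrency.mode=optimistic_concurrency_control
# lazy failed-write cleaning, so one writer's failure does not block others
hoodie.cleaner.policy.failed.writes=LAZY
# lock provider backing the OCC critical section (ZooKeeper shown; Hive Metastore is also supported)
hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.ZookeeperBasedLockProvider
hoodie.write.lock.zookeeper.url=zk-host
hoodie.write.lock.zookeeper.port=2181
hoodie.write.lock.zookeeper.lock_key=my_table
hoodie.write.lock.zookeeper.base_path=/hudi/write_locks
```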
 
-In this section, we will cover Hudi's concurrency model and describe ways to 
ingest data into a Hudi Table from multiple writers; using the [Hudi 
Streamer](#hudi-streamer) tool as well as 
-using the [Hudi datasource](#datasource-writer).
+In this section, we will discuss the different concurrency controls supported by Hudi and how they are leveraged to provide flexible deployment models; we will cover multi-writing, a popular deployment model; finally, we will describe ways to ingest data into a Hudi table from multiple writers using different tools, like DeltaStreamer, the Hudi datasource, Spark Structured Streaming and Spark SQL.
 
-## Supported Concurrency Controls
 
-- **MVCC** : Hudi table services such as compaction, cleaning, clustering 
leverage Multi Version Concurrency Control to provide snapshot isolation
-between multiple table service writers and readers. Additionally, using MVCC, 
Hudi provides snapshot isolation between an ingestion writer and multiple 
concurrent readers. 
-  With this model, Hudi supports running any number of table service jobs 
concurrently, without any concurrency conflict. 
-  This is made possible by ensuring that scheduling plans of such table 
services always happens in a single writer mode to ensure no conflict and 
avoids race conditions.
+## Deployment models with supported concurrency controls
 
-- **[NEW] OPTIMISTIC CONCURRENCY** : Write operations such as the ones 
described above (UPSERT, INSERT) etc, leverage optimistic concurrency control 
to enable multiple ingestion writers to
-the same Hudi Table. Hudi supports `file level OCC`, i.e., for any 2 commits 
(or writers) happening to the same table, if they do not have writes to 
overlapping files being changed, both writers are allowed to succeed. 
-  This feature is currently *experimental* and requires either Zookeeper or 
HiveMetastore to acquire locks.
+### Model A: Single writer with inline table services
 
-It may be helpful to understand the different guarantees provided by [write 
operations](/docs/write_operations/) via Hudi datasource or the Hudi Streamer.
+This is the simplest form of concurrency, meaning there is no concurrency at all in the write processes. In this model, Hudi eliminates the need for concurrency control and maximizes throughput by supporting table services out of the box and running them inline after every write to the table. Execution plans are idempotent, persisted to the timeline and auto-recover from failures. For most simple use-cases, this means just writing is sufficient to get a well-managed table that needs no concurrency control.
 
-## Single Writer Guarantees
+Although there is no actual concurrent writing in this model, there is still a need to provide snapshot isolation between readers and writers. **MVCC** is leveraged to provide such isolation between the ingestion writer and multiple readers, and also between multiple table service writers and readers. Writes to the table, whether from ingestion or from table services, produce versioned data that is available to readers only after the writes are committed. Until then, readers can access only the previous version of the data.
 
- - *UPSERT Guarantee*: The target table will NEVER show duplicates.
- - *INSERT Guarantee*: The target table wilL NEVER have duplicates if 
[dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is 
enabled.
- - *BULK_INSERT Guarantee*: The target table will NEVER have duplicates if 
[dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is 
enabled.
- - *INCREMENTAL PULL Guarantee*: Data consumption and checkpoints are NEVER 
out of order.
+A single writer with all table services, such as cleaning, clustering and compaction, running inline (as with DeltaStreamer sync-once mode and the Spark datasource with default configs) needs no additional configuration.
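
As a sketch, this inline, single-writer setup corresponds to table-service configs along these lines (values shown illustrate an inline arrangement, not necessarily every stock default):

```properties
# run compaction inline after each write (relevant for merge-on-read tables)
hoodie.compact.inline=true
# automatically clean up older file versions after commits
hoodie.clean.automatic=true
# optionally run clustering inline as well
hoodie.clustering.inline=true
```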
 
-## Multi Writer Guarantees
+#### Single Writer Guarantees
 
-With multiple writers using OCC, some of the above guarantees change as follows
+In this model, the following are the guarantees to expect on [write operations](https://hudi.apache.org/docs/write_operations/):
 
 - *UPSERT Guarantee*: The target table will NEVER show duplicates.
-- *INSERT Guarantee*: The target table MIGHT have duplicates even if 
[dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is 
enabled.
-- *BULK_INSERT Guarantee*: The target table MIGHT have duplicates even if 
[dedup](/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is 
enabled.
+- *INSERT Guarantee*: The target table will NEVER have duplicates if [dedup](https://hudi.apache.org/docs/configurations#hoodiedatasourcewriteinsertdropduplicates) is enabled.

Review Comment:
   @nsivabalan Maybe I am not understanding correctly? Based on 
[dropDuplicates](https://github.com/apache/hudi/blob/6e84cfe3d41f163ab653fea0309489f3e0c76215/hudi-spark-datasource/hudi-spark-common/src/main/java/org/apache/hudi/DataSourceUtils.java#L289)
 it seems to be referring to the record in storage (by tagging) to check for 
duplicates? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
