Re: [PR] [HUDI-6495][RFC-66] Non-blocking Concurrency Control [hudi]

via GitHub Mon, 16 Oct 2023 05:20:58 -0700


beyond1920 commented on code in PR #7907:
URL: https://github.com/apache/hudi/pull/7907#discussion_r1360569304



##########
rfc/rfc-66/rfc-66.md:
##########
@@ -0,0 +1,318 @@
+# RFC-66: Non-blocking Concurrency Control
+
+## Proposers
+- @danny0405
+- @ForwardXu
+
+## Approvers
+-
+
+## Status
+
+JIRA: [Lockless multi writer 
support](https://issues.apache.org/jira/browse/HUDI-5672)
+
+## Abstract
+Hudi already supports basic OCC with a variety of lock providers.
+But for multiple streaming ingestion writers, OCC does not work well because 
conflicts happen very frequently.
+To expand on this: with a hashing index, all the writers use a deterministic 
hashing algorithm to distribute records by primary key,
+so the keys are evenly distributed across all the data buckets. For a single data 
flush in one writer, almost all the data buckets are appended with new 
inputs,
+so conflicts are very likely with multiple writers because almost 
all the data buckets are being written by multiple writers at the same time.
+For the bloom filter index, things are different, but remember that we have a 
small-file load rebalancing strategy that writes into the **small** buckets with 
higher priority,
+which means multiple writers are prone to write into the same **small** buckets at 
the same time; that is how conflicts happen.
+
+In general, OCC is not very feasible in production for ingestion with multiple 
streaming writers. In this RFC, we propose a non-blocking solution for streaming 
ingestion.
+
+## Background
+
+Streaming jobs are naturally suited to data ingestion: they have no pipeline 
orchestration complexity and a smoother write workload.
+Most of the raw data sets we handle today are generated continuously, in a 
streaming fashion.
+
+Based on that, many requests for multi-writer ingestion have arisen. With 
multi-writer ingestion, several event streams with the same schema can be 
drained into one Hudi table,
+so the Hudi table becomes a UNION table view of all the input data sets. 
This is a very common use case because in reality, data sets are usually 
scattered across data sources.
+
+Another very useful use case we want to unlock is the real-time data set join. 
One of the biggest pain points in streaming computation is the data set join:
+an engine like Flink has basic support for all kinds of SQL JOINs, but it 
stores the input records in its internal state backend, which is a huge cost 
for a pure data join with no additional computation.
+In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced 
a `PartialUpdateAvroPayload`; in combination with the lockless multi-writer,
+we can implement N-way data source joins in real time! Hudi would take care 
of the payload join during the compaction service procedure.
+
+## Design
+
+### The Precondition
+
+#### MOR Table Type Is Required
+
+The table type must be `MERGE_ON_READ`, so that we can defer the conflict 
resolution to the compaction phase. The compaction service would resolve the 
conflicts of the same keys by respecting the event time sequence of the events.
+
+#### Deterministic Bucketing Strategy
+
+A deterministic bucketing strategy is required, because the same record keys 
from different writers should be distributed into the same bucket, not 
only for UPSERTs, but also for all the new INSERTs.

Review Comment:
   Does non-blocking concurrency control only work for insert and update, and not 
for insert overwrite? 
   `insert overwrite` differs from `insert` and `upsert` in the following 
ways:
   1. Generating a fixed file group based on the bucket number is not 
applicable to `insert overwrite`. Using the same file group before and after 
an insert overwrite will lead to incorrect results.
   
   2. If there are multiple writers, the one that fires first may finish later. 
Does the writer that fired first overwrite the data generated by the writer that 
fired later, or does the writer that finished first overwrite the data generated 
by the writer that finished later?
   
   3. Assume we have two commits: t1 -> commit1, t2 -> commit2. Commit2 is 
fired later, but it may complete earlier. A lock might be needed when generating 
the `partitionToReplaceIds` metadata for this job.
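
   To make points 2 and 3 concrete, here is a minimal sketch in plain Python 
(not Hudi code). The `Commit` class, the integer instants, and the rule that the 
latest commit in the chosen ordering wins are all assumptions made for 
illustration; the sketch only shows that ordering by start time and ordering by 
completion time can pick different winners.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    """Hypothetical stand-in for an insert-overwrite commit (not a Hudi class)."""
    name: str
    start: int        # instant at which the writer fired
    completion: int   # instant at which the commit finished


def surviving_commit(commits, key):
    """Return the name of the commit that is last under the given ordering.

    Assumption: the last commit in the chosen ordering is the one whose
    data survives a sequence of insert overwrites.
    """
    return max(commits, key=key).name


# commit1 fires first (t1) but finishes last;
# commit2 fires later (t2) but finishes first.
commits = [
    Commit("commit1", start=1, completion=4),
    Commit("commit2", start=2, completion=3),
]

by_start = surviving_commit(commits, key=lambda c: c.start)
by_completion = surviving_commit(commits, key=lambda c: c.completion)

print("ordered by start time:     ", by_start)
print("ordered by completion time:", by_completion)
```

   Under these assumptions the two orderings disagree, so the RFC would need to 
state which one applies to `insert overwrite`.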



