This is an automated email from the ASF dual-hosted git repository.

danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
     new 08f02dd2681 [MINOR] Improve the writing of RFC-66 (#10331)
08f02dd2681 is described below

commit 08f02dd2681505c2c8d81c33cda781f3d83f4187
Author: Lin Liu <[email protected]>
AuthorDate: Thu Dec 14 18:12:39 2023 -0800

    [MINOR] Improve the writing of RFC-66 (#10331)
    
    Restructured some sentences without changing the original intention.
---
 rfc/rfc-66/rfc-66.md | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/rfc/rfc-66/rfc-66.md b/rfc/rfc-66/rfc-66.md
index 25c230a9bcb..d3754ca15af 100644
--- a/rfc/rfc-66/rfc-66.md
+++ b/rfc/rfc-66/rfc-66.md
@@ -12,26 +12,26 @@
 JIRA: [Lockless multi writer 
support](https://issues.apache.org/jira/browse/HUDI-5672)
 
 ## Abstract
-As you know, Hudi already supports basic OCC with abundant lock providers.
-But for multi streaming ingestion writers, the OCC does not work well because the conflicts happen in very high frequency.
-Expand it a little bit, with hashing index, all the writers have deterministic hashing algorithm for distributing the records by primary keys,
-all the keys are evenly distributed in all the data buckets, for a single data flushing in one writer, almost all the data buckets are appended with new inputs,
-so the conflict would very possibility happen for mul-writer because almost all the data buckets are being written by multiple writers at the same time;
-For bloom filter index, things are different, but remember that we have a small file load rebalance strategy to writer into the **small** bucket in higher priority,
-that means, multiple writers prune to write into the same **small** buckets at the same time, that's how conflicts happen.
-
-In general, for multiple streaming writers ingestion, OCC is not very feasible in production, in this RFC, we propose a non-blocking solution for streaming ingestion.
 
 ## Background
-
-Streaming jobs are naturally suitable for data ingestion, it has no complexity of pipeline orchestration and has a smother write workload.
-Most of the raw data set we are handling today are generating all the time in streaming way.
-
-Based on that, many requests for multiple writers' ingestion are derived. With multi-writer ingestion, several streaming events with the same schema can be drained into one Hudi table,
-the Hudi table kind of becomes a UNION table view for all the input data set. This is a very common use case because in reality, the data sets are usually scattered all over the data sources.
-
-Another very useful use case we wanna unlock is the real-time data set join. One of the biggest pain point in streaming computation is the dataset join,
-the engine like Flink has basic supports for all kind of SQL JOINs, but it stores the input records within its inner state-backend which is a huge cost for pure data join with no additional computations.
+As you know, Hudi already supports basic OCC with abundant lock providers.
+However, for multi-writer streaming ingestion, OCC does not work well because conflicts happen very frequently.
+For the hashing index, all the writers use a deterministic hashing algorithm on primary keys to distribute records.
+In normal cases, these keys are evenly distributed into all data buckets. That means, in a single data flushing, one writer could append to all the data buckets, and conflicts happen when there are multiple such writers.
+For the bloom filter index, the situation is slightly different. We write into the **small** buckets with higher priority using a small-file load-rebalancing strategy, so multiple writers are prone to write into the same **small** buckets at the same time, which causes conflicts.
+Therefore, OCC does not work well for multi-writer streaming ingestion. In this RFC, we propose a non-blocking solution for streaming ingestion.
+
+Streaming jobs are suitable for data ingestion since they need no complex pipeline orchestration and have a smoother write workload.
+Most of the raw data sets we handle today are generated constantly in a streaming way.
+
+In multi-writer ingestion, several streams of events with the same schema sink into one Hudi table, such that the Hudi table becomes a UNION table view for all the input data sets. This is a common use case in reality since the data could come from various data sources.
+
+Another important use case we want to unlock is the real-time data set join. One of the serious pain points in streaming computation is the dataset join. For example, the Flink engine has basic support for SQL JOINs; meanwhile, it stores the input records in its inner state backend. Such a design is very expensive for pure data joins with no additional computations.
In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced a `PartialUpdateAvroPayload`, in combination with the lockless multi-writer,
we can implement N-ways data sources join in real-time! Hudi would take care of the payload join during compaction service procedure.
 
@@ -65,8 +65,8 @@ The log files generated by a single writer can still preserve the sequence by ve
 
 ### The Compaction Procedure
 
-The compaction service is the duty role that actually resoves the conflicts. Within a file group, it sorts the files then merge all the record payloads for a record key.
-The event time sequence is respected by combining the payloads with even time field provided by the payload (known as the `preCombine` field in Hudi).
+The compaction service is the duty role that actually resolves the conflicts. Within a file group, it sorts the files and then merges all the record payloads for a record key.
+The event time sequence is respected by combining the payloads with the event time field provided by the payload (known as the `preCombine` field in Hudi).
 
 ![compaction procedure](compaction.png)
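As a rough, hypothetical sketch of the hash-index routing described in the abstract above (`BucketRouting` and `bucketFor` are illustrative names, not Hudi's actual bucket-index code): every writer derives the bucket deterministically from the record key, so a single flush with evenly distributed keys touches nearly every bucket, and concurrent writers collide on almost all of them under OCC.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative stand-in for deterministic hash-index routing; not Hudi's API.
public class BucketRouting {
    // Every writer computes the same bucket for the same primary key.
    public static int bucketFor(String recordKey, int numBuckets) {
        return Math.floorMod(recordKey.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        Set<Integer> touchedInOneFlush = new HashSet<>();
        for (int i = 0; i < 1000; i++) {
            touchedInOneFlush.add(bucketFor("key-" + i, numBuckets));
        }
        // With evenly distributed keys, one flush appends to nearly every
        // bucket, so two concurrent writers would conflict on almost all
        // buckets at the same time under OCC.
        System.out.println(touchedInOneFlush.size());
    }
}
```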
 

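The event-time merge that the compaction paragraph describes can be sketched as follows; this is a minimal illustration assuming a simple `Payload` class and keep-latest-event-time semantics, not Hudi's real `preCombine` implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the per-key event-time merge performed during
// compaction; class and field names are illustrative, not Hudi's.
public class EventTimeMerge {
    public static class Payload {
        public final String key;
        public final long eventTime;  // plays the role of the `preCombine` field
        public final String value;
        public Payload(String key, long eventTime, String value) {
            this.key = key;
            this.eventTime = eventTime;
            this.value = value;
        }
    }

    // For each record key, keep the payload with the largest event time,
    // so the result is the same regardless of writer arrival order.
    public static Map<String, Payload> merge(List<Payload> logRecords) {
        Map<String, Payload> merged = new HashMap<>();
        for (Payload p : logRecords) {
            merged.merge(p.key, p, (a, b) -> a.eventTime >= b.eventTime ? a : b);
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Payload> records = List.of(
            new Payload("k1", 10L, "from-writer-A"),
            new Payload("k1", 20L, "from-writer-B"),  // newer event time wins
            new Payload("k2", 5L, "from-writer-A"));
        System.out.println(merge(records).get("k1").value);  // prints "from-writer-B"
    }
}
```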