This is an automated email from the ASF dual-hosted git repository.
danny0405 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git
The following commit(s) were added to refs/heads/master by this push:
new 08f02dd2681 [MINOR] Improve the writing of RFC-66 (#10331)
08f02dd2681 is described below
commit 08f02dd2681505c2c8d81c33cda781f3d83f4187
Author: Lin Liu <[email protected]>
AuthorDate: Thu Dec 14 18:12:39 2023 -0800
[MINOR] Improve the writing of RFC-66 (#10331)
Restructured some sentences, and did not change the original intention.
---
rfc/rfc-66/rfc-66.md | 40 ++++++++++++++++++++--------------------
1 file changed, 20 insertions(+), 20 deletions(-)
diff --git a/rfc/rfc-66/rfc-66.md b/rfc/rfc-66/rfc-66.md
index 25c230a9bcb..d3754ca15af 100644
--- a/rfc/rfc-66/rfc-66.md
+++ b/rfc/rfc-66/rfc-66.md
@@ -12,26 +12,26 @@
JIRA: [Lockless multi writer
support](https://issues.apache.org/jira/browse/HUDI-5672)
## Abstract
-As you know, Hudi already supports basic OCC with abundant lock providers.
-But for multi streaming ingestion writers, the OCC does not work well because
the conflicts happen in very high frequency.
-Expand it a little bit, with hashing index, all the writers have deterministic
hashing algorithm for distributing the records by primary keys,
-all the keys are evenly distributed in all the data buckets, for a single data
flushing in one writer, almost all the data buckets are appended with new
inputs,
-so the conflict would very possibility happen for mul-writer because almost
all the data buckets are being written by multiple writers at the same time;
-For bloom filter index, things are different, but remember that we have a
small file load rebalance strategy to writer into the **small** bucket in
higher priority,
-that means, multiple writers prune to write into the same **small** buckets at
the same time, that's how conflicts happen.
-
-In general, for multiple streaming writers ingestion, OCC is not very feasible
in production, in this RFC, we propose a non-blocking solution for streaming
ingestion.
## Background
-
-Streaming jobs are naturally suitable for data ingestion, it has no complexity
of pipeline orchestration and has a smother write workload.
-Most of the raw data set we are handling today are generating all the time in
streaming way.
-
-Based on that, many requests for multiple writers' ingestion are derived. With
multi-writer ingestion, several streaming events with the same schema can be
drained into one Hudi table,
-the Hudi table kind of becomes a UNION table view for all the input data set.
This is a very common use case because in reality, the data sets are usually
scattered all over the data sources.
-
-Another very useful use case we wanna unlock is the real-time data set join.
One of the biggest pain point in streaming computation is the dataset join,
-the engine like Flink has basic supports for all kind of SQL JOINs, but it
stores the input records within its inner state-backend which is a huge cost
for pure data join with no additional computations.
+As you know, Hudi already supports basic OCC with abundant lock providers.
+However, for multi-writer streaming ingestion, the OCC does not work well
because conflicts would happen very frequently.
+For hashing index, all the writers utilize a deterministic hashing algorithm
on primary keys to distribute records.
+In normal cases, these keys are evenly distributed into all data buckets. That
means, in a single data flushing, one writer could append to
+all the data buckets, and conflicts happen when there are multiple such
writers.
+For bloom filter index, the situation is slightly different. We write into the
**small** buckets with higher priority using a small-file-load-rebalancing
strategy,
+such that multiple writers are prone to write into the same **small** buckets
at the same time, which causes conflicts.
+Therefore, OCC does not work well for multi-writer streaming ingestion. In
this RFC, we propose a non-blocking solution for streaming ingestion.
+
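As a rough illustration of the bucket-index conflict described above (a standalone sketch, not Hudi's actual bucket-identifier code; the class and bucket count are made up for the example): every writer applies the same deterministic hash on the primary key, so one flush of evenly distributed keys touches nearly every bucket, and any two concurrent writers overlap on almost all of them.

```java
import java.util.HashSet;
import java.util.Set;

public class BucketHashingSketch {
    static final int NUM_BUCKETS = 4; // illustrative bucket count

    // Every writer applies the same deterministic function,
    // so a given key always lands in the same bucket for all writers.
    static int bucketFor(String primaryKey) {
        return Math.floorMod(primaryKey.hashCode(), NUM_BUCKETS);
    }

    public static void main(String[] args) {
        // A single flush of evenly distributed keys touches all buckets,
        // so two writers flushing concurrently conflict on every bucket.
        Set<Integer> touched = new HashSet<>();
        for (int i = 0; i < 100; i++) {
            touched.add(bucketFor("key-" + i));
        }
        System.out.println(touched.size()); // prints 4: every bucket was appended to
    }
}
```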
+Streaming jobs are suitable for data ingestion since they do not need complex
pipeline orchestration and have a smoother write workload.
+Most of the raw data sets we handle today are generated continuously in a
streaming way.
+
+In multi-writer ingestion, several event streams with the same schema sink
into one Hudi table, such that the Hudi table becomes
+a UNION table view for all input data sets. This is a common use case in
reality since the data could come from various data sources.
+
+Another important use case we want to unlock is the real-time data set join.
One of the serious pain points in streaming computation is the dataset join.
+For example, the Flink engine has basic support for SQL JOINs; however, it
stores the input records in its inner state backend. Such a design is very
expensive
+for pure data joins with no additional computation.
In [HUDI-3304](https://issues.apache.org/jira/browse/HUDI-3304), we introduced
a `PartialUpdateAvroPayload`, in combination with the lockless multi-writer,
we can implement N-ways data sources join in real-time! Hudi would take care
of the payload join during compaction service procedure.
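The partial-update semantics behind that N-way join can be sketched as follows (a simplified stand-in using plain maps, not the real `PartialUpdateAvroPayload` Avro-based API; the field names are invented for the example): for each field, a record only overwrites values it actually set, so half-populated records from different sources merge into one complete row at compaction time.

```java
import java.util.HashMap;
import java.util.Map;

public class PartialUpdateSketch {
    // Illustrative stand-in for partial-update payload merging:
    // the newer record wins only for fields it actually populated.
    static Map<String, Object> merge(Map<String, Object> older, Map<String, Object> newer) {
        Map<String, Object> out = new HashMap<>(older);
        newer.forEach((field, value) -> {
            if (value != null) {          // null means "this source did not set the field"
                out.put(field, value);
            }
        });
        return out;
    }

    public static void main(String[] args) {
        // Two sources writing the same record key, each knowing only some fields.
        Map<String, Object> fromOrders = new HashMap<>();
        fromOrders.put("id", 1);
        fromOrders.put("amount", 42);
        fromOrders.put("address", null);

        Map<String, Object> fromUsers = new HashMap<>();
        fromUsers.put("id", 1);
        fromUsers.put("amount", null);
        fromUsers.put("address", "Berlin");

        // Compaction-time merge yields the joined row: id, amount, and address all set.
        System.out.println(merge(fromOrders, fromUsers));
    }
}
```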
@@ -65,8 +65,8 @@ The log files generated by a single writer can still preserve
the sequence by ve
### The Compaction Procedure
-The compaction service is the duty role that actually resoves the conflicts.
Within a file group, it sorts the files then merge all the record payloads for
a record key.
-The event time sequence is respected by combining the payloads with even time
field provided by the payload (known as the `preCombine` field in Hudi).
+The compaction service is the component that actually resolves the conflicts.
Within a file group, it sorts the files, then merges all the record payloads
for a record key.
+The event time sequence is respected by combining the payloads with the event
time field provided by the payload (known as the `preCombine` field in Hudi).
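The event-time combine step can be sketched like this (a minimal standalone model, not Hudi's compaction code; the `Rec` shape and field names are assumptions for the example): no matter the order in which writers appended their log records, the record with the greatest event time (the `preCombine` value) wins per key.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EventTimeMergeSketch {
    // Hypothetical record shape: key, event time (the "preCombine" field), payload.
    record Rec(String key, long eventTime, String payload) {}

    // Compaction-style combine: regardless of log-file write order,
    // the record with the greatest event time wins for each key.
    static Map<String, Rec> combine(List<Rec> records) {
        Map<String, Rec> merged = new HashMap<>();
        for (Rec r : records) {
            merged.merge(r.key(), r, (a, b) -> a.eventTime() >= b.eventTime() ? a : b);
        }
        return merged;
    }

    public static void main(String[] args) {
        // The later event arrived first in the logs, yet still wins the merge.
        List<Rec> logs = List.of(
            new Rec("k1", 200L, "updated"),
            new Rec("k1", 100L, "stale"));
        System.out.println(combine(logs).get("k1").payload()); // prints updated
    }
}
```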
