vinothchandar commented on a change in pull request #4326:
URL: https://github.com/apache/hudi/pull/4326#discussion_r823346613



##########
File path: rfc/rfc-42/rfc-42.md
##########
@@ -0,0 +1,215 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-42: Consistent Hashing Index for Dynamic Bucket Number
+
+
+## Proposers
+
+- @HuberyLee
+- @hujincalrin
+- @stream2000
+- @YuweiXiao
+
+## Approvers
+
+ - @garyli1019
+ - @leesf
+ - @vinothchandar
+
+## Status
+
+JIRA: [HUDI-3000](https://issues.apache.org/jira/browse/HUDI-3000)
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Hudi supports the `Upsert` operation to de-duplicate records in a table, which 
depends on an indexing scheme to look up record locations.
+Among many index options, bucket index (in progress, 
[RFC-29](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index))
 achieves promising Upsert performance, roughly a 3x improvement in throughput 
compared to using Bloom Filter.
+However, it requires pre-configuring a fixed bucket number, which cannot be 
changed afterwards.
+Combined with the one-to-one mapping between hash buckets and file groups, 
Hudi tables with bucket index have some practical issues, such as data skew 
and unbounded file group size, which currently can only be resolved by choosing 
a more suitable bucket number and rewriting the whole table.
+
+This proposal tackles these problems by introducing the **Consistent 
Hashing Index**.
+It achieves bucket resizing by splitting or merging several local buckets 
(i.e., only large file groups) while leaving most buckets untouched.
+This feature allows us to adjust the bucket number dynamically in a background 
service with minimal impact on downstream systems relying on Hudi.
+For example, concurrent readers and writers are not blocked during the 
resizing.
+
+
+## Background
+
+Hudi has supported the primary key concept from day one through a write 
operation called `Upsert`.
+To correctly enforce the uniqueness of keys, `Upsert` performs indexing to 
locate the data file to which each record belongs.
+One of the index implementations is `Bucket Index`, shown in the following 
figure.
+It distributes records to buckets using a hash function, and each bucket 
corresponds to a single file group (i.e., one-to-one mapping).
+This simple yet effective design reduces the time complexity of the key lookup 
to constant time (i.e., hash function computation), bringing good write 
performance.
+
+![bucket index](./basic_bucket_hashing.png)
+
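As a rough illustration of the lookup described above, the following minimal sketch maps a record key to a bucket and then to its single file group. The names `bucket_id` and `file_group_for`, the file group labels, and the use of CRC32 as the hash function are assumptions made for this example only; they are not taken from the RFC or from the Hudi code base.

```python
import zlib


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a fixed hash function.

    CRC32 is an assumption for this sketch; any deterministic hash works.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def file_group_for(record_key: str, file_groups: list) -> str:
    """One-to-one mapping: bucket i is backed by file_groups[i].

    The lookup costs one hash computation plus one list access,
    independent of how many records the table holds.
    """
    return file_groups[bucket_id(record_key, len(file_groups))]


# Hypothetical file group labels, one per bucket.
FILE_GROUPS = ["fg-0", "fg-1", "fg-2", "fg-3"]
```

Because the bucket is derived from `len(file_groups)`, changing the number of buckets changes the result of the modulo for most keys, which is why a fixed bucket number cannot simply be altered in place.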
+However, there are also some limitations.
+As described in 
[RFC-29](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index),
 the one-to-one mapping between buckets and file groups may cause data skew and 
doesn't scale well.
+One solution to address these problems is allowing one bucket to have multiple 
file groups, which in turn requires indexing to be performed inside each bucket.
+
+Another solution, which this proposal chooses, is to adjust the bucket number 
dynamically based on Consistent Hashing.
+In contrast to a standard re-hashing process, which needs to shuffle the 
whole table, Consistent Hashing constrains the re-hashing process to touch 
only a few local buckets (e.g., only large file groups).
+The figure below shows a basic Consistent Hashing algorithm:
+
+![consistent hashing index](./consistent_hashing.png)
+
+The hash value is obtained by computing `Hash(v) % 0xFFFF`, which falls into a 
pre-defined range (i.e., [0, 0xFFFF] in the figure).
+Then a range mapping is applied to the hash value to get the final bucket.
+The figure also demonstrates a local bucket split process, where Bucket #2 is 
split into two child buckets, increasing the total number of buckets by 
one.
+Compared to a traditional hashing scheme, Consistent Hashing introduces an 
extra range mapping layer, linking hash values and buckets.
+<!-- When a large bucket is identified, the corresponding range will be split, 
producing two children buckets containing records rehashed from the original 
bucket. -->
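The range mapping and the local split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proposed implementation: the class `RangeMapping`, its methods, the bucket naming scheme, the even initial partitioning, the midpoint split, and the use of CRC32 are all hypothetical. Only the modulus `0xFFFF` comes from the text.

```python
import bisect
import zlib

HASH_SPACE = 0xFFFF  # modulus from the text: Hash(v) % 0xFFFF


def hash_value(key: str) -> int:
    # CRC32 is an assumption; the result lies in [0, 0xFFFE].
    return zlib.crc32(key.encode("utf-8")) % HASH_SPACE


class RangeMapping:
    """Extra layer between hash values and buckets.

    Bucket i owns hash values in (bounds[i-1], bounds[i]]; the first
    bucket starts at 0 and the last bound covers the whole hash space.
    """

    def __init__(self, num_buckets: int):
        step = HASH_SPACE // num_buckets
        self.bounds = [step * (i + 1) for i in range(num_buckets - 1)]
        self.bounds.append(HASH_SPACE)
        self.names = [f"bucket-{i}" for i in range(num_buckets)]

    def locate(self, key: str) -> str:
        # First bucket whose upper bound is >= the hash value.
        idx = bisect.bisect_left(self.bounds, hash_value(key))
        return self.names[idx]

    def split(self, name: str) -> None:
        """Split one bucket at the midpoint of its range.

        Only the chosen bucket's range changes; all other bounds, and
        therefore all other buckets, are left untouched.
        """
        idx = self.names.index(name)
        lower = self.bounds[idx - 1] if idx > 0 else -1  # exclusive
        upper = self.bounds[idx]
        if upper - lower < 2:
            raise ValueError("range too small to split")
        mid = (lower + upper) // 2
        self.bounds[idx:idx] = [mid]  # new bound for the first child
        self.names[idx:idx + 1] = [f"{name}.0", f"{name}.1"]
```

After `split("bucket-2")` on a four-bucket mapping, keys that previously resolved to `bucket-2` resolve to one of its two children, while keys in every other bucket keep their location, which is the property that avoids rewriting the whole table.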
+
+
+## Implementation
+
+The design considerations and implementation will mostly follow the current 
Bucket Index:
+
+1. Hashing happens at the partition level, i.e., each partition is managed 
independently and will be divided into N buckets.
+2. One-to-one mapping between hash buckets and file groups.
+3. The assignment of records to buckets fits into a standard write path by 
overriding the `tag location` process.
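A minimal sketch of points 1 and 3 above, assuming a hypothetical `tag_locations` helper that stands in for the `tag location` step. The function name, the record representation as `(partition_path, record_key)` pairs, and the CRC32 hash are assumptions for illustration and do not reflect Hudi's actual interfaces.

```python
import zlib
from collections import defaultdict


def tag_locations(records, buckets_per_partition):
    """Assign each record to a (partition, bucket) location.

    records: iterable of (partition_path, record_key) pairs.
    buckets_per_partition: dict mapping partition_path -> N, so every
    partition is hashed independently with its own bucket count.
    """
    tagged = defaultdict(list)
    for partition, key in records:
        n = buckets_per_partition[partition]
        bucket = zlib.crc32(key.encode("utf-8")) % n
        tagged[(partition, bucket)].append(key)
    return dict(tagged)
```

Keeping the bucket count per partition means a resize in one partition never affects record locations in another.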
+
+To further support Consistent Hashing, we also need to consider:

Review comment:
       We are on the same page



