zhangyue19921010 commented on code in PR #12884:
URL: https://github.com/apache/hudi/pull/12884#discussion_r1975178378
########## rfc/rfc-89/rfc-89.md: ##########
@@ -0,0 +1,297 @@
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+# RFC-89: Partition Level Bucket Index
+
+## Proposers
+- @zhangyue19921010
+
+## Approvers
+- @danny0405
+- @codope
+- @xiarixiaoyao
+
+## Status
+
+JIRA: https://issues.apache.org/jira/browse/HUDI-8990
+
+## Abstract
+
+Hudi introduced the Bucket Index in RFC-29. The Bucket Index unifies the indexing of Flink and Spark,
+that is, Spark and Flink can upsert the same Hudi table using the bucket index.
+
+However, the Bucket Index is limited to a fixed number of buckets. To solve this problem, RFC-42 proposed
+consistent hashing, which achieves bucket resizing by dynamically splitting or merging local buckets.
+
+Production experience shows, however, that a Partition-Level Bucket Index plus an offline way to rescale buckets is often
+good enough, without introducing additional machinery (multiple writes, clustering, automatic resizing, etc.). The more
+complex the architecture, the more error-prone it is and the greater the operation and maintenance pressure.
+
+To this end, we can upgrade the traditional Bucket Index to a Partition-Level Bucket Index, so that users
+can set a specific number of buckets for different partitions through a rule engine (such as regular-expression matching).
+In addition, for existing partitions, an offline command is provided to reorganize the data using insert
+overwrite (writes to the affected partition need to be stopped).
+
+More importantly, an existing Bucket Index table can be upgraded to the Partition-Level Bucket Index smoothly and seamlessly.
+
+## Background
+The following is the core read/write process of the Flink/Spark engines based on the Simple Bucket Index.
+### Flink Write Using Simple Bucket Index
+**Step 1**: Re-partition input records using `BucketIndexPartitioner`, which has **a fixed bucketNum** for all partition paths.
+For each record key it computes a fixed data partition number and performs the re-partitioning.
+
+```java
+/**
+ * Bucket index input partitioner.
+ * The fields to hash can be a subset of the primary key fields.
+ *
+ * @param <T> The type of obj to hash
+ */
+public class BucketIndexPartitioner<T extends HoodieKey> implements Partitioner<T> {
+
+  private final int bucketNum;
+  private final String indexKeyFields;
+
+  private Functions.Function2<String, Integer, Integer> partitionIndexFunc;
+
+  public BucketIndexPartitioner(int bucketNum, String indexKeyFields) {
+    this.bucketNum = bucketNum;
+    this.indexKeyFields = indexKeyFields;
+  }
+
+  @Override
+  public int partition(HoodieKey key, int numPartitions) {
+    if (this.partitionIndexFunc == null) {
+      this.partitionIndexFunc = BucketIndexUtil.getPartitionIndexFunc(bucketNum, numPartitions);
+    }
+    int curBucket = BucketIdentifier.getBucketId(key.getRecordKey(), indexKeyFields, bucketNum);
+    return this.partitionIndexFunc.apply(key.getPartitionPath(), curBucket);
+  }
+}
+```
+**Step 2**: Use `BucketStreamWriteFunction` to upsert records into Hudi:
+- Bootstrap and cache the `partition_bucket -> fileID` mapping from the existing Hudi table.
+- Tagging: compute the bucket number and tag the `fileID` based on the record key and the bucket number config through `BucketIdentifier`.
+- Buffer and write the records.
+
+### Flink Read Pruning Using Simple Bucket Index
+**Step 1**: Compute the `dataBucket`.
+```java
+  private int getDataBucket(List<ResolvedExpression> dataFilters) {
+    if (!OptionsResolver.isBucketIndexType(conf) || dataFilters.isEmpty()) {
+      return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+    }
+    Set<String> indexKeyFields = Arrays.stream(OptionsResolver.getIndexKeys(conf)).collect(Collectors.toSet());
+    List<ResolvedExpression> indexKeyFilters = dataFilters.stream()
+        .filter(expr -> ExpressionUtils.isEqualsLitExpr(expr, indexKeyFields))
+        .collect(Collectors.toList());
+    if (!ExpressionUtils.isFilteringByAllFields(indexKeyFilters, indexKeyFields)) {
+      return PrimaryKeyPruners.BUCKET_ID_NO_PRUNING;
+    }
+    return PrimaryKeyPruners.getBucketId(indexKeyFilters, conf);
+  }
+```
+**Step 2**: Do partition pruning and get all files in the given partitions.
+**Step 3**: Do bucket pruning on all files from Step 2.
+```java
+  /**
+   * Returns all the file statuses under the table base path.
+   */
+  public List<StoragePathInfo> getFilesInPartitions() {
+    ...
+    // Partition pruning
+    String[] partitions =
+        getOrBuildPartitionPaths().stream().map(p -> fullPartitionPath(path, p)).toArray(String[]::new);
+    if (partitions.length < 1) {
+      return Collections.emptyList();
+    }
+    List<StoragePathInfo> allFiles = ...
+
+    // bucket pruning
+    if (this.dataBucket >= 0) {
+      String bucketIdStr = BucketIdentifier.bucketIdStr(this.dataBucket);
+      List<StoragePathInfo> filesAfterBucketPruning = allFiles.stream()
+          .filter(fileInfo -> fileInfo.getPath().getName().contains(bucketIdStr))
+          .collect(Collectors.toList());
+      logPruningMsg(allFiles.size(), filesAfterBucketPruning.size(), "bucket pruning");
+      allFiles = filesAfterBucketPruning;
+    }
+    ...
+  }
+```
+
+### Spark Write/Read Using Simple Bucket Index
+The Spark read/write process based on the Bucket Index is similar:
+- Use `HoodieSimpleBucketIndex` to tag locations.
+- Use `SparkBucketIndexPartitioner` to pack incoming records into buckets (1 bucket = 1 RDD partition).
+- Use `BucketIndexSupport` for Bucket Index pruning during reading.
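
For intuition (this sketch is not part of the quoted RFC), here is a minimal, hedged illustration of the two primitives both engines rely on above: hashing a record's index key into a bucket id, and pruning data files whose names embed a different bucket id. The class and helper names are hypothetical; Hudi's actual logic lives in `BucketIdentifier` and `BucketIndexSupport`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only: simple bucket hashing plus file-name based bucket pruning.
// Hudi encapsulates this in BucketIdentifier / BucketIndexSupport; names here are hypothetical.
public class SimpleBucketSketch {

  // Hash the index key value into a bucket id in [0, bucketNum).
  static int bucketId(String indexKeyValue, int bucketNum) {
    return (indexKeyValue.hashCode() & Integer.MAX_VALUE) % bucketNum;
  }

  // Data files written by the bucket index embed a zero-padded bucket id in their names,
  // so bucket pruning reduces to a substring match on the file name (as in getFilesInPartitions above).
  static List<String> pruneByBucket(List<String> fileNames, int bucketId) {
    String bucketIdStr = String.format("%08d", bucketId);
    return fileNames.stream()
        .filter(name -> name.contains(bucketIdStr))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    int bucket = bucketId("uuid-42", 16);
    List<String> files = Arrays.asList(
        "00000003-a1b2c3.parquet",
        String.format("%08d-d4e5f6.parquet", bucket));
    System.out.println("bucket=" + bucket + ", kept=" + pruneByBucket(files, bucket));
  }
}
```

Because the bucket id is derived purely from the index key and a per-table bucket number, any engine that agrees on those two inputs lands records in the same bucket, which is what lets Flink and Spark share the index.
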
+## Design
+### Config
+Add a new config named `hoodie.bucket.index.partition.expressions`, defaulting to null. Users can specify the bucket numbers
+for different partitions by configuring a JSON expression. For example:
+```json
+{
+  "expressions": [
+    {
+      "expression": "11-11",
+      "bucketNumber": 10,
+      "rule": "regex"
+    },
+    {
+      "expression": "01-01",
+      "bucketNumber": 20,
+      "rule": "regex"
+    },
+    {
+      "expression": "dt>2025-01-01",
+      "bucketNumber": 20,
+      "rule": "range"
+    }
+  ]
+}
+```
+Partitions matching the different rules get the corresponding bucket number.
+
+We can determine whether the user is currently using the partition-level bucket index based on the value of
+`hoodie.bucket.index.partition.expressions`. If it is null, the processing behavior is exactly the same as the current logic.
+The advantage of this approach is that it is fully compatible with the current design of the table-level bucket index,
+enabling a seamless migration without users even being aware of it.
+
+### Hashing Metadata
+The hashing metadata is persisted as files named `<instant>.simple_hashing_meta` for the current table. It is stored in the
+`.hoodie/.simple_hashing_meta/` directory and contains the following information in a readable encoding:
+
+```avro schema
+{
+  "namespace": "org.apache.hudi.avro.model",
+  "type": "record",
+  "doc": "hashing meta for current table using partition level simple bucket index",
+  "name": "SimpleHashMeta",
+  "fields": [
+    {
+      "name": "version",
+      "type": ["int", "null"],
+      "default": 1
+    },
+    {
+      "name": "partitionMeta",
+      "type": [
+        "null", {
+          "type": "map",
+          "values": {
+            "type": "record",
+            "name": "PartitionHashInfo",
+            "fields": [
+              {
+                "name": "bucketNumber",
+                "type": ["null", "int"],
+                "default": null
+              },
+              {
+                "name": "instant",
+                "type": ["null", "string"],
+                "default": null
+              }
+            ]
+          }
+        }],
+      "default": null
+    },
+    {
+      "name": "defaultBucketNumber",
+      "type": ["int", "null"],
+      "default": 256
+    }
+  ]
+}
+```
+We will write `<instant>.simple_hashing_meta` during the commit action:
+1. Call the TransactionManager to beginTransaction.
+2. Get the last complete T1.simple_hashing_meta.
+3. Merge T1.simple_hashing_meta with the newly written partitions into T2.simple_hashing_meta.
+4. Call the TransactionManager to endTransaction.
+
+### SimpleBucketIdentifier
+```java
+public class SimpleBucketIdentifier extends BucketIdentifier {
+  // use an expression evaluator to compute the bucket number for a given partition path
+  private BucketCalculator calculator;

Review Comment:
First, in a multi-writer scenario, we expect different jobs to load the same expression configuration file. In this way, the in-memory caches of multiple jobs stay synchronized because the calculation logic for partitions is consistent.

Second, in the scenario where different jobs load different versions of the expressions, we conduct two conflict checks:
1. The first check is performed in the partitioner, where we compare the bucket numbers in the cache with the calculated results for conflict detection.
2. The second check is carried out during the commit phase to ensure consistency.

Also, as @LinMingQiang said, hashing_meta cannot be modified during writing.
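
To make the partitioner-side check described above concrete, here is a minimal, hypothetical sketch (illustrative names only, not the classes in this PR) of resolving a partition path to a bucket number from regex expression rules and failing fast when the calculated value disagrees with the value already cached from the hashing metadata:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Pattern;

// Hypothetical sketch of the first (partitioner-side) conflict check described above.
// ExpressionRule / resolveBucketNumber are illustrative names, not the PR's actual classes.
public class PartitionBucketConflictCheckSketch {

  // One entry of hoodie.bucket.index.partition.expressions with rule = "regex".
  static final class ExpressionRule {
    final Pattern pattern;
    final int bucketNumber;

    ExpressionRule(String regex, int bucketNumber) {
      this.pattern = Pattern.compile(regex);
      this.bucketNumber = bucketNumber;
    }
  }

  private final List<ExpressionRule> rules;      // loaded from the expression config file
  private final int defaultBucketNumber;         // fallback when no rule matches
  // bucket numbers already recorded for known partitions (e.g. from <instant>.simple_hashing_meta)
  private final Map<String, Integer> cachedBucketNumbers = new ConcurrentHashMap<>();

  PartitionBucketConflictCheckSketch(List<ExpressionRule> rules, int defaultBucketNumber) {
    this.rules = rules;
    this.defaultBucketNumber = defaultBucketNumber;
  }

  // Resolve the bucket number for a partition path from the regex rules, in order.
  int resolveBucketNumber(String partitionPath) {
    for (ExpressionRule rule : rules) {
      if (rule.pattern.matcher(partitionPath).find()) {
        return rule.bucketNumber;
      }
    }
    return defaultBucketNumber;
  }

  // First conflict check: the freshly calculated bucket number must agree with the cached one;
  // a mismatch means this writer loaded a different expression version and should fail fast.
  int bucketNumberFor(String partitionPath) {
    int calculated = resolveBucketNumber(partitionPath);
    Integer cached = cachedBucketNumbers.putIfAbsent(partitionPath, calculated);
    if (cached != null && cached != calculated) {
      throw new IllegalStateException("Bucket number conflict for partition " + partitionPath
          + ": cached=" + cached + ", calculated=" + calculated);
    }
    return calculated;
  }
}
```

The second check, at commit time, would then re-validate the merged hashing metadata inside the transaction before it is persisted, so two writers that resolved different bucket numbers for the same partition cannot both commit.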

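As a side note on the "Hashing Metadata" section of the quoted RFC, the four commit-time steps could be sketched as follows. The helper and type names are hypothetical and the TransactionManager interface below is a stand-in, not Hudi's actual API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of writing <instant>.simple_hashing_meta during commit,
// following the four steps listed in the RFC. Names are illustrative only.
public class HashingMetaCommitSketch {

  // Stand-in for the transaction facility guarding metadata writes; not Hudi's real class.
  interface TransactionManager {
    void beginTransaction();
    void endTransaction();
  }

  static Map<String, Integer> commitHashingMeta(
      TransactionManager txnManager,
      Map<String, Integer> lastCompleteMeta,          // partition -> bucketNumber from T1
      Map<String, Integer> newlyWrittenPartitions) {  // partitions touched by this commit

    txnManager.beginTransaction();                    // step 1
    try {
      Map<String, Integer> merged = new HashMap<>(lastCompleteMeta); // step 2: start from T1
      // step 3: add newly written partitions; entries for existing partitions must not change
      newlyWrittenPartitions.forEach(merged::putIfAbsent);
      persist(merged);                                // write T2.simple_hashing_meta
      return merged;
    } finally {
      txnManager.endTransaction();                    // step 4
    }
  }

  static void persist(Map<String, Integer> meta) {
    // placeholder: serialize to .hoodie/.simple_hashing_meta/<instant>.simple_hashing_meta
  }
}
```

Keeping the merge inside the transaction, and only ever adding entries for newly seen partitions, matches the constraint in the review comment that hashing metadata for existing partitions cannot be modified during writing.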