Re: [PR] perf(spark): Parse bucket index hash-field config once instead of per… [hudi]

via GitHub Fri, 12 Jun 2026 22:21:29 -0700


voonhous commented on code in PR #18979:
URL: https://github.com/apache/hudi/pull/18979#discussion_r3407479318



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkBucketIndexPartitioner.java:
##########
@@ -129,7 +132,7 @@ public int getPartition(Object key) {
     Option<HoodieRecordLocation> location = keyLocation._2;
     int bucketId = location.isPresent()
         ? BucketIdentifier.bucketIdFromFileId(location.get().getFileId())
-        : BucketIdentifier.getBucketId(keyLocation._1.getRecordKey(), 
indexKeyField, numBuckets);
+        : BucketIdentifier.getBucketId(keyLocation._1.getRecordKey(), 
indexKeyFieldList, numBuckets);

Review Comment:
   Good catch -- folded all three Spark sites into this PR: 
`SparkPartitionBucketIndexPartitioner` (the default partition-level 
simple-bucket partitioner), and the two consistent-hashing paths 
`ConsistentBucketIndexBulkInsertPartitionerWithRows` and 
`SingleSparkJobConsistentHashingExecutionStrategy`. Each now precomputes 
`KeyGenUtils.getIndexKeyFields(...)` once and calls the existing `List` 
overload.
   
   I also swept the remaining bucket-index call sites to confirm these were the 
only ones: everything else (`BucketIndexBulkInsertPartitioner`, 
`HoodieBucketIndex` / `HoodieSimpleBucketIndex` / 
`HoodieConsistentBucketIndex`, `SparkConsistentBucketDuplicateUpdateStrategy`, 
and read-side `BucketIndexSupport`) already takes a `List`, so these three were 
the only remaining Spark re-parse sites.
   
   Done in af1b99b.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/SparkBucketIndexPartitioner.java:
##########
@@ -80,7 +83,7 @@ public SparkBucketIndexPartitioner(WorkloadProfile profile,
               + table.getIndex().getClass().getSimpleName());
     }
     this.numBuckets = ((HoodieBucketIndex) table.getIndex()).getNumBuckets();
-    this.indexKeyField = config.getBucketIndexHashField();
+    this.indexKeyFieldList = 
KeyGenUtils.getIndexKeyFields(config.getBucketIndexHashField());

Review Comment:
   Confirmed all four Flink sites re-parse per record 
(`BucketIndexPartitioner`, `BucketIndexRemotePartitioner`, 
`BucketStreamWriteFunction`, and `BucketBulkInsertWriterHelper` -- where the 
String is threaded per record from `Pipelines.java` and needs the static 
signature changed to `List`). To keep this PR Spark-coherent I'll handle the 
Flink mirror as a separate follow-up under #18978 rather than expanding it 
cross-module.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] perf(spark): Parse bucket index hash-field config once instead of per… [hudi]

Reply via email to