YuweiXiao commented on a change in pull request #4480:
URL: https://github.com/apache/hudi/pull/4480#discussion_r816410681



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/bucket/HoodieBucketIndex.java
##########
@@ -68,62 +69,46 @@ public HoodieBucketIndex(HoodieWriteConfig config) {
       HoodieData<HoodieRecord<R>> records, HoodieEngineContext context,
       HoodieTable hoodieTable)
       throws HoodieIndexException {
-    HoodieData<HoodieRecord<R>> taggedRecords = records.mapPartitions(recordIter -> {
-      // partitionPath -> bucketId -> fileInfo
-      Map<String, Map<Integer, Pair<String, String>>> partitionPathFileIDList = new HashMap<>();
-      return new LazyIterableIterator<HoodieRecord<R>, HoodieRecord<R>>(recordIter) {
+    // initialize necessary information before tagging. e.g., hashing metadata
+    List<String> partitions = records.map(HoodieRecord::getPartitionPath).distinct().collectAsList();
+    LOG.info("Initializing hashing metadata for partitions: " + partitions);
+    initialize(hoodieTable, partitions);
 
-        @Override
-        protected void start() {
+    return records.mapPartitions(iterator ->
+        new LazyIterableIterator<HoodieRecord<R>, HoodieRecord<R>>(iterator) {
 
-        }
+          @Override
+          protected void start() {

Review comment:
       Yeah, good idea. It is a utility class that has existed for a long time; 
I will add a default implementation to it.
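
For context, here is a minimal sketch of what such a default implementation could look like. It is a hypothetical, simplified stand-in and not the actual `LazyIterableIterator` code: the names `LazyIteratorSketch`, `LazyIteratorDemo`, and `tagAll` are illustrative, and the `end()` and `computeNext()` hooks are assumed alongside the `start()` hook shown in the diff.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical, simplified lazy iterator wrapper (an assumption, not Hudi's class).
abstract class LazyIteratorSketch<I, O> implements Iterator<O> {
  protected final Iterator<I> input;
  private boolean started = false;
  private boolean ended = false;

  protected LazyIteratorSketch(Iterator<I> input) {
    this.input = input;
  }

  // Default no-op hook: subclasses override it only when they need setup.
  protected void start() {}

  // Default no-op hook: subclasses override it only when they need cleanup.
  protected void end() {}

  // The only method every subclass has to provide.
  protected abstract O computeNext();

  @Override
  public boolean hasNext() {
    if (!started) {
      start();
      started = true;
    }
    boolean more = input.hasNext();
    if (!more && !ended) {
      end();
      ended = true;
    }
    return more;
  }

  @Override
  public O next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return computeNext();
  }
}

class LazyIteratorDemo {
  // With default hooks, an anonymous subclass no longer needs empty start()/end() bodies.
  static List<String> tagAll(List<String> keys) {
    Iterator<String> tagged = new LazyIteratorSketch<String, String>(keys.iterator()) {
      @Override
      protected String computeNext() {
        return "tagged:" + input.next();
      }
    };
    List<String> out = new ArrayList<>();
    tagged.forEachRemaining(out::add);
    return out;
  }
}
```

With defaults in place, call sites like the one in this diff would only override `start()` when there is real setup work to do.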




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
