hudi-agent commented on code in PR #18640:
URL: https://github.com/apache/hudi/pull/18640#discussion_r3199840712


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1478,6 +1479,10 @@ private Set<String> getNonStreamingMetadataPartitionsToUpdate() {
    */
   protected Pair<List<HoodieFileGroupId>, HoodieData<HoodieRecord>> tagRecordsWithLocationForStreamingWrites(HoodieData<HoodieRecord> untaggedRecords,
                                                                                                              Set<String> enabledMetadataPartitions) {
+    // no need to tag of the incoming records is empty.

Review Comment:
   🤖 nit: typo — "no need to tag of the incoming records is empty" should 
probably be "no need to tag records if the incoming data is empty".
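   
   A suggestion block with that wording might look like:
   
   ```suggestion
       // no need to tag records if the incoming data is empty.
   ```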
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/OptionsResolver.java:
##########
@@ -459,7 +467,8 @@ public static boolean isPopulateMetaFields(Configuration conf) {
    */
   public static boolean isStreamingIndexWriteEnabled(Configuration conf) {
     return conf.get(FlinkOptions.METADATA_ENABLED)
-        && OptionsResolver.getIndexType(conf) == HoodieIndex.IndexType.GLOBAL_RECORD_LEVEL_INDEX
+        && (OptionsResolver.getIndexType(conf) == HoodieIndex.IndexType.GLOBAL_RECORD_LEVEL_INDEX

Review Comment:
   🤖 nit: `OptionsResolver.getIndexType(conf)` is called twice in the same 
boolean expression — could you extract it to a local variable like the 
surrounding methods do?
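   
   A sketch of the extraction (the second operand inside the new parentheses is not shown in this hunk, so it is left as a placeholder):
   
   ```java
   public static boolean isStreamingIndexWriteEnabled(Configuration conf) {
     // Resolve the index type once instead of re-evaluating it per operand.
     HoodieIndex.IndexType indexType = OptionsResolver.getIndexType(conf);
     return conf.get(FlinkOptions.METADATA_ENABLED)
         && (indexType == HoodieIndex.IndexType.GLOBAL_RECORD_LEVEL_INDEX
             /* || ...second condition elided in this hunk... */);
   }
   ```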
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/partitioner/DynamicBucketAssignFunction.java:
##########
@@ -0,0 +1,164 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.sink.partitioner;
+
+import org.apache.hudi.adapter.KeyedProcessFunctionAdapter;
+import org.apache.hudi.client.common.HoodieFlinkEngineContext;
+import org.apache.hudi.client.model.HoodieFlinkInternalRow;
+import org.apache.hudi.common.model.HoodieTableType;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.hudi.configuration.FlinkOptions;
+import org.apache.hudi.configuration.HadoopConfigurations;
+import org.apache.hudi.configuration.OptionsResolver;
+import org.apache.hudi.hadoop.fs.HadoopFSUtils;
+import org.apache.hudi.sink.event.Correspondent;
+import org.apache.hudi.sink.partitioner.index.DummyPartitionedIndexBackend;
+import org.apache.hudi.sink.partitioner.index.RecordLevelIndexBackend;
+import org.apache.hudi.sink.partitioner.index.PartitionedIndexBackend;
+import org.apache.hudi.sink.partitioner.profile.WriteProfile;
+import org.apache.hudi.sink.partitioner.profile.WriteProfiles;
+import org.apache.hudi.table.action.commit.BucketInfo;
+import org.apache.hudi.table.action.commit.BucketType;
+import org.apache.hudi.util.FlinkTaskContextSupplier;
+import org.apache.hudi.util.FlinkWriteClients;
+import org.apache.hudi.utils.RuntimeContextUtils;
+
+import lombok.Setter;
+import org.apache.flink.api.common.state.CheckpointListener;
+import org.apache.flink.configuration.Configuration;
+import org.apache.flink.runtime.state.FunctionInitializationContext;
+import org.apache.flink.runtime.state.FunctionSnapshotContext;
+import org.apache.flink.runtime.state.KeyGroupRangeAssignment;
+import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
+import org.apache.flink.util.Collector;
+
+/**
+ * Assigns Flink streaming records to dynamic bucket file groups.
+ *
+ * <p>This function first checks the partition-scoped RLI backend for an existing
+ * {@code recordKey -> fileGroupId} mapping. Existing keys are routed as updates to
+ * the recorded file group; new keys are assigned by {@link BucketAssigner} and then
+ * written back to the backend so the streaming metadata writer can persist the assignment to RLI.
+ */
+public class DynamicBucketAssignFunction
+    extends KeyedProcessFunctionAdapter<String, HoodieFlinkInternalRow, HoodieFlinkInternalRow>
+    implements CheckpointedFunction, CheckpointListener {
+
+  private final Configuration conf;
+  private final boolean isInsertOverwrite;
+
+  private transient PartitionedIndexBackend indexBackend;
+  private transient BucketAssigner bucketAssigner;
+
+  @Setter
+  protected transient Correspondent correspondent;
+
+  private transient int maxParallelism;
+  private transient int numTasks;
+  private transient int taskId;
+
+  /**
+   * Creates the dynamic bucket assign function for one bucket assign operator.
+   *
+   * @param conf Flink write configuration
+   */
+  public DynamicBucketAssignFunction(Configuration conf) {
+    this.conf = conf;
+    this.isInsertOverwrite = OptionsResolver.isInsertOverwrite(conf);
+  }
+
+  @Override
+  public void open(Configuration parameters) throws Exception {
+    super.open(parameters);
+
+    HoodieWriteConfig writeConfig = FlinkWriteClients.getHoodieClientConfig(this.conf, !OptionsResolver.isIncrementalJobGraph(conf));
+    HoodieFlinkEngineContext context = new HoodieFlinkEngineContext(
+        HadoopFSUtils.getStorageConfWithCopy(HadoopConfigurations.getHadoopConf(this.conf)),
+        new FlinkTaskContextSupplier(getRuntimeContext()));
+    boolean delta = HoodieTableType.valueOf(conf.get(FlinkOptions.TABLE_TYPE)).equals(HoodieTableType.MERGE_ON_READ);
+    WriteProfile writeProfile = WriteProfiles.singleton(isInsertOverwrite, delta, writeConfig, context);
+    this.bucketAssigner = new BucketAssigner(
+        RuntimeContextUtils.getIndexOfThisSubtask(getRuntimeContext()),
+        RuntimeContextUtils.getMaxNumberOfParallelSubtasks(getRuntimeContext()),
+        RuntimeContextUtils.getNumberOfParallelSubtasks(getRuntimeContext()),
+        writeProfile,
+        writeConfig);
+    this.maxParallelism = RuntimeContextUtils.getMaxNumberOfParallelSubtasks(getRuntimeContext());

Review Comment:
   🤖 nit: `getMaxNumberOfParallelSubtasks`, `getNumberOfParallelSubtasks`, and 
`getIndexOfThisSubtask` are each called twice — once inside the 
`BucketAssigner` constructor args and again to assign the fields. Could you 
assign the three fields first and then pass them to the constructor?
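   
   A sketch of the reordering, using the fields already declared on this class (the later assignments to `taskId` and `numTasks` would fold into this):
   
   ```java
   // Read each runtime-context value once and reuse it.
   this.taskId = RuntimeContextUtils.getIndexOfThisSubtask(getRuntimeContext());
   this.maxParallelism = RuntimeContextUtils.getMaxNumberOfParallelSubtasks(getRuntimeContext());
   this.numTasks = RuntimeContextUtils.getNumberOfParallelSubtasks(getRuntimeContext());
   this.bucketAssigner = new BucketAssigner(taskId, maxParallelism, numTasks, writeProfile, writeConfig);
   ```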
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
