YuweiXiao commented on a change in pull request #4441:
URL: https://github.com/apache/hudi/pull/4441#discussion_r814560947



##########
File path: 
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java
##########
@@ -18,24 +18,64 @@
 
 package org.apache.hudi.table;
 
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.io.WriteHandleFactory;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
 /**
  * Repartition input records into at least expected number of output spark partitions. It should give below guarantees -
  * Output spark partition will have records from only one hoodie partition. - Average records per output spark
  * partitions should be almost equal to (#inputRecords / #outputSparkPartitions) to avoid possible skews.
  */
-public interface BulkInsertPartitioner<I> {
+public abstract class BulkInsertPartitioner<I> implements Serializable {
+
+  private WriteHandleFactory defaultWriteHandleFactory;
+  private List<String> fileIdPfx;

Review comment:
       Yes, in order to enable concurrent clustering and upsert to the same file 
group, we have to control how records are routed to file groups during 
clustering (which uses bulk_insert to write records). So in my case, a customized 
ClusteringExecutionStrategy and a customized BulkInsertPartitioner are implemented. 
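       To illustrate the idea (not the actual implementation in this PR), a customized partitioner could route each record to a file group deterministically from its record key, so that a concurrent upsert can predict which file group the clustering job writes a given key to. The class name, the `filegroup-` prefix scheme, and the hash-based bucketing below are all hypothetical simplifications; the real code would plug into `BulkInsertPartitioner` and `WriteHandleFactory`:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical sketch: deterministic record-to-file-group routing, in the
// spirit of the fileIdPfx list added to the abstract BulkInsertPartitioner.
public class DeterministicFileGroupRouter {

  private final List<String> fileIdPfx;

  public DeterministicFileGroupRouter(int numFileGroups) {
    // Pre-generate one file-id prefix per output partition, analogous to
    // the fileIdPfx field in the diff above (prefix scheme is made up here).
    this.fileIdPfx = IntStream.range(0, numFileGroups)
        .mapToObj(i -> "filegroup-" + i)
        .collect(Collectors.toList());
  }

  // Route a record key to a file-id prefix purely from the key itself, so
  // the mapping is stable across the clustering writer and concurrent writers.
  public String fileIdPfxFor(String recordKey) {
    int bucket = Math.floorMod(recordKey.hashCode(), fileIdPfx.size());
    return fileIdPfx.get(bucket);
  }
}
```

       Because the routing depends only on the key and the file-group count, two independent writers configured the same way agree on the target file group without coordination.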




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

