YuweiXiao commented on a change in pull request #4441:
URL: https://github.com/apache/hudi/pull/4441#discussion_r814562736
##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java
##########
@@ -18,24 +18,64 @@
package org.apache.hudi.table;
+import org.apache.hudi.common.fs.FSUtils;
+import org.apache.hudi.io.WriteHandleFactory;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.stream.Collectors;
+import java.util.stream.IntStream;
+
/**
 * Repartition input records into at least expected number of output spark partitions. It should give below guarantees -
 * Output spark partition will have records from only one hoodie partition. - Average records per output spark
 * partitions should be almost equal to (#inputRecords / #outputSparkPartitions) to avoid possible skews.
*/
-public interface BulkInsertPartitioner<I> {
+public abstract class BulkInsertPartitioner<I> implements Serializable {
+
+ private WriteHandleFactory defaultWriteHandleFactory;
+ private List<String> fileIdPfx;
Review comment:
The overall design is indeed partitionId -> fileIdPrefix (`fileIdPfxList`) and
partitionId -> writeHandleFactory (the `getWriteHandleFactory` interface). I
will go with organizing those in the constructor, which should make the design
clearer.
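To illustrate the design being discussed, here is a minimal, self-contained sketch of the two mappings organized in the constructor. This is not Hudi's actual implementation: `BulkInsertPartitionerSketch` and `WriteHandleFactoryStub` are hypothetical stand-ins for `BulkInsertPartitioner` and `WriteHandleFactory`, and a plain UUID stands in for the prefix that Hudi derives via `FSUtils`.

```java
import java.io.Serializable;
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical stand-in for Hudi's WriteHandleFactory.
interface WriteHandleFactoryStub extends Serializable {}

// Illustrative sketch: both mappings (partitionId -> fileIdPrefix and
// partitionId -> writeHandleFactory) are set up in the constructor.
abstract class BulkInsertPartitionerSketch<I> implements Serializable {

  // partitionId -> fileIdPrefix: one pre-generated prefix per output partition.
  protected final List<String> fileIdPfxList;

  // Default factory used for every partition unless a subclass overrides.
  private final WriteHandleFactoryStub defaultWriteHandleFactory;

  protected BulkInsertPartitionerSketch(int numOutputPartitions,
                                        WriteHandleFactoryStub defaultFactory) {
    // A random UUID stands in for Hudi's file id prefix generation.
    this.fileIdPfxList = IntStream.range(0, numOutputPartitions)
        .mapToObj(i -> UUID.randomUUID().toString())
        .collect(Collectors.toList());
    this.defaultWriteHandleFactory = defaultFactory;
  }

  // partitionId -> fileIdPrefix lookup.
  public String getFileIdPfx(int partitionId) {
    return fileIdPfxList.get(partitionId);
  }

  // partitionId -> writeHandleFactory lookup; subclasses may specialize.
  public WriteHandleFactoryStub getWriteHandleFactory(int partitionId) {
    return defaultWriteHandleFactory;
  }
}
```

Keeping both lookups keyed by the output partition id, and materializing them once in the constructor, makes each output Spark partition's file id prefix and write handle deterministic for the lifetime of the partitioner.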
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]