Re: [PR] [HUDI-7302] Consistent hashing row writer support sorting [hudi]

via GitHub Sun, 21 Jan 2024 18:25:06 -0800


boneanxs commented on code in PR #10515:
URL: https://github.com/apache/hudi/pull/10515#discussion_r1461279717



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/ConsistentBucketIndexBulkInsertPartitionerWithRows.java:
##########
@@ -105,10 +121,55 @@ public int numPartitions() {
       }
     };
 
-    return rows.sparkSession().createDataFrame(rowJavaRDD
-        .mapToPair(row -> new Tuple2<>(getBucketId(row), row))
-        .partitionBy(partitioner)
-        .values(), rows.schema());
+    if (sortColumnNames != null && sortColumnNames.length > 0) {
+      return rows.sparkSession().createDataFrame(rowJavaRDD
+              .mapToPair(row -> new Tuple2<>(row, row))

Review Comment:
   I don't get the point here.
   For the key, you only need the bucketId, right?
   I mean we can change `.mapToPair(row -> new Tuple2<>(row, row))` to `.mapToPair(row -> new Tuple2<>(getBucketId(row), row))`
   
   For the second issue, doesn't partitionBy + sortWithinPartitions also shuffle only once?
   
   I'm trying to avoid custom comparators such as `CustomRowColumnsComparator` here.
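
A minimal plain-Java sketch of the idea in this comment, with no Spark involved: rows are grouped using only the bucket ID as the key, and each group is then ordered by the sort column, so no comparator over whole rows is needed. `BucketSortSketch`, `Rec`, and `partitionThenSort` are hypothetical names for illustration; they are not part of Hudi or Spark, and the sketch says nothing about how many shuffles Spark would actually perform.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BucketSortSketch {
  // Hypothetical stand-in for a Spark Row: a bucket id plus one sort column.
  static final class Rec {
    final int bucketId;
    final String sortCol;

    Rec(int bucketId, String sortCol) {
      this.bucketId = bucketId;
      this.sortCol = sortCol;
    }
  }

  // Step 1 (analogous to partitionBy): group rows keyed by bucket id only.
  // Step 2 (analogous to sortWithinPartitions): order each group by the sort column.
  static Map<Integer, List<Rec>> partitionThenSort(List<Rec> rows) {
    Map<Integer, List<Rec>> partitions = new TreeMap<>();
    for (Rec r : rows) {
      partitions.computeIfAbsent(r.bucketId, k -> new ArrayList<>()).add(r);
    }
    for (List<Rec> partition : partitions.values()) {
      partition.sort(Comparator.comparing((Rec r) -> r.sortCol));
    }
    return partitions;
  }

  public static void main(String[] args) {
    List<Rec> rows = List.of(
        new Rec(2, "b"), new Rec(1, "z"), new Rec(2, "a"), new Rec(1, "c"));
    // Print each bucket with its rows in sorted order.
    partitionThenSort(rows).forEach((bucket, recs) -> {
      StringBuilder sb = new StringBuilder("bucket " + bucket + ":");
      for (Rec r : recs) {
        sb.append(' ').append(r.sortCol);
      }
      System.out.println(sb);
    });
  }
}
```

The key carries only the bucket ID, and the ordering comes from a comparator on the sort column alone, which is the property the comment is asking for.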



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/ConsistentBucketIndexBulkInsertPartitionerWithRows.java:
##########
@@ -105,10 +121,55 @@ public int numPartitions() {
       }
     };
 
-    return rows.sparkSession().createDataFrame(rowJavaRDD
-        .mapToPair(row -> new Tuple2<>(getBucketId(row), row))
-        .partitionBy(partitioner)
-        .values(), rows.schema());
+    if (sortColumnNames != null && sortColumnNames.length > 0) {
+      return rows.sparkSession().createDataFrame(rowJavaRDD
+              .mapToPair(row -> new Tuple2<>(row, row))
+              .repartitionAndSortWithinPartitions(partitioner, new CustomRowColumnsComparator())
+              .values(),
+          rows.schema());
+    } else if (table.requireSortedRecords() || table.getConfig().getBulkInsertSortMode() != BulkInsertSortMode.NONE) {

Review Comment:
   I'm fine with the current behavior. Sort modes are rarely set by users, and bucket index + partition sort is already a special kind of `PARTITION_PATH_REPARTITION`.
   
   I'm OK with automatically switching to `PARTITION_SORT` so as not to annoy users.
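
The branch order in the hunk above can be sketched as a small decision function. This is an illustration under assumptions, not the PR's code: `SortModeSketch`, `Strategy`, and `resolve` are hypothetical names, the `SortMode` enum is a local stand-in for Hudi's `BulkInsertSortMode`, and the final no-sort fallback is assumed because the hunk does not show the last branch.

```java
class SortModeSketch {
  // Local stand-in for Hudi's BulkInsertSortMode; not the real class.
  enum SortMode {
    NONE, GLOBAL_SORT, PARTITION_SORT,
    PARTITION_PATH_REPARTITION, PARTITION_PATH_REPARTITION_AND_SORT
  }

  // Hypothetical labels for the three outcomes discussed in the review.
  enum Strategy { SORT_BY_USER_COLUMNS, PARTITION_SORT, NO_SORT }

  static Strategy resolve(String[] sortColumnNames,
                          boolean requireSortedRecords,
                          SortMode configured) {
    // Explicit sort columns win, as in the first branch of the diff.
    if (sortColumnNames != null && sortColumnNames.length > 0) {
      return Strategy.SORT_BY_USER_COLUMNS;
    }
    // Any non-NONE mode (or a table that requires sorted records) is
    // treated as a partition sort, instead of failing on unsupported modes.
    if (requireSortedRecords || configured != SortMode.NONE) {
      return Strategy.PARTITION_SORT;
    }
    // Assumed fallback: plain partitioning without sorting.
    return Strategy.NO_SORT;
  }

  public static void main(String[] args) {
    System.out.println(resolve(null, false, SortMode.GLOBAL_SORT));
    System.out.println(resolve(new String[] {"ts"}, false, SortMode.NONE));
    System.out.println(resolve(new String[0], false, SortMode.NONE));
  }
}
```

Under this reading, a user who leaves a mode such as `GLOBAL_SORT` configured still gets a working bulk insert, just with partition-level sorting.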


