boneanxs commented on code in PR #10515:
URL: https://github.com/apache/hudi/pull/10515#discussion_r1461279717
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/ConsistentBucketIndexBulkInsertPartitionerWithRows.java:
##########
@@ -105,10 +121,55 @@ public int numPartitions() {
}
};
- return rows.sparkSession().createDataFrame(rowJavaRDD
- .mapToPair(row -> new Tuple2<>(getBucketId(row), row))
- .partitionBy(partitioner)
- .values(), rows.schema());
+ if (sortColumnNames != null && sortColumnNames.length > 0) {
+ return rows.sparkSession().createDataFrame(rowJavaRDD
+ .mapToPair(row -> new Tuple2<>(row, row))
Review Comment:
Don't get the point here.
For key, you only need bucketId, right?
I mean we can change `.mapToPair(row -> new Tuple2<>(row, row))` to
`.mapToPair(row -> new Tuple2<>(getBucketId(row), row))`
For the second issue, doesn't partitionBy + sortWithinPartitions also shuffle
only once?
I'm trying to avoid customized comparators here like
`CustomRowColumnsComparator`
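
To make the suggestion concrete, here is a minimal plain-Java sketch (no Spark involved) of the two steps: key each row by its bucket id only, then sort each resulting partition locally with an ordinary column comparator. `Rec`, `bucketId`, and `partitionThenSort` are hypothetical names used for illustration; they are not Hudi or Spark APIs, and the hash-based `bucketId` is only a stand-in for `getBucketId(row)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BucketThenSortSketch {

    // Hypothetical stand-in for a Spark Row: a record key plus one sort column.
    static final class Rec {
        final String recordKey;
        final int sortCol;

        Rec(String recordKey, int sortCol) {
            this.recordKey = recordKey;
            this.sortCol = sortCol;
        }

        @Override
        public String toString() {
            return recordKey + ":" + sortCol;
        }
    }

    // Hypothetical stand-in for getBucketId(row): hash the record key into numBuckets.
    static int bucketId(Rec r, int numBuckets) {
        return Math.floorMod(r.recordKey.hashCode(), numBuckets);
    }

    // Step 1 (models partitionBy): group rows by bucket id; the key is an int, not the row.
    // Step 2 (models sortWithinPartitions): sort each group locally by the sort column.
    static Map<Integer, List<Rec>> partitionThenSort(List<Rec> rows, int numBuckets) {
        Map<Integer, List<Rec>> partitions = new TreeMap<>();
        for (Rec r : rows) {
            partitions.computeIfAbsent(bucketId(r, numBuckets), k -> new ArrayList<>()).add(r);
        }
        for (List<Rec> partition : partitions.values()) {
            partition.sort(Comparator.comparingInt((Rec r) -> r.sortCol));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Rec> rows = Arrays.asList(
            new Rec("k1", 30), new Rec("k2", 10), new Rec("k3", 20),
            new Rec("k4", 5), new Rec("k5", 40));
        partitionThenSort(rows, 2).forEach(
            (bucket, recs) -> System.out.println("bucket " + bucket + " -> " + recs));
    }
}
```

The point of the sketch is that the partitioning key carries no sort information, so no comparator over whole rows is needed; the sort is a separate, purely local step after the rows have been placed.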
##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/ConsistentBucketIndexBulkInsertPartitionerWithRows.java:
##########
@@ -105,10 +121,55 @@ public int numPartitions() {
}
};
- return rows.sparkSession().createDataFrame(rowJavaRDD
- .mapToPair(row -> new Tuple2<>(getBucketId(row), row))
- .partitionBy(partitioner)
- .values(), rows.schema());
+ if (sortColumnNames != null && sortColumnNames.length > 0) {
+ return rows.sparkSession().createDataFrame(rowJavaRDD
+ .mapToPair(row -> new Tuple2<>(row, row))
+ .repartitionAndSortWithinPartitions(partitioner, new CustomRowColumnsComparator())
+ .values(),
+ rows.schema());
+ } else if (table.requireSortedRecords() || table.getConfig().getBulkInsertSortMode() != BulkInsertSortMode.NONE) {
Review Comment:
I'm fine with the current behavior. Sort modes are rarely set explicitly by
users, and bucket index + partition sort is already a special case of
`PARTITION_PATH_REPARTITION`.
I'm OK with automatically switching to `PARTITION_SORT` to avoid annoying users.