wForget commented on code in PR #3076:
URL: https://github.com/apache/datafusion-comet/pull/3076#discussion_r2719279083
##########
common/src/main/scala/org/apache/comet/CometConf.scala:
##########
@@ -365,6 +365,33 @@ object CometConf extends ShimCometConf {
       .booleanConf
       .createWithDefault(true)
+  val COMET_EXEC_SHUFFLE_WITH_ROUND_ROBIN_PARTITIONING_ENABLED: ConfigEntry[Boolean] =
+    conf("spark.comet.native.shuffle.partitioning.roundrobin.enabled")
+      .category(CATEGORY_SHUFFLE)
+      .doc(
+        "Whether to enable round robin partitioning for Comet native shuffle. " +
+          "This is disabled by default because Comet's round-robin produces different " +
+          "partition assignments than Spark. Spark sorts rows by their binary UnsafeRow " +
+          "representation before assigning partitions, but Comet uses Arrow format which " +
+          "has a different binary layout. Instead, Comet implements round-robin as hash " +
+          "partitioning on all columns, which achieves the same goals: even distribution, " +
+          "deterministic output (for fault tolerance), and no semantic grouping. " +
+          "Sorted output will be identical to Spark, but unsorted row ordering may differ.")
+      .booleanConf
+      .createWithDefault(false)
+
+  val COMET_EXEC_SHUFFLE_WITH_ROUND_ROBIN_PARTITIONING_MAX_HASH_COLUMNS: ConfigEntry[Int] =
+    conf("spark.comet.native.shuffle.partitioning.roundrobin.maxHashColumns")
+      .category(CATEGORY_SHUFFLE)
+      .doc(
+        "The maximum number of columns to hash for round robin partitioning. " +
+          "When set to 0 (the default), all columns are hashed. " +
+          "When set to a positive value, only the first N columns are used for hashing, " +
+          "which can improve performance for wide tables while still providing " +
+          "reasonable distribution.")
+      .intConf
+      .createWithDefault(0)
Review Comment:
Add a `checkValue` so that negative values are rejected:
```
.checkValue(v => v >= 0,
  "The maximum number of columns to hash for round robin partitioning must be non-negative.")
```
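
To illustrate why the check matters, here is a minimal, self-contained sketch. It does not use Comet's real config classes: `IntConf`, `MaxHashColumnsSketch`, and `columnsToHash` are hypothetical stand-ins, and capping N at the column count is an assumption. Only the config key, the predicate, and the error message come from the diff and the suggestion above.

```scala
// Hypothetical stand-in for a typed config entry; NOT Comet's actual ConfigEntry.
final case class IntConf(
    key: String,
    default: Int,
    validator: Int => Boolean = _ => true,
    errorMsg: String = "") {

  // Returns a copy that enforces `validator` on every value that is read.
  def checkValue(validator: Int => Boolean, errorMsg: String): IntConf =
    copy(validator = validator, errorMsg = errorMsg)

  // Parses the raw setting (or falls back to the default), then validates it.
  def get(settings: Map[String, String]): Int = {
    val value = settings.get(key).map(_.trim.toInt).getOrElse(default)
    require(validator(value), s"'$value' in $key is invalid. $errorMsg")
    value
  }
}

object MaxHashColumnsSketch {
  val MaxHashColumns: IntConf =
    IntConf("spark.comet.native.shuffle.partitioning.roundrobin.maxHashColumns", default = 0)
      .checkValue(
        v => v >= 0,
        "The maximum number of columns to hash for round robin partitioning " +
          "must be non-negative.")

  // 0 means "hash every column"; a positive N means "hash the first N columns".
  // Capping at numColumns is an assumption made for this sketch.
  def columnsToHash(numColumns: Int, maxHashColumns: Int): Int =
    if (maxHashColumns == 0) numColumns else math.min(numColumns, maxHashColumns)

  def main(args: Array[String]): Unit = {
    val key = MaxHashColumns.key
    println(columnsToHash(10, MaxHashColumns.get(Map.empty)))       // default 0: all 10 columns
    println(columnsToHash(10, MaxHashColumns.get(Map(key -> "3")))) // first 3 columns
    // A negative value has no meaning here, so the validator rejects it.
    println(scala.util.Try(MaxHashColumns.get(Map(key -> "-1"))).isFailure)
  }
}
```

Without the validator, a value such as `-1` would fall through to whatever code interprets the setting, which the doc string only defines for 0 and positive values.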
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]