boneanxs commented on issue #8833:
URL:
https://github.com/apache/incubator-gluten/issues/8833#issuecomment-2689663025
The output consists of two columns, and the partition number is set to 100.
```
Input [3]: [hash_partition_key#1821, strategies#193, attributes#197]
Arguments: hashpartitioning(cast(get_json_object(attributes#197, $.SellerID) as bigint), 100, None), ENSURE_REQUIREMENTS, [strategies#193, attributes#197], [id=#1768], [shuffle_writer_type=hash]
```
After changing the shuffle writer to the sort type, it looks normal now.
Besides, I see that these two thresholds default to very high values. Is that
intentional, to avoid sort-based shuffle (perhaps because hash shuffle is usually faster)?
```scala
val COLUMNAR_SHUFFLE_SORT_PARTITIONS_THRESHOLD =
  buildConf("spark.gluten.sql.columnar.shuffle.sort.partitions.threshold")
    .internal()
    .doc("The threshold to determine whether to use sort-based columnar shuffle. Sort-based " +
      "shuffle will be used if the number of partitions is greater than this threshold.")
    .intConf
    .createWithDefault(100000)

val COLUMNAR_SHUFFLE_SORT_COLUMNS_THRESHOLD =
  buildConf("spark.gluten.sql.columnar.shuffle.sort.columns.threshold")
    .internal()
    .doc("The threshold to determine whether to use sort-based columnar shuffle. Sort-based " +
      "shuffle will be used if the number of columns is greater than this threshold.")
    .intConf
    .createWithDefault(100000)
```
Are there any best practices for choosing between hash-based and sort-based
shuffle when using Gluten? If this selection could be automated, it would simplify
future upgrades to Gluten. At the moment, some jobs perform poorly because of this.
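
To make the question concrete, here is how I read the two doc strings above, written as a small sketch. This is not Gluten's actual selection code: the object name `ShuffleWriterSketch`, the `choose` function, and the assumption that the two conditions are combined with OR are all mine. Only the default values (100000) come from the config entries quoted above.

```scala
// Hypothetical sketch of the threshold rule described by the doc strings.
// It does NOT reproduce Gluten's implementation.
object ShuffleWriterSketch {
  // Defaults copied from the two config entries quoted above.
  val DefaultSortPartitionsThreshold: Int = 100000
  val DefaultSortColumnsThreshold: Int = 100000

  sealed trait WriterType
  case object Hash extends WriterType
  case object Sort extends WriterType

  // Assumption: sort-based shuffle is picked when EITHER the partition
  // count or the column count is strictly greater than its threshold.
  def choose(
      numPartitions: Int,
      numColumns: Int,
      partitionsThreshold: Int = DefaultSortPartitionsThreshold,
      columnsThreshold: Int = DefaultSortColumnsThreshold): WriterType =
    if (numPartitions > partitionsThreshold || numColumns > columnsThreshold) Sort
    else Hash

  def main(args: Array[String]): Unit = {
    // The shuffle from the plan above: 100 partitions, 2 output columns.
    println(choose(numPartitions = 100, numColumns = 2))
    // Same shuffle with a (hypothetical) partition threshold of 50.
    println(choose(numPartitions = 100, numColumns = 2, partitionsThreshold = 50))
  }
}
```

Under this reading, a shuffle with 100 partitions and 2 columns stays on hash with the defaults, which matches the `shuffle_writer_type=hash` in the plan, and it would only switch to sort if one of the thresholds were lowered below the actual count.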