zhztheplayer opened a new pull request, #6009:
URL: https://github.com/apache/incubator-gluten/pull/6009
It's observed that Velox hash-based shuffle is slowed down by small input
batches.
The patch:
1. Adds two options:
- `spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle`
(Default: false) Set to true to combine small batches before shuffle until
they reach a minimum batch size, which is taken from
`spark.gluten.sql.columnar.maxBatchSize`. (Note that `maxBatchSize` is
arguably misnamed in Gluten; it behaves more like a `minBatchSize`.)
- `spark.gluten.sql.columnar.backend.velox.minBatchSizeForShuffle`
(Optional) Set to override the minimum batch size used by
`coalesceBatchesBeforeShuffle`.
2. Includes the necessary code refactoring and cleanups.
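
To illustrate how the two options interact, here is a minimal Python sketch of the coalescing idea. It is not the code in this patch: `resolve_min_batch_size`, `coalesce_batches`, and the list-of-rows `Batch` type are hypothetical names used only for illustration, and the exact buffering policy in Gluten may differ. The sketch assumes the threshold falls back to `maxBatchSize` when `minBatchSizeForShuffle` is unset, as described above.

```python
from typing import Iterable, Iterator, List, Optional

# Hypothetical stand-in for a columnar batch: a plain list of row values.
Batch = List[int]


def resolve_min_batch_size(
    max_batch_size: int,
    min_batch_size_for_shuffle: Optional[int] = None,
) -> int:
    """Return the coalescing threshold in rows.

    Assumption: the optional override wins when set; otherwise the value
    of spark.gluten.sql.columnar.maxBatchSize is used as the minimum.
    """
    if min_batch_size_for_shuffle is not None:
        return min_batch_size_for_shuffle
    return max_batch_size


def coalesce_batches(batches: Iterable[Batch], min_rows: int) -> Iterator[Batch]:
    """Buffer incoming batches and emit one once it holds at least min_rows rows."""
    buffer: Batch = []
    for batch in batches:
        buffer.extend(batch)
        if len(buffer) >= min_rows:
            yield buffer
            buffer = []
    # Flush whatever is left, even if it is below the threshold.
    if buffer:
        yield buffer


if __name__ == "__main__":
    small_batches = [[1, 2], [3], [4, 5, 6], [7]]
    threshold = resolve_min_batch_size(max_batch_size=4)
    for out in coalesce_batches(small_batches, threshold):
        print(len(out), out)
```

In this sketch the shuffle writer would see fewer, larger batches, which is the effect the option is meant to have on hash-based shuffle. The final batch can still be smaller than the threshold because the remainder is flushed at end of input.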
### Comparisons
(by setting
spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle=true):
Q31 total time, before and after:

Closer look at exchange, before and after:

