[PR] [VL] Provide options to combine small batches before sending to shuffle [incubator-gluten]

via GitHub Thu, 06 Jun 2024 19:43:56 -0700


zhztheplayer opened a new pull request, #6009:
URL: https://github.com/apache/incubator-gluten/pull/6009


   Velox hash-based shuffle has been observed to slow down when its input 
batches are small.
   
   The patch:
   
   1. Adds two options:
      - `spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle`
         (Default: false) Set to true to combine small batches before shuffle. 
The minimum batch size is taken from `spark.gluten.sql.columnar.maxBatchSize`. 
(Note that `maxBatchSize` is arguably misnamed in Gluten; it would more 
accurately be called `minBatchSize`.)
      - `spark.gluten.sql.columnar.backend.velox.minBatchSizeForShuffle`
         (Optional) Set to override the minimum batch size used by 
`coalesceBatchesBeforeShuffle`.
   2. Performs the necessary code refactoring and cleanup. 
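
To illustrate the idea behind the two options, here is a minimal Python sketch, not the code in this patch. It assumes batch size is measured in rows and models a batch as a plain list. `coalesce_batches` and `GLUTEN_CONF` are hypothetical names, and `4096` is an arbitrary example value rather than a documented default.

```python
from typing import Iterable, Iterator, List

# Option names come from the PR description; the values are illustrative only.
GLUTEN_CONF = {
    "spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle": "true",
    # Optional override; 4096 is an arbitrary example, not a documented default.
    "spark.gluten.sql.columnar.backend.velox.minBatchSizeForShuffle": "4096",
}


def coalesce_batches(
    batches: Iterable[List[int]], min_rows: int
) -> Iterator[List[int]]:
    """Concatenate consecutive batches until each emitted batch has at
    least `min_rows` rows (hypothetical helper, not Gluten's API)."""
    buffer: List[int] = []
    for batch in batches:
        buffer.extend(batch)
        if len(buffer) >= min_rows:
            yield buffer
            buffer = []
    if buffer:
        # The trailing batch may be smaller than min_rows.
        yield buffer
```

With `min_rows=3`, five small input batches such as `[[1], [2, 3], [4], [5, 6, 7], [8]]` become three larger ones, so the shuffle writer is invoked fewer times for the same rows.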
   
   ### Comparisons
   (by setting 
spark.gluten.sql.columnar.backend.velox.coalesceBatchesBeforeShuffle=true):
   
   Q31 total time, before and after:
   
![image](https://github.com/apache/incubator-gluten/assets/11284395/718583ad-15a2-473b-86b5-94189c2c5c9b)
   
   Closer look at exchange, before and after:
   
   
![image](https://github.com/apache/incubator-gluten/assets/11284395/3db6bad1-0a04-4879-a923-4ec2b0ee4fce)
   
   
![image](https://github.com/apache/incubator-gluten/assets/11284395/93d55924-d8ef-43f5-b994-c8333fa3d8ab)
   


