c21 commented on pull request #28123:
URL: https://github.com/apache/spark/pull/28123#issuecomment-657299848


   > Here are some numbers when I joined two tables (store_sales from TPC-DS - 100 SF) and did `count` on it. It's run on 8 executors (8 cores each) and generates about 47GB of shuffle.
   > 
   > | Bucket Size #1 | Bucket Size #2 | Coalesce On (ms) | Coalesce Off (ms) | Gain (%) |
   > |---|---|---|---|---|
   > | 512 | 256 | 40495 | 48435 | 16.39 |
   > | 512 | 128 | 42459 | 49597 | 14.39 |
   > | 512 | 64 | 45760 | 48888 | 6.40 |
   > | 256 | 128 | 41241 | 49034 | 15.89 |
   > | 256 | 64 | 42902 | 51063 | 15.98 |
   > | 128 | 64 | 44131 | 53192 | 17.03 |
   > 
   > There is a modest 15% gain (for ratio up to 4), WDYT?
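
   The gain column above is just `(coalesce_off - coalesce_on) / coalesce_off * 100`; a quick sanity check on the quoted timings:

   ```python
   # Recompute the Gain (%) column from the quoted timings.
   rows = [
       # (buckets_1, buckets_2, coalesce_on_ms, coalesce_off_ms)
       (512, 256, 40495, 48435),
       (512, 128, 42459, 49597),
       (512, 64, 45760, 48888),
       (256, 128, 41241, 49034),
       (256, 64, 42902, 51063),
       (128, 64, 44131, 53192),
   ]
   for b1, b2, on_ms, off_ms in rows:
       gain = (off_ms - on_ms) / off_ms * 100
       print(f"{b1:>4} {b2:>4} {gain:6.2f}%")
   ```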
   
   @imback82 sounds good. I am refactoring this to make the coalesce cover shuffled hash join as well in https://github.com/apache/spark/pull/29079.
   
   I would like to run the same query you ran here and verify the performance numbers. Could you share which TPC-DS query you used, or the modified query if you changed the benchmark? Thanks.
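
   For reference, the shape I have in mind on my side is roughly the following. The table names and bucket layout here are my assumptions, not necessarily what you ran, and the config names are the ones proposed in #29079, so please double-check against your build:

   ```sql
   -- Sketch only: assumes two bucketed copies of store_sales (TPC-DS, 100 SF)
   -- written with 512 and 256 buckets on ss_item_sk.
   SET spark.sql.bucketing.coalesceBucketsInJoin.enabled=true;
   SET spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio=4;

   SELECT COUNT(*)
   FROM store_sales_b512 s1
   JOIN store_sales_b256 s2
     ON s1.ss_item_sk = s2.ss_item_sk;
   ```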
   
   > Without supporting ColumnarBatch, you cannot enable the wholestage codegen
   
   Sorry if I am missing something, but I don't think that's true. From my understanding, whole-stage codegen is row-based and cannot work with vectorization yet.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
