c21 commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-657299848
> Here are some numbers from joining two tables (store_sales from TPC-DS, scale factor 100) and running `count` on the result. It ran on 8 executors (8 cores each) and generated about 47GB of shuffle.
>
> | Bucket Size #1 | Bucket Size #2 | Coalesce On (ms) | Coalesce Off (ms) | Gain (%) |
> | --- | --- | --- | --- | --- |
> | 512 | 256 | 40495 | 48435 | 16.39310416 |
> | 512 | 128 | 42459 | 49597 | 14.39199952 |
> | 512 | 64 | 45760 | 48888 | 6.398298151 |
> | 256 | 128 | 41241 | 49034 | 15.8930538 |
> | 256 | 64 | 42902 | 51063 | 15.98221804 |
> | 128 | 64 | 44131 | 53192 | 17.03451647 |
>
> There is a modest 15% gain (for ratio up to 4), WDYT?

@imback82 sounds good. I am refactoring to make the coalescing cover shuffled hash join as well in https://github.com/apache/spark/pull/29079. I would like to run the same query you used here and verify the performance numbers. Could you share which TPC-DS query you ran, or the actual query if you modified it from the benchmark? Thanks.

> Without supporting ColumnarBatch, you cannot enable the wholestage codegen

Sorry if I am missing something, but I don't think that's true. From my understanding, whole-stage codegen is row-based and cannot work with vectorization yet.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
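For anyone reproducing this later: in Spark releases that shipped after this discussion, bucket coalescing in joins is gated behind session configs. The names below are from the post-#29079 era and may differ from the flag as it existed in this PR, so treat them as an assumption rather than the exact knobs used in the benchmark:

```properties
# Enable coalescing the larger bucketed side to match the smaller one in a join
spark.sql.bucketing.coalesceBucketsInJoin.enabled=true
# Skip coalescing when the bucket-count ratio exceeds this (the "ratio up to 4" cutoff)
spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio=4
```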
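As a side note, the gain percentages in the quoted table can be reproduced directly from the raw timings as `(off - on) / off * 100`. A quick sketch (row values copied from the table above):

```python
# (bucket_size_1, bucket_size_2, coalesce_on_ms, coalesce_off_ms), from the table above
rows = [
    (512, 256, 40495, 48435),
    (512, 128, 42459, 49597),
    (512, 64, 45760, 48888),
    (256, 128, 41241, 49034),
    (256, 64, 42902, 51063),
    (128, 64, 44131, 53192),
]

for b1, b2, on_ms, off_ms in rows:
    # Gain = relative reduction in wall-clock time when coalescing is enabled
    gain = (off_ms - on_ms) / off_ms * 100
    print(f"{b1} -> {b2} buckets: {gain:.2f}% faster with coalescing on")
```

The computed values match the Gain (%) column, including the ~6.4% outlier for the 512/64 ratio, which is consistent with the observation that the benefit holds mainly for ratios up to 4.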
