imback82 commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-643047878
Here are some numbers from joining two tables (store_sales from TPC-DS, 100 SF) and running `count` on the result. The job ran on 8 executors (8 cores each) and generated about 47GB of shuffle.

Bucket Size #1 | Bucket Size #2 | Coalesce On (ms) | Coalesce Off (ms) | Gain (%)
-- | -- | -- | -- | --
512 | 256 | 40495 | 48435 | 16.39310416
512 | 128 | 42459 | 49597 | 14.39199952
512 | 64 | 45760 | 48888 | 6.398298151
256 | 128 | 41241 | 49034 | 15.8930538
256 | 64 | 42902 | 51063 | 15.98221804
128 | 64 | 44131 | 53192 | 17.03451647

There is a modest gain of roughly 14-17% for bucket ratios up to 4; at ratio 8 (512 vs. 64) it drops to about 6%. WDYT?
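For anyone who wants to re-check the Gain (%) column, here is a minimal sketch. It assumes the gain is computed as (off - on) / off, which is not stated in the comment but does reproduce the posted values. The helper names `gain_percent` and `summarize` are made up for illustration and are not part of the PR.

```python
# Timings in ms, copied from the comment:
# (bucket size #1, bucket size #2, coalesce on, coalesce off)
RUNS = [
    (512, 256, 40495, 48435),
    (512, 128, 42459, 49597),
    (512, 64, 45760, 48888),
    (256, 128, 41241, 49034),
    (256, 64, 42902, 51063),
    (128, 64, 44131, 53192),
]


def gain_percent(on_ms, off_ms):
    """Time saved with coalescing on, as a percentage of the coalesce-off time.

    Assumption: this is the formula behind the Gain (%) column.
    """
    return (off_ms - on_ms) / off_ms * 100.0


def summarize(runs):
    """Return (bucket ratio, gain rounded to 2 decimals) for each run."""
    return [
        (b1 // b2, round(gain_percent(on_ms, off_ms), 2))
        for b1, b2, on_ms, off_ms in runs
    ]


if __name__ == "__main__":
    for ratio, gain in summarize(RUNS):
        print(f"bucket ratio {ratio}: {gain:.2f}% gain")
```

Grouping by bucket ratio this way makes it easy to see that the ratio-8 run is the outlier.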
