c21 commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-657289241
Thank you @maropu and @imback82!

> As for (1) and (3), IMO it's worth digging into them for more improvements.

For (1): I created a PR to cover shuffled hash join as well: https://github.com/apache/spark/pull/29079. Could you help take a look? Thanks.

> I had a rough POC a few months back, where each row is filtered out based on its bucket id, but I never got a chance to run benchmarks. @c21 Do you have some numbers to share? I am wondering how reading multiple copies impacts the overall runtime.

We have some internal numbers from production, e.g. we are seeing around a 50% CPU improvement for specific queries. Comparing divide (reading the input table N times, where N is the ratio between the two tables' bucket counts) with non-divide (one extra shuffle write plus one extra shuffle read): if the bucket ratio between the two join tables is 2 (e.g. joining 512 buckets with 1024 buckets), dividing can be better than non-dividing, since we only do one extra read of the input table but avoid one shuffle write and one shuffle read (though this also depends on the efficiency of the shuffle service). I think we can add a benchmark in `JoinBenchmark` that joins tables with a bucket ratio of 2 to showcase this improvement.

Re: the POC - the overall approach looks good to me, but IMO we should do the coalesce/divide in a physical plan rule, not a logical plan rule. Also, I think we can leave the vectorized code path aside for now, as handling vectorization while doing the per-row filter introduces too much change. Let me know if you are still on it, or I can help with this feature as well. Thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
