c21 commented on pull request #28123: URL: https://github.com/apache/spark/pull/28123#issuecomment-657289241
Thank you @maropu and @imback82!

> As for (1) and (3), IMO it's worth digging into them for more improvements.

For (1): I created a PR to cover shuffled hash join as well: https://github.com/apache/spark/pull/29079. Could you help take a look? Thanks.

> I had a rough POC a few months back, where each row is filtered out based on its bucket id, but I never got a chance to run benchmarks. @c21 Do you have some numbers to share? I am wondering how reading multiple copies impacts the overall runtime.

We have some internal numbers from production, e.g. we are seeing around a 50% CPU improvement for specific queries. Comparing divide (reading the input table N times, where N is the ratio between the two tables' bucket counts) with non-divide (one extra shuffle write plus one extra shuffle read): if the bucket ratio between the two join tables is 2 (e.g. joining 512 buckets with 1024 buckets), dividing can be better than non-dividing, since we only do one extra read of the input table but avoid one shuffle write and one shuffle read (though this also depends on the efficiency of the shuffle service). I think we can add a benchmark in `JoinBenchmark` that joins tables with a bucket ratio of 2 to showcase this improvement.

Re: the POC - the overall approach looks good to me, but IMO we should do the coalesce/divide in a physical plan rule, not a logical plan rule. Also, I think we can leave the vectorized code path aside for now, as handling vectorization while doing the per-row filter introduces too much change. Let me know if you are still on it, or I can help with this feature as well. Thanks.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
