beliefer opened a new pull request, #42317: URL: https://github.com/apache/spark/pull/42317
### What changes were proposed in this pull request?
Currently, the Spark runtime filter supports a multi-level shuffle join side as the filter creation side; see https://github.com/apache/spark/pull/39170. Although that feature broadened the applicable scenarios and improved performance, some cases are still not supported. Consider the SQL below:
```sql
SELECT *
FROM (
  SELECT * FROM tab1 JOIN tab2 ON tab1.c1 = tab2.c2 WHERE tab2.a2 = 5
) AS a
JOIN tab3 ON tab3.c3 = a.c1
```
With the current implementation, Spark only injects a runtime filter (a bloom filter based on `tab2.a2 = 5`) into tab1. Because there is no direct join between tab3 and tab2, Spark cannot inject the same bloom filter into tab3. However, the SQL above has the join condition `tab3.c3 = a.c1` (i.e. `tab3.c3 = tab1.c1`) between tab3 and tab1, as well as the join condition `tab1.c1 = tab2.c2`. We can rely on the transitivity of these join conditions to derive the virtual join condition `tab3.c3 = tab2.c2`, and then inject the bloom filter based on `tab2.a2 = 5` into tab3 as well.

### Why are the changes needed?
Enhance the Spark runtime filter and improve performance.

### Does this PR introduce _any_ user-facing change?
No. This only updates the internal implementation.

### How was this patch tested?
New tests. Micro benchmark for q75 in TPC-DS.
**2TB TPC-DS**
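The transitive derivation described above can be sketched as a union-find over equi-join columns: columns linked by equality conditions fall into one equivalence class, so a bloom filter created for one column can be propagated to every column in the class. This is an illustrative sketch only, not Spark's actual optimizer rule; the function and column names below are hypothetical.

```python
# Sketch (assumed model, not Spark's implementation): columns are
# "table.column" strings, equi-join conditions are pairs of columns.
# A union-find groups columns into equivalence classes under equality.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def propagation_targets(join_conditions, filter_column):
    """Return every column that can receive a runtime filter created
    for `filter_column`, via transitivity of the equi-join conditions."""
    uf = UnionFind()
    for left, right in join_conditions:
        uf.union(left, right)
    root = uf.find(filter_column)
    return sorted(c for c in uf.parent
                  if c != filter_column and uf.find(c) == root)

# Join conditions from the example SQL:
conds = [("tab1.c1", "tab2.c2"),   # tab1 JOIN tab2 ON tab1.c1 = tab2.c2
         ("tab3.c3", "tab1.c1")]   # tab3 JOIN a ON tab3.c3 = a.c1 (= tab1.c1)

# A bloom filter built from the tab2 side (filtered by tab2.a2 = 5)
# now reaches tab3.c3 as well, via tab3.c3 = tab1.c1 = tab2.c2.
print(propagation_targets(conds, "tab2.c2"))
# → ['tab1.c1', 'tab3.c3']
```

Without the transitive step, only `tab1.c1` would be a propagation target, which matches the limitation this PR removes.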
