prakharjain09 opened a new pull request #28424: URL: https://github.com/apache/spark/pull/28424
### What changes were proposed in this pull request?

The `ReplaceIntersectWithSemiJoin` Catalyst optimizer rule replaces `Intersect` with a `Distinct` on top of a LeftSemi join. In some cases, pushing the `Distinct` below the join improves performance. This PR adds support for that pushdown, decided from plan statistics.

### Why are the changes needed?

Pushing the `Distinct` down can be more performant when it turns the join into a broadcast join, or when it reduces the number of rows by some fraction.

### Does this PR introduce _any_ user-facing change?

Yes. It adds a new config, `spark.sql.cbo.optimizeIntersect.enabled`, which enables this cost-based optimization.

### How was this patch tested?

Added UTs.

Performance benchmark, SQL query over TPC-DS data at scale 1000:

_(select ss_item_sk from store_sales) intersect (select cs_item_sk from catalog_sales)_

- Time taken before: 3 mins
- Time taken after: 1.8 mins
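To illustrate why the two plan shapes are interchangeable, here is a minimal sketch in plain Python rather than Catalyst code. It models single-column tables as lists and ignores NULL handling and statistics entirely; the helper names (`distinct`, `left_semi_join`, and the two `intersect_*` functions) and the sample data are hypothetical and are not taken from the patch.

```python
def distinct(rows):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(rows))


def left_semi_join(left, right):
    """Keep each left row that has at least one match on the right."""
    right_keys = set(right)
    return [row for row in left if row in right_keys]


def intersect_distinct_on_top(left, right):
    # Shape described for the existing rule: Distinct over a LeftSemi join.
    return distinct(left_semi_join(left, right))


def intersect_distinct_pushed_down(left, right):
    # Alternative shape: deduplicate the inputs first, then join.
    return left_semi_join(distinct(left), distinct(right))


# Toy stand-ins for ss_item_sk and cs_item_sk (made-up values).
store_sales_items = [1, 1, 2, 3, 3, 3, 4]
catalog_sales_items = [3, 3, 4, 4, 5]

on_top = intersect_distinct_on_top(store_sales_items, catalog_sales_items)
pushed = intersect_distinct_pushed_down(store_sales_items, catalog_sales_items)

# Number of left-side rows that reach the join in each shape.
rows_joined_on_top = len(store_sales_items)
rows_joined_pushed = len(distinct(store_sales_items))

print(on_top, pushed, rows_joined_on_top, rows_joined_pushed)
```

Both shapes return the same rows, but the pushed-down shape feeds fewer rows into the join. In this toy model that is the whole benefit; in Spark, a smaller deduplicated input may additionally fit under the broadcast threshold, which is the case the description calls out.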
