prakharjain09 opened a new pull request #28424:
URL: https://github.com/apache/spark/pull/28424


   ### What changes were proposed in this pull request?
   The `ReplaceIntersectWithSemiJoin` Catalyst optimizer rule rewrites Intersect as a 
LeftSemi Join with a Distinct on top. In some cases, pushing the Distinct below 
the join improves performance. This PR adds that pushdown as a cost-based 
decision driven by plan statistics.
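
To illustrate why the rewrite is safe, here is a minimal pure-Python sketch of the two plan shapes. It is not Spark code: relations are modelled as plain lists, the helper names are hypothetical, and applying Distinct to both join inputs is an assumption (the PR may choose per side based on statistics).

```python
# Relations are modelled as Python lists; duplicates are allowed.

def distinct(rows):
    """Drop duplicate rows, keeping first-seen order."""
    return list(dict.fromkeys(rows))

def left_semi_join(left, right):
    """Keep each left row that has at least one equal row on the right."""
    right_keys = set(right)
    return [row for row in left if row in right_keys]

def intersect_distinct_on_top(left, right):
    # Shape produced by the existing rule: Distinct above the LeftSemi join.
    return distinct(left_semi_join(left, right))

def intersect_distinct_pushed_down(left, right):
    # Shape discussed in this PR (assumed here: Distinct on both inputs).
    return left_semi_join(distinct(left), distinct(right))

left = [1, 1, 2, 3, 3, 3, 4]
right = [3, 3, 1, 5]

# Both shapes return the same rows; they differ only in how many rows
# flow into the join.
assert intersect_distinct_on_top(left, right) == intersect_distinct_pushed_down(left, right)
```

In the pushed-down shape the join sees 4 and 3 rows instead of 7 and 4, which is where the potential saving comes from.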
   
   ### Why are the changes needed?
   Pushing the Distinct down can be more performant in two cases: when it shrinks 
a join input enough for the join to become a broadcast join, or when it reduces 
the number of rows by a sufficiently large fraction.
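
The two conditions above could be checked roughly as follows. This is a hedged sketch, not the PR's implementation: `SideStats`, `should_push_distinct`, and both threshold parameters are hypothetical names, and the estimates are assumed to come from the optimizer's statistics.

```python
from dataclasses import dataclass

@dataclass
class SideStats:
    """Hypothetical size and row estimates for one join input."""
    size_in_bytes: int            # estimated size before Distinct
    row_count: int                # estimated rows before Distinct
    distinct_size_in_bytes: int   # estimated size after Distinct
    distinct_row_count: int       # estimated rows after Distinct

def should_push_distinct(stats, broadcast_threshold_bytes, min_row_reduction):
    """Return True if pushing Distinct below the join looks worthwhile."""
    # Case 1: the input only fits under the broadcast threshold after Distinct.
    becomes_broadcastable = (
        stats.size_in_bytes > broadcast_threshold_bytes
        and stats.distinct_size_in_bytes <= broadcast_threshold_bytes
    )
    # Case 2: Distinct removes at least the required fraction of rows.
    if stats.row_count == 0:
        reduction = 0.0
    else:
        reduction = 1 - stats.distinct_row_count / stats.row_count
    return becomes_broadcastable or reduction >= min_row_reduction

MB = 1024 * 1024
side = SideStats(50 * MB, 1_000_000, 5 * MB, 900_000)
# Illustrative thresholds only: a 10 MB broadcast limit and a 50% row reduction.
decision = should_push_distinct(side, 10 * MB, 0.5)
```

Here `decision` is true because the input drops below the assumed broadcast limit after Distinct, even though the row count falls by only 10%.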
   
   ### Does this PR introduce _any_ user-facing change?
   Adds a new config, "spark.sql.cbo.optimizeIntersect.enabled", that enables 
this cost-based optimization.
   
   ### How was this patch tested?
   Added unit tests.
   
   Performance Benchmarking
   SQL query over TPC-DS data at scale factor 1000: _(select ss_item_sk from store_sales) 
intersect (select cs_item_sk from catalog_sales)_
   Time taken before: 3 mins
   Time taken after: 1.8 mins
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


