windpiger opened a new pull request #24215: [SPARK-27229][SQL] GroupBy Placement in Intersect Distinct URL: https://github.com/apache/spark/pull/24215 ## What changes were proposed in this pull request? Intersect operator will be replace by Left Semi Join in Optimizer. for example: ``` SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2 ==> SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2 ``` if Tabe1 and Tab2 are too large, the join will be very slow, we can reduce the table data before Join by place groupby operator under join, that is ``` ==> SELECT a1, a2 FROM (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X LEFT SEMI JOIN (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y ON X.a1<=>Y.b1 AND X.a2<=>Y.b2 ``` then we can have smaller table data when execute join, because group by has cut lots of data. ## How was this patch tested? uinit test added
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
