[GitHub] [spark] windpiger opened a new pull request #24215: [SPARK-27229][SQL] GroupBy Placement in Intersect Distinct

GitBox Mon, 25 Mar 2019 20:33:19 -0700

windpiger opened a new pull request #24215: [SPARK-27229][SQL] GroupBy 
Placement in Intersect Distinct
URL: https://github.com/apache/spark/pull/24215
 
 
   ## What changes were proposed in this pull request?
   
   The optimizer replaces the Intersect operator with a Left Semi Join.
   
   For example:
   ```
   SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
    ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND a2<=>b2
   ```
   
   If Tab1 and Tab2 are large, the join will be very slow. We can reduce the 
   data on both sides before the join by placing a GROUP BY operator under 
   each side of the join, that is:
   ```
   ==>  
   SELECT a1, a2 FROM 
      (SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X
      LEFT SEMI JOIN 
      (SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y
   ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
   ```
   
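   The join then processes fewer rows, because the GROUP BY has already 
   removed the duplicate rows from both inputs. The outer DISTINCT is no 
   longer needed: the left side is already deduplicated, and a left semi 
   join emits each left row at most once.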
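
The equivalence of the two plans can be checked on a small example. The following is a minimal sketch in plain Python, not Spark code and not the patch itself: rows are modeled as tuples, `None` stands in for SQL NULL, and the helper names (`distinct`, `left_semi_join`, and the two `intersect_*` functions) are hypothetical. Python tuple equality treats `None == None` as true, which matches the null-safe `<=>` comparison used in the join condition.

```python
def distinct(rows):
    """Model GROUP BY over all columns: keep one copy of each row, in first-seen order."""
    return list(dict.fromkeys(rows))


def left_semi_join(left, right):
    """Keep each left row that has at least one null-safe match in right."""
    right_keys = set(right)  # None == None in Python, mirroring SQL's <=>
    return [row for row in left if row in right_keys]


def intersect_distinct_after_join(tab1, tab2):
    """Existing plan: semi join on the raw inputs, then DISTINCT."""
    return distinct(left_semi_join(tab1, tab2))


def intersect_groupby_before_join(tab1, tab2):
    """Proposed plan: deduplicate both inputs first, then semi join."""
    return left_semi_join(distinct(tab1), distinct(tab2))


# Hypothetical sample data with duplicates and NULLs.
tab1 = [(1, "x"), (1, "x"), (2, None), (2, None), (3, "z")]
tab2 = [(1, "x"), (2, None), (2, None), (4, "w")]

before = intersect_distinct_after_join(tab1, tab2)
after = intersect_groupby_before_join(tab1, tab2)
print(before == after, after)
```

In this model the join in the proposed plan sees three left rows and three right rows instead of five and four. How much a real query gains depends on how many duplicate rows each table holds.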
    
   ## How was this patch tested?
   Unit tests added.

