francis0407 edited a comment on issue #24344: [SPARK-27440][SQL] Optimize 
uncorrelated predicate subquery
URL: https://github.com/apache/spark/pull/24344#issuecomment-495108399
 
 
   Thanks @dilipbiswal @cloud-fan.
   I'm happy to try these; I just want to contribute to the project. But I 
think we should discuss this in more depth here, to figure out which approach 
is better. 
   First, let me summarize what we have discussed in this PR so far.
   
   * At the beginning, I tried to transform every predicate subquery into 
EXISTS. But in 
https://github.com/apache/spark/pull/24344#issuecomment-483642974, we found a 
bug in the current implementation of InSubquery (it does not handle nulls 
correctly), and opened a new issue, 
[SPARK-27572](https://issues.apache.org/jira/browse/SPARK-27572?filter=-2). In 
short, not every InSubquery can be converted to a semi/anti join or to Exists 
(see the example in 
https://github.com/apache/spark/pull/24344#issuecomment-483642974). After 
realizing this, I gave up on converting InSubquery to Exists and tried adding 
a physical plan for them instead.
   
   * Another discussion is about optimizing non-correlated subqueries. I 
tried optimizing EXISTS using `project 1, limit 1` to reduce the result set, 
and optimizing InSubquery using `push down the left value, project the equation 
and use 'distinct'`. With these optimizations, the result set can contain 
only **1 or 2 rows (null or true)**, and all of the computation is done on the 
executor side. But after @cloud-fan's reminder, I realized that this can be 
generalized to semi/anti joins.
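
   To make the null problem and the "1 or 2 rows" claim concrete, here is a 
minimal plain-Python sketch (not Spark code). It models generic SQL 
three-valued logic with `None` standing in for NULL; it is not the example from 
the linked comment, and the function names are made up for illustration. It 
assumes the reduced result set keeps only the distinct non-false values of the 
equation.

```python
from typing import Iterable, List, Optional


def sql_eq(a, b) -> Optional[bool]:
    """SQL equality: comparing anything with NULL (None) yields NULL (None)."""
    if a is None or b is None:
        return None
    return a == b


def in_subquery(value, sub_rows: Iterable) -> Optional[bool]:
    """Three-valued `value IN (subquery)`.

    Project the equation, take the distinct results and drop False:
    the reduced set then has at most two members, True and/or None.
    """
    reduced = {sql_eq(value, x) for x in sub_rows} - {False}
    if True in reduced:
        return True
    if None in reduced:
        return None
    return False


def filter_not_in(left: List, sub_rows: List) -> List:
    """WHERE value NOT IN (subquery): keep rows where IN is strictly False."""
    return [v for v in left if in_subquery(v, sub_rows) is False]


def left_anti_join(left: List, sub_rows: List) -> List:
    """Plain left anti join on equality: keep rows with no matching right row."""
    return [v for v in left if not any(sql_eq(v, x) is True for x in sub_rows)]
```

   With `left = [1, 2, 3]` and a subquery returning `[1, None]`, the NOT IN 
filter keeps nothing, while the plain anti join keeps `[2, 3]`. That difference 
is why a naive conversion is not safe when nulls are involved.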
   
   Now let's discuss the optimization for non-correlated semi/anti joins.
   
   If I'm not mistaken, by 'turn this join to a filter' @cloud-fan meant that 
we can use a physical plan for the non-correlated semi join (in fact, it's the 
same as EXISTS). That is a great idea, much better than the approach used in 
this PR! It's more general and extensible, and it will still apply once the 
NULL bug is fixed. 
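
   As I understand the idea, it can be sketched in plain Python like this (not 
Spark code; the function names are hypothetical, and I assume the join has no 
condition at all). The right side is evaluated once with `project 1, limit 1`, 
and the join becomes a filter on a constant.

```python
from itertools import islice
from typing import Iterable, List


def exists(sub_rows: Iterable) -> bool:
    """Uncorrelated EXISTS via `project 1, limit 1`: pulls at most one row."""
    return len(list(islice((1 for _ in sub_rows), 1))) == 1


def semi_join_as_filter(left: List, sub_rows: Iterable, anti: bool = False) -> List:
    """Uncorrelated semi/anti join as a filter.

    The right side is checked once; the outcome is a constant that
    either keeps every left row or drops every left row.
    """
    keep = exists(sub_rows) != anti
    return [row for row in left if keep]
```

   For example, `semi_join_as_filter([1, 2], [9])` keeps both rows, and the 
same call with `anti=True` drops both.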
   
   I suggest we close this PR and the issue, and open a new one for the 
optimization of non-correlated semi/anti joins (hmm... I'm not sure about the 
name). What do you think?
   
   
   
   
   
   
