francis0407 commented on issue #24344: [SPARK-27440][SQL] Optimize uncorrelated 
predicate subquery
URL: https://github.com/apache/spark/pull/24344#issuecomment-483272960
 
 
   Hi,
   I'm back with some ideas. I have not found any SQL system that implements 
the conversion. But IMO, this method still can be used with some restrictions.
   
   I think the question we need to pay more attention to is that `EXISTS` is 
not nullable while `IN` could be `null` when using `NOT IN (subquery with 
nulls)`. Remember our goal is to avoid collecting all rows of the subquery's 
result or using a join. Instead, we need to push the outer condition into the 
subquery  (e.g. `a in select b` => `where a = b`). 
   
   In conclusion, we can have three cases here:
   1. For uncorrelated `EXISTS(subquery)`, we rewrite it as `EXISTS(select 1 
from ... where ... limit 1)`
   2. For uncorrelated `a IN(select b ...)`,  we rewrite it as `EXISTS(select 1 
from ... where a = b and ... limit 1)`
   3. For uncorrelated `a NOT IN(select b ...)`,  we can define an aggregate 
function `agg` (not sure about the name) here.  It is initialized with `false` 
, returns `null` if there are nulls in the rows, and returns `true` if the 
result set is not empty and has no nulls. Then we can rewrite the subquery as a 
scalar subquery: `NOT (select agg(b) from ... where (a = b or b is null))`.
    
   
   I'm not sure whether this method is right, corrections are truly welcomed!  

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to