francis0407 commented on issue #24344: [SPARK-27440][SQL] Optimize uncorrelated 
predicate subquery
URL: https://github.com/apache/spark/pull/24344#issuecomment-483642974
 
 
   I remember that for `InSubquery`, we do not distinguish NULL from FALSE 
subquery results currently.  Here is an example:
   We have two tables:
   ```
   scala> spark.sql("select * from t1").show()
   +---+----+
   |t1a| t1b|
   +---+----+
   |  1|   1|
   |  2|null|
   +---+----+
   
   
   scala> spark.sql("select * from t2").show()
   +---+----+
   |t2a| t2b|
   +---+----+
   |  1|   1|
   |  2|null|
   |  3|   3|
   +---+----+
   ```
   Then the query:
   ```
   scala> spark.sql("select * from t1 where (t1b in (select t2b from t2)) is 
null").show()
   +---+---+
   |t1a|t1b|
   +---+---+
   +---+---+
   ```
   
   If we test this in Postgres:
   ```
   test=# select * from t1;
    t1a | t1b 
   -----+-----
      1 |   1
      2 |    
   (2 rows)
   
   test=# select * from t2;
    t2a | t2b 
   -----+-----
      1 |   1
      2 |    
      3 |   3
   (3 rows)
   
   test=# select * from t1 where (t1b in (select t2b from t2)) is null;
    t1a | t1b 
   -----+-----
      2 |    
   (1 row)
   ```
   I'm not sure whether this should be regarded as a BUG.
   
   By the way, I just find that MySQL has an optimization rule 'EXISTS 
Strategy'(https://dev.mysql.com/doc/refman/8.0/en/subquery-optimization-with-exists.html).
 It rewrites some `InSubquery` as `EXISTS` with some restrictions.
   
   As for `where a = b and c = d`, we currently rewrite multi-column 
`InSubquery` as semi-join with condition `a = b and c = d`. When `a` and `c` 
are literals, I think `t1 semi join t2 on a = b and c = d` is equivalent to 
`where a = b and c = d`.
   So how about convert `(a, b) in (select c, d from ... where ...)` into 
`exists(select 1 from ... where a = c and b = d)` and convert `(a, b) not in 
(select c, d from ... where ...)` into `not exists (select 1 from ... where 
(a=c or (a=c) is null) and (b=d or (b=d) is null)`
   
   
   
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to