asolimando commented on PR #4668:
URL: https://github.com/apache/calcite/pull/4668#issuecomment-3618906931

   > some of the tests seem to have different results, so maybe something is 
wrong
   
   I was under the impression that having the key NOT NULL was enough to lift 
the NULL safety checks, but I was wrong, the wrong results in the quidem test 
for subquery made me realize it.
   
   This is my example from the Jira ticket (taken from the comments in the 
class suggesting the optimization):
   ```
   SELECT e.deptno, e.deptno IN (SELECT deptno FROM emp)
   FROM emp AS e
   ```
   
   This is the current rewrite without this improvement, let's annotate what 
each part does:
   ```
   SELECT e.deptno,
       CASE
       WHEN ct.c = 0 THEN false                     -- (1) empty check
       WHEN e.deptno IS NULL THEN null      -- (2) key NULL check
       WHEN dt.i IS NOT NULL THEN true      -- (3) match found
       WHEN ct.ck < ct.c THEN null                 -- (4) NULLs exist check
       ELSE false                                                -- (5) no match
       END
     FROM emp AS e
     LEFT JOIN (
       (SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
       CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
       ON e.deptno = dt.deptno
   ```
   
   1. used to see if the subquery is empty (as `COUNT(*)` returns one row 
anyway): DROP (we will re-check this later)
   2. if we know the key is NOT NULL, we can drop: DROP
   3. this is needed in any case, we wouldn't know when we matched otherwise: 
KEEP
   4. we can drop, if both sides are NOT NULL, they are trivially identical: 
DROP
   5. needed to as it's the fallback: KEEP
   
   So we are left with:
   ```
   SELECT e.deptno,
       CASE
       WHEN dt.i IS NOT NULL THEN true      -- (3) match found
       ELSE false                                                -- (5) no match
       END
     FROM emp AS e
     LEFT JOIN (
       (SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
       CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
       ON e.deptno = dt.deptno
   ```
   
   and we can see that we don't use `ct` anymore, we can drop it, which leaves 
us with:
   
   ```
   SELECT e.deptno,
       CASE
       WHEN dt.i IS NOT NULL THEN true      -- (3) match found
       ELSE false                                                -- (5) no match
       END
     FROM emp AS e
     LEFT JOIN (
       SELECT DISTINCT deptno, true AS i FROM emp)
     AS dt ON e.deptno = dt.deptno
   ```
   
   We were wondering about 1. (empty subquery check), but we can see it's not 
needed here, if `emp` is empty, then we get all `dt.i` to be `NULL`, so we get 
all `false` for every row.
   
   It's unfortunate that the example in the original comment uses twice the 
same table, but it's not hard to see that the whole reasoning hold if you 
replace the `emp` in the subquery table with something else.
   
   So in the last commit I have pushed the change checking for both the keys 
and the subquery columns to be NOT NULL, and thus applying the optimization 
described, which is not safe if either of the two are NULL, as covered by tests 
in `RelOptRuleTest`.
   
   In the top-most commit I have left some separated test changes which are 
only due to a different order of OR operands after the `CASE` to `OR` rewrite, 
I am not sure where this is coming from, it doesn't alter the semantics.
   
   I hope this helps double checking my understanding and what I implemented in 
this PR, since subqueries are always a tricky subject.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to