asolimando commented on PR #4668:
URL: https://github.com/apache/calcite/pull/4668#issuecomment-3618906931
> some of the tests seem to have different results, so maybe something is
wrong
I was under the impression that having the key NOT NULL was enough to lift
the NULL safety checks, but I was wrong, the wrong results in the quidem test
for subquery made me realize it.
This is my example from the Jira ticket (taken from the comments in the
class suggesting the optimization):
```
SELECT e.deptno, e.deptno IN (SELECT deptno FROM emp)
FROM emp AS e
```
This is the current rewrite without this improvement, let's annotate what
each part does:
```
SELECT e.deptno,
CASE
WHEN ct.c = 0 THEN false -- (1) empty check
WHEN e.deptno IS NULL THEN null -- (2) key NULL check
WHEN dt.i IS NOT NULL THEN true -- (3) match found
WHEN ct.ck < ct.c THEN null -- (4) NULLs exist check
ELSE false -- (5) no match
END
FROM emp AS e
LEFT JOIN (
(SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
ON e.deptno = dt.deptno
```
1. used to see if the subquery is empty (as `COUNT(*)` returns one row
anyway): DROP (we will re-check this later)
2. if we know the key is NOT NULL, we can drop: DROP
3. this is needed in any case, we wouldn't know when we matched otherwise:
KEEP
4. we can drop, if both sides are NOT NULL, they are trivially identical:
DROP
5. needed to as it's the fallback: KEEP
So we are left with:
```
SELECT e.deptno,
CASE
WHEN dt.i IS NOT NULL THEN true -- (3) match found
ELSE false -- (5) no match
END
FROM emp AS e
LEFT JOIN (
(SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
ON e.deptno = dt.deptno
```
and we can see that we don't use `ct` anymore, we can drop it, which leaves
us with:
```
SELECT e.deptno,
CASE
WHEN dt.i IS NOT NULL THEN true -- (3) match found
ELSE false -- (5) no match
END
FROM emp AS e
LEFT JOIN (
SELECT DISTINCT deptno, true AS i FROM emp)
AS dt ON e.deptno = dt.deptno
```
We were wondering about 1. (empty subquery check), but we can see it's not
needed here, if `emp` is empty, then we get all `dt.i` to be `NULL`, so we get
all `false` for every row.
It's unfortunate that the example in the original comment uses twice the
same table, but it's not hard to see that the whole reasoning hold if you
replace the `emp` in the subquery table with something else.
So in the last commit I have pushed the change checking for both the keys
and the subquery columns to be NOT NULL, and thus applying the optimization
described, which is not safe if either of the two are NULL, as covered by tests
in `RelOptRuleTest`.
In the top-most commit I have left some separated test changes which are
only due to a different order of OR operands after the `CASE` to `OR` rewrite,
I am not sure where this is coming from, it doesn't alter the semantics.
I hope this helps double checking my understanding and what I implemented in
this PR, since subqueries are always a tricky subject.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]