[ 
https://issues.apache.org/jira/browse/CALCITE-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Solimando updated CALCITE-7317:
------------------------------------------
    Description: 
SubQueryRemoveRule.rewriteIn() currently generates COUNT aggregates for 
NULL-safety checks in all IN subqueries, even when both the keys (left side) 
and the subquery result are NOT NULL.

When both are NOT NULL, the evaluation reduces to TRUE/FALSE, so no extra logic 
is needed to account for NULL values.

Current behavior:
{noformat}
SELECT e.deptno, e.deptno IN (SELECT deptno FROM emp)
FROM emp AS e{noformat}
Generates plan with expensive NULL-safety logic:
{noformat}
SELECT e.deptno,
  CASE
  WHEN ct.c = 0 THEN false
  WHEN e.deptno IS NULL THEN null
  WHEN dt.i IS NOT NULL THEN true
  WHEN ct.ck < ct.c THEN null
  ELSE false
  END
FROM emp AS e
LEFT JOIN (
  (SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
  CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
  ON e.deptno = dt.deptno{noformat}
When the key (e.deptno) and the subquery result (deptno) are NOT NULL, the 
expected result is to skip the COUNT aggregates and use simpler anti-join, as 
null safety is not needed:
{noformat}
SELECT e.deptno,
  CASE
  WHEN dt.i IS NOT NULL THEN true
  ELSE false
  END
FROM emp AS e
LEFT JOIN (SELECT DISTINCT deptno, true AS i FROM emp) AS dt
  ON e.deptno = dt.deptno {noformat}
This has a significant performance impact, as it eliminates a CROSS JOIN with 
aggregate computation, significantly reducing query cost.


Existing comments already hint into this optimization 
([SubQueryRemoveRule.java:598-607|https://github.com/apache/calcite/blob/f2df771d86f52e15fba8f626eaeb8b0b9b944823/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L598-L607]):
{noformat}
// If keys are not null we can remove "ct" and simplify to
//
// select e.deptno,
//   case
//   when dt.i is not null then true
//   else false
//   end
// from emp as e
// left join (select distinct deptno, true as i from emp) as dt
//   on e.deptno = dt.deptno{noformat}
Proposed Solution:
Conditionally generate the COUNT aggregates only when we detect we need null 
safety (either the keys or subquery results are nullable).

  was:
SubQueryRemoveRule.rewriteIn() currently generates COUNT aggregates for 
NULL-safety checks in all NOT IN subqueries, regardless of whether the needle 
(left operand) can be NULL.

When the needle is guaranteed NOT NULL (e.g., COALESCE(col, default)), these 
checks are unnecessary. NULL values in the haystack (subquery results) cannot 
match a NOT NULL needle anyway, so the expensive COUNT( * ) and COUNT(col) 
aggregates can be skipped.

Current behavior:
{noformat}
SELECT * FROM emp
WHERE COALESCE(deptno, 0) NOT IN (
SELECT deptno FROM dept
){noformat}
Generates:
{noformat}
LEFT JOIN (
(SELECT COUNT as c, COUNT(deptno) as ck FROM dept) – unnecessary
CROSS JOIN (SELECT DISTINCT deptno, true as i FROM dept)
) ON condition{noformat}
Expected behavior:
{noformat}
LEFT JOIN (
SELECT DISTINCT deptno, true as i FROM dept
) ON condition 

{noformat}
With simple filter: IS NULL( i ) instead of the complex OR(=(c, 0), IS NULL( i 
)) AND OR(=(c, 0), >=(ck, c))



Performance impact:
 - Eliminates double UNION ALL execution (once for COUNT, once for join)
 - Removes two aggregate computations
 - Simplifies join predicate

The code already documents this optimization at 
[SubQueryRemoveRule.java:598-607|https://github.com/apache/calcite/blob/f2df771d86f52e15fba8f626eaeb8b0b9b944823/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L598-L607]:
{noformat}
// If keys are not null we can remove "ct" and simplify to
//
// select e.deptno,
//   case
//   when dt.i is not null then true
//   else false
//   end
// from emp as e
// left join (select distinct deptno, true as i from emp) as dt
//   on e.deptno = dt.deptno{noformat}

Proposed solution:
Conditionally generate the COUNT aggregates only when we detect we need null 
safety.


> SubQueryRemoveRule should skip NULL-safety checks when both the keys and the 
> IN subquery are NOT NULL
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-7317
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7317
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.41.0
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.42.0
>
>
> SubQueryRemoveRule.rewriteIn() currently generates COUNT aggregates for 
> NULL-safety checks in all IN subqueries, even when both the keys (left side) 
> and the subquery result are NOT NULL.
> When both are NOT NULL, the evaluation reduces to TRUE/FALSE, so no extra 
> logic is needed to account for NULL values.
> Current behavior:
> {noformat}
> SELECT e.deptno, e.deptno IN (SELECT deptno FROM emp)
> FROM emp AS e{noformat}
> Generates plan with expensive NULL-safety logic:
> {noformat}
> SELECT e.deptno,
>   CASE
>   WHEN ct.c = 0 THEN false
>   WHEN e.deptno IS NULL THEN null
>   WHEN dt.i IS NOT NULL THEN true
>   WHEN ct.ck < ct.c THEN null
>   ELSE false
>   END
> FROM emp AS e
> LEFT JOIN (
>   (SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
>   CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
>   ON e.deptno = dt.deptno{noformat}
> When the key (e.deptno) and the subquery result (deptno) are NOT NULL, the 
> expected result is to skip the COUNT aggregates and use simpler anti-join, as 
> null safety is not needed:
> {noformat}
> SELECT e.deptno,
>   CASE
>   WHEN dt.i IS NOT NULL THEN true
>   ELSE false
>   END
> FROM emp AS e
> LEFT JOIN (SELECT DISTINCT deptno, true AS i FROM emp) AS dt
>   ON e.deptno = dt.deptno {noformat}
> This has a significant performance impact, as it eliminates a CROSS JOIN with 
> aggregate computation, significantly reducing query cost.
> Existing comments already hint into this optimization 
> ([SubQueryRemoveRule.java:598-607|https://github.com/apache/calcite/blob/f2df771d86f52e15fba8f626eaeb8b0b9b944823/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L598-L607]):
> {noformat}
> // If keys are not null we can remove "ct" and simplify to
> //
> // select e.deptno,
> //   case
> //   when dt.i is not null then true
> //   else false
> //   end
> // from emp as e
> // left join (select distinct deptno, true as i from emp) as dt
> //   on e.deptno = dt.deptno{noformat}
> Proposed Solution:
> Conditionally generate the COUNT aggregates only when we detect we need null 
> safety (either the keys or subquery results are nullable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to