[ 
https://issues.apache.org/jira/browse/CALCITE-7317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alessandro Solimando updated CALCITE-7317:
------------------------------------------
    Description: 
SubQueryRemoveRule.rewriteIn() currently generates COUNT aggregates for 
NULL-safety checks in all IN subqueries, even when both the keys (left side) 
and the subquery columns are NOT NULL.

When both are NOT NULL, the evaluation reduces to TRUE/FALSE, so no extra logic 
is needed to account for NULL values.

Current behavior:
{noformat}
SELECT e.deptno, e.deptno IN (SELECT deptno FROM emp)
FROM emp AS e{noformat}
Generates plan with expensive NULL-safety logic:
{noformat}
SELECT e.deptno,
  CASE
  WHEN ct.c = 0 THEN false
  WHEN e.deptno IS NULL THEN null
  WHEN dt.i IS NOT NULL THEN true
  WHEN ct.ck < ct.c THEN null
  ELSE false
  END
FROM emp AS e
LEFT JOIN (
  (SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
  CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
  ON e.deptno = dt.deptno{noformat}
When the key (e.deptno) and the subquery column (deptno) are NOT NULL, the 
expected result is to skip the COUNT aggregates and use simpler anti-join, as 
null safety is not needed:
{noformat}
SELECT e.deptno,
  CASE
  WHEN dt.i IS NOT NULL THEN true
  ELSE false
  END
FROM emp AS e
LEFT JOIN (SELECT DISTINCT deptno, true AS i FROM emp) AS dt
  ON e.deptno = dt.deptno {noformat}
This has a significant performance impact, as it eliminates a CROSS JOIN with 
aggregate computation, significantly reducing query cost.

Existing comments already hint into this optimization 
([SubQueryRemoveRule.java:598-607|https://github.com/apache/calcite/blob/f2df771d86f52e15fba8f626eaeb8b0b9b944823/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L598-L607]):
{noformat}
// If keys are not null we can remove "ct" and simplify to
//
// select e.deptno,
//   case
//   when dt.i is not null then true
//   else false
//   end
// from emp as e
// left join (select distinct deptno, true as i from emp) as dt
//   on e.deptno = dt.deptno{noformat}
Proposed Solution:
Conditionally generate the COUNT aggregates only when we detect we need null 
safety (either the keys or the subquery columns are nullable).

  was:
SubQueryRemoveRule.rewriteIn() currently generates COUNT aggregates for 
NULL-safety checks in all IN subqueries, even when both the keys (left side) 
and the subquery result are NOT NULL.

When both are NOT NULL, the evaluation reduces to TRUE/FALSE, so no extra logic 
is needed to account for NULL values.

Current behavior:
{noformat}
SELECT e.deptno, e.deptno IN (SELECT deptno FROM emp)
FROM emp AS e{noformat}
Generates plan with expensive NULL-safety logic:
{noformat}
SELECT e.deptno,
  CASE
  WHEN ct.c = 0 THEN false
  WHEN e.deptno IS NULL THEN null
  WHEN dt.i IS NOT NULL THEN true
  WHEN ct.ck < ct.c THEN null
  ELSE false
  END
FROM emp AS e
LEFT JOIN (
  (SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
  CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
  ON e.deptno = dt.deptno{noformat}
When the key (e.deptno) and the subquery result (deptno) are NOT NULL, the 
expected result is to skip the COUNT aggregates and use simpler anti-join, as 
null safety is not needed:
{noformat}
SELECT e.deptno,
  CASE
  WHEN dt.i IS NOT NULL THEN true
  ELSE false
  END
FROM emp AS e
LEFT JOIN (SELECT DISTINCT deptno, true AS i FROM emp) AS dt
  ON e.deptno = dt.deptno {noformat}
This has a significant performance impact, as it eliminates a CROSS JOIN with 
aggregate computation, significantly reducing query cost.


Existing comments already hint into this optimization 
([SubQueryRemoveRule.java:598-607|https://github.com/apache/calcite/blob/f2df771d86f52e15fba8f626eaeb8b0b9b944823/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L598-L607]):
{noformat}
// If keys are not null we can remove "ct" and simplify to
//
// select e.deptno,
//   case
//   when dt.i is not null then true
//   else false
//   end
// from emp as e
// left join (select distinct deptno, true as i from emp) as dt
//   on e.deptno = dt.deptno{noformat}
Proposed Solution:
Conditionally generate the COUNT aggregates only when we detect we need null 
safety (either the keys or subquery results are nullable).


> SubQueryRemoveRule should skip NULL-safety checks when both the keys and the 
> IN subquery are NOT NULL
> -----------------------------------------------------------------------------------------------------
>
>                 Key: CALCITE-7317
>                 URL: https://issues.apache.org/jira/browse/CALCITE-7317
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.41.0
>            Reporter: Alessandro Solimando
>            Assignee: Alessandro Solimando
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.42.0
>
>
> SubQueryRemoveRule.rewriteIn() currently generates COUNT aggregates for 
> NULL-safety checks in all IN subqueries, even when both the keys (left side) 
> and the subquery columns are NOT NULL.
> When both are NOT NULL, the evaluation reduces to TRUE/FALSE, so no extra 
> logic is needed to account for NULL values.
> Current behavior:
> {noformat}
> SELECT e.deptno, e.deptno IN (SELECT deptno FROM emp)
> FROM emp AS e{noformat}
> Generates plan with expensive NULL-safety logic:
> {noformat}
> SELECT e.deptno,
>   CASE
>   WHEN ct.c = 0 THEN false
>   WHEN e.deptno IS NULL THEN null
>   WHEN dt.i IS NOT NULL THEN true
>   WHEN ct.ck < ct.c THEN null
>   ELSE false
>   END
> FROM emp AS e
> LEFT JOIN (
>   (SELECT COUNT(*) AS c, COUNT(deptno) AS ck FROM emp) AS ct
>   CROSS JOIN (SELECT DISTINCT deptno, true AS i FROM emp)) AS dt
>   ON e.deptno = dt.deptno{noformat}
> When the key (e.deptno) and the subquery column (deptno) are NOT NULL, the 
> expected result is to skip the COUNT aggregates and use simpler anti-join, as 
> null safety is not needed:
> {noformat}
> SELECT e.deptno,
>   CASE
>   WHEN dt.i IS NOT NULL THEN true
>   ELSE false
>   END
> FROM emp AS e
> LEFT JOIN (SELECT DISTINCT deptno, true AS i FROM emp) AS dt
>   ON e.deptno = dt.deptno {noformat}
> This has a significant performance impact, as it eliminates a CROSS JOIN with 
> aggregate computation, significantly reducing query cost.
> Existing comments already hint into this optimization 
> ([SubQueryRemoveRule.java:598-607|https://github.com/apache/calcite/blob/f2df771d86f52e15fba8f626eaeb8b0b9b944823/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L598-L607]):
> {noformat}
> // If keys are not null we can remove "ct" and simplify to
> //
> // select e.deptno,
> //   case
> //   when dt.i is not null then true
> //   else false
> //   end
> // from emp as e
> // left join (select distinct deptno, true as i from emp) as dt
> //   on e.deptno = dt.deptno{noformat}
> Proposed Solution:
> Conditionally generate the COUNT aggregates only when we detect we need null 
> safety (either the keys or the subquery columns are nullable).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to