[jira] [Updated] (CALCITE-6778) Inconsistent null behavior with correlated queries + quantify operators

ASF GitHub Bot (Jira) Mon, 13 Jan 2025 09:33:08 -0800


     [ 
https://issues.apache.org/jira/browse/CALCITE-6778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


ASF GitHub Bot updated CALCITE-6778:
------------------------------------
    Labels: pull-request-available  (was: )

> Inconsistent null behavior with correlated queries + quantify operators
> -----------------------------------------------------------------------
>
>                 Key: CALCITE-6778
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6778
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.38.0
>            Reporter: Rafael Acevedo
>            Priority: Major
>              Labels: pull-request-available
>
> Hi all!
> I believe I've found a bug in the quantify operators. The following queries 
> should be equivalent, but yield different results:
> Query 1:
> SQL:
> {code:sql}
> --returns null, as expected
> SELECT TRUE = ALL (
>     SELECT b
>     FROM (SELECT * FROM (VALUES (TRUE), (NULL)) as x(a)) AS x1(b))
>     AS test;
> {code}
> Plan:
> {code}
> EnumerableCalc(expr#0..2=[{inputs}], expr#3=[0], expr#4=[=($t0, $t3)], 
> expr#5=[<>($t1, $t0)], expr#6=[1], expr#7=[<=($t1, $t6)], 
> expr#8=[null:BOOLEAN], expr#9=[AND($t5, $t7, $t2, $t8)], expr#10=[=($t1, 
> $t6)], expr#11=[=($t1, $t0)], expr#12=[>($t1, $t6)], expr#13=[OR($t11, 
> $t12)], expr#14=[AND($t10, $t2, $t13)], expr#15=[OR($t4, $t9, $t14)], 
> TEST=[$t15])
>   EnumerableAggregate(group=[{}], c=[COUNT()], d=[COUNT($0)], m=[MAX($0)])
>       EnumerableValues(tuples=[[{ true }, { null }]])
>       
> {code}
> Query 2:
> SQL:
> {code:sql}
> --returns true but should be null
> WITH tb as (select array(SELECT * FROM (VALUES (TRUE), (NULL)) as x(a)) as a)
> SELECT TRUE = ALL (
>     SELECT b
>     FROM UNNEST(a) AS x1(b)
> ) AS test
> FROM tb;
> {code}
> Plan:
> {code}
> EnumerableCalc(expr#0..4=[{inputs}], expr#5=[IS NULL($t4)], expr#6=[0], 
> expr#7=[=($t1, $t6)], expr#8=[OR($t5, $t7)], expr#9=[IS TRUE($t8)], 
> expr#10=[<>($t2, $t1)], expr#11=[1], expr#12=[<=($t2, $t11)], 
> expr#13=[AND($t10, $t12)], expr#14=[IS TRUE($t13)], expr#15=[null:BOOLEAN], 
> expr#16=[IS NOT TRUE($t8)], expr#17=[AND($t14, $t3, $t15, $t16)], 
> expr#18=[=($t2, $t11)], expr#19=[IS TRUE($t18)], expr#20=[IS NOT TRUE($t13)], 
> expr#21=[AND($t19, $t3, $t16, $t20)], expr#22=[OR($t9, $t17, $t21)], 
> TEST=[$t22])
>   EnumerableCorrelate(correlation=[$cor0], joinType=[left], 
> requiredColumns=[{0}])
>       EnumerableCollect(field=[x])
>             EnumerableValues(tuples=[[{ true }, { null }]])
>                 EnumerableCalc(expr#0..2=[{inputs}], expr#3=[true], 
> proj#0..3=[{exprs}])
>                       EnumerableAggregate(group=[{}], c=[COUNT() FILTER $3], 
> d=[COUNT($0) FILTER $2], m=[MIN($1) FILTER $3])
>                               EnumerableCalc(expr#0..2=[{inputs}], 
> expr#3=[0], expr#4=[=($t2, $t3)], expr#5=[1], expr#6=[=($t2, $t5)], 
> proj#0..1=[{exprs}], $g_0=[$t4], $g_1=[$t6])
>                                         EnumerableAggregate(group=[{0}], 
> groups=[[{0}, {}]], m=[MAX($0)], $g=[GROUPING($0)])
>                                                     EnumerableUncollect
>                                                                   
> EnumerableCalc(expr#0=[{inputs}], expr#1=[$cor0], expr#2=[$t1.A], A=[$t2])
>                                                                               
>     EnumerableValues(tuples=[[{ 0 }]])
>                                                                               
>     
> {code}
> Looking at the code, {{SubQueryRemoveRule}} produces different plans because 
> query 2 is correlated. [Code 
> pointer|https://github.com/apache/calcite/blob/575fc1a583b3f5b5febdee40da4c5cb46c9022b1/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L230].
> A possible cause for different values is that for correlated queries (query 2 
> in this case) {{SubQueryRemoveRule}} 
> [uses|https://github.com/apache/calcite/blob/575fc1a583b3f5b5febdee40da4c5cb46c9022b1/core/src/main/java/org/apache/calcite/rel/rules/SubQueryRemoveRule.java#L408-L409]
>  {{count(distinct *)}} and {{count(distinct <col>)}}. Non-correlated queries 
> (query 1 in this case) use {{count ( * )}} and {{count(<col>)}}, which behave 
> differently with null values. I assume there's a reason to use distinct, 
> right?
> One possible change is modifying the correlated case to use non-distinct 
> counts. I tested this change locally and no test breaks, but I'd like to get 
> some better understanding of the reasoning behind using distinct.
> Can anyone help me? maybe [~nobigo]?
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (CALCITE-6778) Inconsistent null behavior with correlated queries + quantify operators

Reply via email to