[I] Similar to the "count-bug" case that produces incorrect results [datafusion]

via GitHub Wed, 05 Mar 2025 08:36:50 -0800


suibianwanwank opened a new issue, #15032:
URL: https://github.com/apache/datafusion/issues/15032


   ### Describe the bug
   
   For cases similar to the well-known "count-bug", the correlated subquery 
returns the incorrect result.
   
   ### To Reproduce
   
   ```
   > select * from 'data.csv';
   +------+---+
   | a    | b |
   +------+---+
   | 1    | 2 |
   | NULL | 0 |
   +------+---+
   2 row(s) fetched. 
   Elapsed 0.003 seconds.
   
   > select e.b ,(select case when max(e2.a) > 10 then 'a' else 'b' end from 
'data.csv' e2 where e2.b = e.b+1 ) from  'data.csv' e;
   +---+-------------------------------------------------------------------+
   | b | CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END |
   +---+-------------------------------------------------------------------+
   | 0 | NULL                                                              |
   | 2 | NULL                                                              |
   +---+-------------------------------------------------------------------+
   2 row(s) fetched. 
   Elapsed 0.004 seconds.
   ```
   
   ### Expected behavior
   
   Subqueries with no matches should return 'b' instead of null.
   
   ### Additional context
   
   What is count-bug: [Optimization of Nested SQL Queries Revisited 
](https://dl.acm.org/doi/pdf/10.1145/38714.38723)
   A paper describing this issue: [Parameterized ueries and Nesting 
Equivalences](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-31.pdf)
   
   > The well-known "count-bug" is not specific to the count aggregate, and 
outer-join does not solve it.
   The anomaly can occur on any aggregate function; aggregates need 
modification to distiguish empty set from null values; and optimizing out the 
outerjoin depends on utilization context 
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Similar to the "count-bug" case that produces incorrect results [datafusion]

Reply via email to