suibianwanwank opened a new issue, #15032:
URL: https://github.com/apache/datafusion/issues/15032
### Describe the bug
For cases similar to the well-known "count-bug", the correlated subquery
returns the incorrect result.
### To Reproduce
```
> select * from 'data.csv';
+------+---+
| a | b |
+------+---+
| 1 | 2 |
| NULL | 0 |
+------+---+
2 row(s) fetched.
Elapsed 0.003 seconds.
> select e.b ,(select case when max(e2.a) > 10 then 'a' else 'b' end from
'data.csv' e2 where e2.b = e.b+1 ) from 'data.csv' e;
+---+-------------------------------------------------------------------+
| b | CASE WHEN max(e2.a) > Int64(10) THEN Utf8("a") ELSE Utf8("b") END |
+---+-------------------------------------------------------------------+
| 0 | NULL |
| 2 | NULL |
+---+-------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.004 seconds.
```
### Expected behavior
Subqueries with no matches should return 'b' instead of null.
### Additional context
What is count-bug: [Optimization of Nested SQL Queries Revisited
](https://dl.acm.org/doi/pdf/10.1145/38714.38723)
A paper describing this issue: [Parameterized ueries and Nesting
Equivalences](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2000-31.pdf)
> The well-known "count-bug" is not specific to the count aggregate, and
outer-join does not solve it.
The anomaly can occur on any aggregate function; aggregates need
modification to distiguish empty set from null values; and optimizing out the
outerjoin depends on utilization context
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]