Fly-a-Kite opened a new issue, #22190:
URL: https://github.com/apache/datafusion/issues/22190
### Describe the bug
## Version
- Python: `3.12.3`
- Python package: `datafusion==53.0.0`
- Arrow input via `pyarrow==24.0.0`
- Platform used for reproduction: Ubuntu 24.04 x86_64, Linux
`6.17.0-23-generic`
## Minimal Reproducer
```python
import pyarrow as pa
from datafusion import SessionContext
ctx = SessionContext()
batch = pa.RecordBatch.from_arrays(
[
pa.array(["a"], type=pa.string()),
pa.array([None], type=pa.int64()),
],
schema=pa.schema(
[
pa.field("g", pa.string(), nullable=True),
pa.field("x", pa.int64(), nullable=True),
]
),
)
ctx.register_record_batches("t0", [[batch]])
base = "SELECT g, MIN(x) AS min_x FROM t0 GROUP BY g"
print(ctx.sql(f"SELECT min_x FROM ({base}) q LIMIT 20").to_pandas())
print(ctx.sql(f"SELECT min_x FROM ({base}) q ORDER BY min_x ASC NULLS LAST
LIMIT 20").to_pandas())
```
## Actual Result
Standalone reproducer output:
```text
datafusion=53.0.0
pyarrow=24.0.0
control:
min_x
0 NaN
top-k:
Empty DataFrame
Columns: [min_x]
Index: []
top-k record-batch rows=0
AssertionError: DataFusion dropped the group whose aggregate sort key is NULL
```
The control query returns the grouped row with a NULL aggregate result:
```text
min_x
0 NaN
```
The query with `ORDER BY min_x ASC NULLS LAST LIMIT 20` returns no rows:
```text
Empty DataFrame
Columns: [min_x]
Index: []
```
With two groups, one having `MIN(x) = NULL` and one having `MIN(x) = 5`, the
`ORDER BY ... LIMIT 20` query returns only the non-NULL sort-key group. The
limit is larger than the number of groups, so no grouped row should be removed.
## Expected Result
Both single-group queries should return one row. `GROUP BY g` forms a group
for `g = 'a'`; `MIN(x)` is NULL because all values in that group are NULL.
Ordering with `NULLS LAST` and a limit larger than the result cardinality
should preserve the row.
## Additional Context
I found this with a cross-backend differential test. In the generated case,
pandas, Polars eager/lazy, DuckDB, SQLite, and an independent DSL reference all
preserved the grouped row whose aggregate sort key is NULL. DataFusion was the
only backend that returned an empty result.
I also ran targeted follow-up checks for the same pattern. The issue
consistently reproduced for NULL `MIN(x)` ordered ascending and NULL `MAX(x)`
ordered descending, while nearby controls such as plain NULL column top-k, NULL
group key with non-NULL aggregate, and NULL `SUM(x)` / `AVG(x)` aggregate top-k
preserved the row.
I can share the standalone reproducer, boundary-check script, and full
generated artifacts if useful.
### To Reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]