Fly-a-Kite opened a new issue, #22489:
URL: https://github.com/apache/datafusion/issues/22489
### Describe the bug
## Summary
DataFusion `53.0.0` returns an empty result for a two-row grouped aggregate
when an inner ordered `LIMIT` is followed by an outer `ORDER BY ... OFFSET 1`.
This may be related to #22190, but the reproducer differs: it does not use a
NULL aggregate sort key or `MIN`/`MAX`. The row loss appears with a non-NULL
grouped result when an inner `ORDER BY/LIMIT` is followed by an outer `ORDER
BY/OFFSET`.
## Environment
- `datafusion==53.0.0`
- `pyarrow==24.0.0`
- Python `3.12.3`
## Reproduction
```python
#!/usr/bin/env python3
from __future__ import annotations
import datafusion
import pyarrow as pa
from datafusion import SessionContext
def register(ctx: SessionContext, name: str, rows: list[dict], schema:
pa.Schema) -> None:
batch = pa.RecordBatch.from_pylist(rows, schema=schema)
ctx.register_record_batches(name, [[batch]])
def main() -> None:
ctx = SessionContext()
register(
ctx,
"t0",
[{"id": 0}, {"id": 1}],
pa.schema([pa.field("id", pa.int64())]),
)
register(
ctx,
"t1",
[{"id": 1, "j": 1}],
pa.schema([
pa.field("id", pa.int64()),
pa.field("j", pa.int64(), nullable=True),
]),
)
base = (
"SELECT t0.id, COUNT(t0.id) AS count_id, COUNT(DISTINCT j) AS
nunique_j "
"FROM t0 LEFT JOIN t1 ON t0.id = t1.id GROUP BY t0.id"
)
control_query = (
f"SELECT * FROM ({base}) q "
"ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j
ASC NULLS LAST "
"OFFSET 1"
)
failing_query = (
f"SELECT * FROM ("
f" SELECT * FROM ({base}) q "
" ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j
DESC NULLS LAST "
" LIMIT 8"
") q2 "
"ORDER BY id DESC NULLS LAST, count_id DESC NULLS LAST, nunique_j
ASC NULLS LAST "
"OFFSET 1"
)
print(f"datafusion={getattr(datafusion, '__version__', 'unknown')}")
print(f"pyarrow={pa.__version__}")
control = ctx.sql(control_query).to_pandas()
failing = ctx.sql(failing_query).to_pandas()
print("control query result:")
print(control)
print("query with inner ORDER BY/LIMIT result:")
print(failing)
assert len(control) == 1 and int(control.iloc[0]["id"]) == 0
assert len(failing) == 1 and int(failing.iloc[0]["id"]) == 0
if __name__ == "__main__":
main()
```
## Expected behavior
Both queries should return the second grouped row after the final ordering
and `OFFSET 1`:
```text
id=0, count_id=1, nunique_j=0
```
## Actual behavior
The control query returns the expected row, but the query with the inner
`ORDER BY ... LIMIT 8` returns an empty result.
Observed output:
```text
control query result:
id count_id nunique_j
0 0 1 0
query with inner ORDER BY/LIMIT result:
Empty DataFrame
Columns: [id, count_id, nunique_j]
Index: []
```
### To Reproduce
_No response_
### Expected behavior
_No response_
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]