adriangb opened a new issue, #14406:
URL: https://github.com/apache/datafusion/issues/14406
### Describe the bug
An outer `LIMIT` seems to be able to override the `LIMIT` inside a subquery, producing more rows than the inner limit allows.
### To Reproduce
Run the following python script to create test data:
```python
import os
from datetime import datetime, timedelta
import polars as pl
# Start date
base_date = datetime(1970, 1, 1)
# Create directory structure if it doesn't exist
os.makedirs('parquet_files', exist_ok=True)
# Generate 100 files
for i in range(100):
    # Calculate the date for this partition
    current_date = base_date + timedelta(days=i)
    partition_path = f'parquet_files/day={current_date.strftime("%Y-%m-%d")}'
    # Create partition directory
    os.makedirs(partition_path, exist_ok=True)
    # Create DataFrame with single row
    df = pl.DataFrame({'duration': [1.0]})
    # Write to parquet file
    df.write_parquet(f'{partition_path}/file_{i}.parquet')
```
This creates 100 day-partitioned directories, each containing a single-row Parquet file. Now run the following in `datafusion-cli` (43.0.0 for me):
```sql
with selection as (
select *
from 'parquet_files/*'
limit 1
)
select 1 as foo
from selection
order by duration
limit 1000;
```
I get:
```
+-----+
| foo |
+-----+
| 1   |
| 1   |
+-----+
2 row(s) fetched.
```
This is wrong: the `selection` CTE is limited to a single row, so the query should only ever return 1 row.
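For comparison, the expected output would be:
```
+-----+
| foo |
+-----+
| 1   |
+-----+
1 row(s) fetched.
```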
This is an MRE of a problem I found in our production stack. In real-world
tests the excess is not always 2x the expected rows; it varies, and seems to
depend on the number of partitions chosen for execution. The problem goes away
after `SET datafusion.execution.target_partitions = 1;`, and it also goes away
if the outer `limit 1000` is removed.
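For reference, here is a sketch of the two workarounds described above, as I applied them in datafusion-cli; either one returns the expected single row in my testing:
```sql
-- Workaround 1: force single-partition execution, then rerun the original query
SET datafusion.execution.target_partitions = 1;

-- Workaround 2: drop the outer limit entirely
with selection as (
    select *
    from 'parquet_files/*'
    limit 1
)
select 1 as foo
from selection
order by duration;
```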
[parquet_files.zip](https://github.com/user-attachments/files/18630234/parquet_files.zip)
### Expected behavior
_No response_
### Additional context
_No response_