adriangb opened a new issue, #14406:
URL: https://github.com/apache/datafusion/issues/14406

   ### Describe the bug
   
   An outer `limit` seems to affect the inner `limit` of a subquery: the inner limit is no longer respected.
   
   ### To Reproduce
   
   Run the following python script to create test data:
   
   ```python
   import os
   from datetime import datetime, timedelta
   
   import polars as pl
   
   # Start date
   base_date = datetime(1970, 1, 1)
   
   # Create directory structure if it doesn't exist
   os.makedirs('parquet_files', exist_ok=True)
   
   # Generate 100 files
   for i in range(100):
       # Calculate the date for this partition
       current_date = base_date + timedelta(days=i)
       partition_path = f'parquet_files/day={current_date.strftime("%Y-%m-%d")}'
   
       # Create partition directory
       os.makedirs(partition_path, exist_ok=True)
   
       # Create DataFrame with single row
       df = pl.DataFrame({'duration': [1.0]})
   
       # Write to parquet file
       df.write_parquet(f'{partition_path}/file_{i}.parquet')
   ```
   
   Now in datafusion-cli (`datafusion-cli 43.0.0` for me) run:
   
   ```sql
   with selection as (
       select *
       from 'parquet_files/*'
       limit 1
   )
   select 1 as foo
   from selection
   order by duration
   limit 1000;
   ```
   
   I get:
   
   ```
   +-----+
   | foo |
   +-----+
   | 1   |
   | 1   |
   +-----+
   2 row(s) fetched.
   ```
   
   Which is wrong! Because of the inner `limit 1`, it should only ever return 1 row.
   
   This is an MRE of a problem I found in our production stack. In real-world tests the result is not always 2x the expected rows; the count varies and seems to depend on the number of partitions used for execution. After running `SET datafusion.execution.target_partitions = 1;` the problem goes away. It also goes away without the outer `limit 1000`.
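
The partition dependence described above can be pictured with a toy model. This is only an illustrative sketch in plain Python, not DataFusion's actual execution code, and it is an assumption (not a confirmed root cause) that the extra rows come from a limit being applied per partition without a final limit across partitions. The names `local_limit` and `limited_scan` are hypothetical.

```python
from typing import Iterable, Iterator, List


def local_limit(rows: Iterable[float], fetch: int) -> Iterator[float]:
    """Yield at most `fetch` rows from a single partition."""
    for i, row in enumerate(rows):
        if i >= fetch:
            return
        yield row


def limited_scan(
    partitions: List[List[float]], fetch: int, global_limit: bool
) -> List[float]:
    """Apply a per-partition limit, then optionally a limit over the merged rows.

    Hypothetical model: with `global_limit=False` the merged output can hold
    up to `fetch` rows *per partition*, so the total depends on how many
    partitions there are.
    """
    merged = [row for part in partitions for row in local_limit(part, fetch)]
    return merged[:fetch] if global_limit else merged


# Three single-row partitions, mirroring the one-row-per-file test data.
parts = [[1.0], [1.0], [1.0]]
print(len(limited_scan(parts, fetch=1, global_limit=True)))
print(len(limited_scan(parts, fetch=1, global_limit=False)))
```

In this model, a single partition never exceeds the limit even without the final step, which would be consistent with the problem disappearing when `target_partitions` is 1. Whether DataFusion's optimizer actually drops such a step here is not established by this report.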
   
   
[parquet_files.zip](https://github.com/user-attachments/files/18630234/parquet_files.zip)
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


