alamb commented on issue #19654: URL: https://github.com/apache/datafusion/issues/19654#issuecomment-3761992045
> This means for each and every partition, we apply limit separately, I think this is an incorrect implementation? 😓 Do correct me if I am misunderstanding anything here. 🙈

Why do you think this is an incorrect implementation? In addition to the datasource limit, there is also a `LimitExec` somewhere above in the plan -- the datasources may return more than the requested rows.

I think at least one reason that each partition has a limit is that the datasource may not have enough rows (e.g. think `LIMIT 1000` when the file only has 500 rows -- you need to scan across all the files).

> Question 1 --> Given all claims verify, how do we want to proceed? I can get above PR till ListingTable merged, and then we can fix limit pushdown. After which I can work out offset pushdown too. 💃🏻

My suggestion is:

1. Change the pushdown filter rule to preserve both the limit and an `offset` and propagate them to the `DataSource` (right now it only has `limit`), but don't use the offset yet.
2. Update the Parquet opener to use the offset (e.g. I think you could update the `ReadPlan` that is created for each file, and potentially push the offset calculation directly into the parquet reader).

Pushing down offsets instead of just limits may be tricky to do without a big breaking API change (though maybe you could add some backwards-compatible API, `ExecutionPlan::with_limit_offset` for example).

Unfortunately I am not likely to be able to help drive this work, as I am pretty overwhelmed with review work right now.
