Re: [I] Poor performance of parquet pushdown with offset [datafusion]

via GitHub Tue, 13 Jan 2026 19:51:10 -0800


feniljain commented on issue #19654:
URL: https://github.com/apache/datafusion/issues/19654#issuecomment-3747569405


   Hey @xudong963 👋🏻
   
   I am working on this in [this](https://github.com/feniljain/datafusion/tree/feat-offset-pushdown) branch of my fork; the diff [here](https://github.com/apache/datafusion/compare/main...feniljain:datafusion:feat-offset-pushdown?expand=1) gives a better idea of the details. In short, I was able to push `offset` down to `TableScan` in the logical optimizer, and `ListingTable` can now use it to prune files!
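
   To make the file-pruning idea concrete, here is a minimal sketch; it is not the code from the branch. `FileEntry` and `prune_files_by_offset` are hypothetical names, and the sketch assumes each file carries an exact row count and that no filters are pushed down. A file with an unknown row count stops the pruning, since it cannot be skipped safely.

```rust
/// Hypothetical stand-in for a listed file; not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
struct FileEntry {
    path: String,
    /// Exact row count from file statistics, or `None` when unknown.
    num_rows: Option<usize>,
}

/// Drops leading files that are fully covered by `offset`.
/// Returns the files that still need to be read and the offset that
/// remains to be skipped inside the first of them.
fn prune_files_by_offset(files: Vec<FileEntry>, offset: usize) -> (Vec<FileEntry>, usize) {
    let mut remaining = offset;
    let mut iter = files.into_iter().peekable();

    // Look at the next file's row count without consuming the file.
    while let Some(rows) = iter.peek().map(|f| f.num_rows) {
        match rows {
            // The whole file falls inside the offset: skip it.
            Some(n) if n <= remaining => {
                remaining -= n;
                iter.next();
            }
            // Unknown count, or the offset ends inside this file: stop.
            _ => break,
        }
    }

    (iter.collect(), remaining)
}
```

   With files of 10, 20 and 30 rows and an offset of 25, the first file is dropped and an offset of 15 is left over for the scan to apply inside the second file.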
   
   Next, I want to work out how to apply the offset that remains after file pruning, in `file_stream` or a similar place where I can do the whole cycle of tracking the offset and reducing it by each file's row count. This second part needs to be done regardless, because when `filters` are pushed down we do not take `limit` and `offset` into consideration.
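
   The tracking cycle could look roughly like the sketch below. `OffsetTracker` is a hypothetical helper, not an existing DataFusion type, and it works on plain row counts rather than record batches. It assumes the counts passed in are the rows that survive any pushed-down filters, so the offset is consumed only by rows that would actually be returned.

```rust
use std::ops::Range;

/// Hypothetical helper that consumes a leftover offset across batches.
struct OffsetTracker {
    /// Rows that still have to be skipped.
    remaining: usize,
}

impl OffsetTracker {
    fn new(offset: usize) -> Self {
        Self { remaining: offset }
    }

    /// Accounts for a batch of `num_rows` rows.
    /// Returns `None` when the whole batch is skipped, otherwise the
    /// range of row indices within the batch that should be kept.
    fn advance(&mut self, num_rows: usize) -> Option<Range<usize>> {
        if self.remaining >= num_rows {
            // The batch lies entirely inside the offset.
            self.remaining -= num_rows;
            None
        } else {
            // The offset ends inside this batch: keep the tail.
            let start = self.remaining;
            self.remaining = 0;
            Some(start..num_rows)
        }
    }
}
```

   For an offset of 15 and batches of 10 rows each, the first batch is skipped, the second keeps rows `5..10`, and every later batch is kept whole.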
   
   I was planning to open a draft POC PR for feedback after completing the `file_stream` integration and doing some manual tests at the very least 😅
   
   I am guessing that by the design side you mean how to do it in the multi-file + filters scenario? Using it for pruning in `ListingTable` didn't seem that hard, as we get a single flat stream of files and I can stop skipping files once the offset's row count is reached. I am not exactly sure yet how it would work at the `DataSource` level. If you have any ideas, I would love to hear them 😄


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
