feniljain commented on issue #19654: URL: https://github.com/apache/datafusion/issues/19654#issuecomment-3747569405
Hey @xudong963 👋🏻

I am working through it in [this](https://github.com/feniljain/datafusion/tree/feat-offset-pushdown) branch of my own fork. You can see a diff [here](https://github.com/apache/datafusion/compare/main...feniljain:datafusion:feat-offset-pushdown?expand=1) to get a better idea of the details.

To put it in words: I was able to push `offset` down to `TableScan` in the logical optimizer, and `ListingTable` can now use it to prune files! Next, I want to work out how to apply the offset that remains after file pruning in `file_stream`, or a similar place where I can do the whole cycle of tracking the offset and reducing it by each file's row count. That second part is needed regardless, because when `filters` are pushed down we do not take `limit` and `offset` into consideration.

I was planning to open a draft POC PR for feedback after completing the `file_stream` integration and doing some manual tests at the very least 😅

By the design side, I am guessing you mean how to do this in the multi-file + filters scenario? Using the offset for pruning in `ListingTable` didn't seem that hard, since we get a single flat stream of files and I can stop skipping files once the offset's row count is reached. I am not exactly sure how it would work at the `DataSource` level yet. If you have any ideas, I would love to hear them 😄
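
To make the file-pruning idea concrete, here is a minimal Rust sketch of the "skip whole files, carry the remaining offset forward" cycle. It is not the code in the branch: `FileMeta` and `prune_files_by_offset` are hypothetical names made up for illustration, and the sketch assumes every file's row count is exact and that no filters are pushed down.

```rust
/// Hypothetical stand-in for a listed file whose exact row count is known
/// (for example from file statistics). Not a DataFusion type.
#[derive(Debug, Clone, PartialEq)]
struct FileMeta {
    path: String,
    num_rows: usize,
}

/// Skip every leading file that lies entirely inside `offset`.
///
/// Returns the files that still have to be scanned, plus the number of rows
/// that must still be skipped inside the first kept file. Assumption: row
/// counts are exact and no filter changes how many rows a file produces.
fn prune_files_by_offset(files: Vec<FileMeta>, offset: usize) -> (Vec<FileMeta>, usize) {
    let mut remaining = offset;
    let mut kept = Vec::new();
    let mut iter = files.into_iter();

    for file in iter.by_ref() {
        if file.num_rows <= remaining {
            // The whole file is consumed by the offset: drop it and
            // reduce the offset by its row count.
            remaining -= file.num_rows;
        } else {
            // This file contains the first row we need; stop skipping here.
            kept.push(file);
            break;
        }
    }

    // Every file after the first kept one is scanned as usual.
    kept.extend(iter);
    (kept, remaining)
}
```

For example, with files of 10, 20, and 30 rows and an offset of 25, the first file is skipped entirely and the remaining offset of 15 would have to be applied inside the second file, which is the part a stream-level component would handle.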
