zhuqi-lucas commented on issue #19654: URL: https://github.com/apache/datafusion/issues/19654#issuecomment-3747800779
> Hey [@xudong963](https://github.com/xudong963) 👋🏻
>
> I am working through it in [this](https://github.com/feniljain/datafusion/tree/feat-offset-pushdown) branch of my own fork, you can see a diff [here](https://github.com/apache/datafusion/compare/main...feniljain:datafusion:feat-offset-pushdown?expand=1) to get a better idea of details. To keep it in words, I was able to push down `offset` to `TableScan` in logical optimizer. It can also prune files using it in `ListingTable` now!
>
> Next, I want to work out how to use remaining offset after file pruning in `file_stream` or a similar place where I can do the whole cycle of tracking offset and reducing it by file's row count. Second part needs to be done regardless cause in case of `filters` being pushed down, we do not take `limit` and `offset` into consideration.
>
> I was planning of opening a draft POC PR to get feedback on after completing `file_stream` integration and doing some manual tests at the very least 😅
>
> I am guessing by design side you mean how to do it in multi-file + filters scenario? For using it in pruning in `ListingTable` it didn't seem that hard as we get a single flat stream of files, and I could stop skipping using offset once the row count is achieved. I am not exactly sure how would it work at `DataSource` level yet. If you have any ideas, I would love to hear them out 😄

Thanks @feniljain! I'm particularly interested in offset pushdown **combined with filters** - this could provide huge performance gains for a common pattern.

**Key insight**: When we can prove through row group statistics that a filter will NOT filter out any rows (e.g., `date >= '1900-01-01'` where all data is newer), we can safely use row group row counts to compute offset skipping.

**Our use case**:
- Time-sorted data (ORDER BY matches storage order)
- Filter: `date > '2020-01-01'` (effectively a no-op, all data satisfies it)
- Offset: 1,000,000
- Row group size: 65,536

**Current behavior**: Read 1,000,000+ rows, then discard them.

**Optimized behavior**:
1. Check each row group's min/max statistics.
2. If `min(date) >= '2020-01-01'`, the entire row group passes the filter.
3. Use row counts to skip ~15 complete row groups (15 × 65,536 = 983,040 rows).
4. Only read from row group 16, skipping the remaining 16,960 rows inside it.
5. Total rows read: ~65,536 instead of 1,000,000+.

This is safe because we can **prove** the filter doesn't reduce row counts for those row groups. A rough sketch of the counting logic is below.

What's your opinion about it? Thanks!
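To make the row-group skipping step concrete, here is a minimal Rust sketch of the bookkeeping only - this is not DataFusion's API; `RowGroupStats`, `OffsetPruning`, and `prune_with_offset` are hypothetical names, and dates are modeled as plain integers (days since epoch):

```rust
// Sketch only: skip whole row groups via statistics while the filter
// provably keeps every row of the group, consuming OFFSET as we go.

/// Hypothetical stand-in for per-row-group statistics.
struct RowGroupStats {
    row_count: usize,
    /// Minimum of the filter column in this row group (days since epoch).
    min_date: i64,
}

/// Where to start reading and how many rows still need skipping there.
struct OffsetPruning {
    first_row_group: usize,
    remaining_offset: usize,
}

/// Skip row groups only while `min_date >= cutoff` proves the filter
/// `date >= cutoff` passes all rows; otherwise stop and let normal
/// evaluation handle the rest.
fn prune_with_offset(stats: &[RowGroupStats], cutoff: i64, mut offset: usize) -> OffsetPruning {
    for (idx, rg) in stats.iter().enumerate() {
        let filter_keeps_all_rows = rg.min_date >= cutoff;
        if filter_keeps_all_rows && offset >= rg.row_count {
            // Entire row group is consumed by OFFSET: skip without reading.
            offset -= rg.row_count;
        } else {
            // Proof fails or the offset ends inside this row group.
            return OffsetPruning { first_row_group: idx, remaining_offset: offset };
        }
    }
    OffsetPruning { first_row_group: stats.len(), remaining_offset: offset }
}

fn main() {
    // Numbers from the example above: 65,536-row row groups, OFFSET 1,000,000,
    // and a cutoff (2020-01-01 ~ day 18,262) that every row group satisfies.
    let stats: Vec<RowGroupStats> = (0..20i64)
        .map(|i| RowGroupStats { row_count: 65_536, min_date: 18_262 + i })
        .collect();

    let pruned = prune_with_offset(&stats, 18_262, 1_000_000);
    // 15 * 65,536 = 983,040 rows skipped via statistics; 16,960 rows remain
    // to skip inside row group index 15 (i.e. the 16th row group).
    println!(
        "start at row group {}, skip {} more rows",
        pruned.first_row_group, pruned.remaining_offset
    );
}
```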
