adriangb commented on issue #14993: URL: https://github.com/apache/datafusion/issues/14993#issuecomment-3467635289
I've essentially been working on https://github.com/apache/datafusion/issues/14993#issuecomment-3412932134 The big picture as I see it is: 1. We used to have each format have it's own `ExecutionPlan` (e.g. `ParquetExec`). This gave full control but also resulted in repeated boilerplate (e.g. none of these plans have children so `children()` would always be implemented as `vec![]`). 2. We redesigned to have `struct DataSourceExec`, `trait DataSource`, `trait FileSource` and `trait FileOpener` which abstracted away some of the shared boilerplate but but maybe went a bit too far e.g. by assuming that projections could be handled universally. So the essence of this change is to try to refine what lives were, in particular pushing a good chunk more of the logic for projection pushdown handling into the individual `FileSource`s. To avoid having slightly different code everywhere I would like to end up in a spot where there's essentially three cases: 1. File sources that don't support any sort of pushdown e.g. `JsonSource`. The don't need to do anything, the default implementation in `FileSource` will be to reject any pushdown. 2. File sources that support only pushing down column selection but not expressions (e.g. `CsvSource` which can skip parsing columns based on a `Vec<usize>`. For these I plan to make some helpers (essentially a `ProjectedFileOpener` wrapper that accepts an inner `FileOpener` + a remainder projection to apply on top of the column selection). This should keep the `FileSource` APIs "simple" (you either accept a projection or you don't) while not making each file source have to handle partition columns, etc. 3. File sources like `ParquetSource` (as well as `VortexSource`, etc.) that can do advanced pushdown of expressions will have their own internal handling. Because there is a lot of stuff tied to projections (`EquivalenceProperties`, etc.) this means quite a bit of code churn. I've been working on PRs to clean up rough edges on existing APIs and create abstractions to help this all come together without being disgusting. I've gotten a couple "tracer bullet" changes that successfully got pushdown working but had to comment out a bunch of tests that accessed APIs I removed, didn't handle errors, etc. I think next iteration of that I can post a PR in draft as a preview for discussion. It's hard to break a big change like this up into smaller pieces that make sense on their own along the way. The most helpful thing for now would probably be for folks to try to "grok" this plan and gain consensus on it + help out with review of these related changes. And just bear with me/us as we work through this change together. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
