adriangb commented on issue #14993:
URL: https://github.com/apache/datafusion/issues/14993#issuecomment-3467635289

   I've essentially been working on 
https://github.com/apache/datafusion/issues/14993#issuecomment-3412932134
   
   The big picture as I see it is:
   1. We used to have each format have it's own `ExecutionPlan` (e.g. 
`ParquetExec`). This gave full control but also resulted in repeated 
boilerplate (e.g. none of these plans have children so `children()` would 
always be implemented as `vec![]`).
   2. We redesigned to have `struct DataSourceExec`, `trait DataSource`, `trait 
FileSource` and `trait FileOpener` which abstracted away some of the shared 
boilerplate but but maybe went a bit too far e.g. by assuming that projections 
could be handled universally.
   
   So the essence of this change is to try to refine what lives were, in 
particular pushing a good chunk more of the logic for projection pushdown 
handling into the individual `FileSource`s. To avoid having slightly different 
code everywhere I would like to end up in a spot where there's essentially 
three cases:
   1. File sources that don't support any sort of pushdown e.g. `JsonSource`. 
The don't need to do anything, the default implementation in `FileSource` will 
be to reject any pushdown.
   2. File sources that support only pushing down column selection but not 
expressions (e.g. `CsvSource` which can skip parsing columns based on a 
`Vec<usize>`. For these I plan to make some helpers (essentially a 
`ProjectedFileOpener` wrapper that accepts an inner `FileOpener` + a remainder 
projection to apply on top of the column selection). This should keep the 
`FileSource` APIs "simple" (you either accept a projection or you don't) while 
not making each file source have to handle partition columns, etc.
   3. File sources like `ParquetSource` (as well as `VortexSource`, etc.) that 
can do advanced pushdown of expressions will have their own internal handling.
   
   Because there is a lot of stuff tied to projections 
(`EquivalenceProperties`, etc.) this means quite a bit of code churn.
   I've been working on PRs to clean up rough edges on existing APIs and create 
abstractions to help this all come together without being disgusting. I've 
gotten a couple "tracer bullet" changes that successfully got pushdown working 
but had to comment out a bunch of tests that accessed APIs I removed, didn't 
handle errors, etc. I think next iteration of that I can post a PR in draft as 
a preview for discussion. It's hard to break a big change like this up into 
smaller pieces that make sense on their own along the way.
   
   The most helpful thing for now would probably be for folks to try to "grok" 
this plan and gain consensus on it + help out with review of these related 
changes. And just bear with me/us as we work through this change together.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to