adriangb commented on issue #14993:
URL: https://github.com/apache/datafusion/issues/14993#issuecomment-3412932134

   So basically we already have a projection pushdown physical optimizer rule:
   
   
https://github.com/apache/datafusion/blob/337378ab81f6c7dab7da9000124c554d3b7ee568/datafusion/physical-plan/src/execution_plan.rs#L515-L528
   
   (Note: I think the rule should be pushing down a `Vec<ProjectionExpr>`, not a 
reference to the `ProjectionExec`)
   
   That makes it all the way down to `DataSourceExec`, which just delegates to 
its `DataSource`:
   
   
https://github.com/apache/datafusion/blob/337378ab81f6c7dab7da9000124c554d3b7ee568/datafusion/datasource/src/source.rs#L316-L329
   
   For our case the `DataSource` is always a `FileScanConfig`, which is where 
things hit a dead end. `FileScanConfig` checks if the projection is "simple" 
column access and if so generates a `Vec<usize>`:
   
   
https://github.com/apache/datafusion/blob/337378ab81f6c7dab7da9000124c554d3b7ee568/datafusion/datasource/src/file_scan_config.rs#L626-L640
   
   (Note that I did change the signature here so that it operates on 
`&[ProjectionExpr]` and not `&ProjectionExec`)
   
   The issue is basically that each `FileSource` wants to do something 
different with the projection:
   - CSV can't do much at all really except skip parsing some columns as it 
reads through the file. Ultimately it needs a `Vec<usize>` to know which 
columns to skip and which ones to read.
   - Parquet can not only skip some columns, it can also interpret expressions 
like struct field access or shredded variant field access and create a 
specialized `ProjectionMask` to read only those leaf nodes.
   
   The way I propose we handle this is:
   - Push the responsibility of deciding if we can accept the projection or not 
into the `FileSource` implementation (e.g. `ParquetSource`).
   - Have the `FileSource` split the projection into two parts:
     - The part that the `FileSource` promises to apply internally (it keeps a 
reference to that).
     - A "remainder" projection that gets bubbled back up to `DataSourceExec` 
and `DataSourceExec` wraps itself in a `ProjectionExec` to apply this remainder 
projection.
   - Now `ParquetSource` has a `Vec<ProjectionExpr>` that it can do fancy 
things with.
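
   As a rough illustration of the split described above, here is a minimal, 
self-contained sketch. Everything in it is a toy stand-in: `Expr`, 
`ProjectionExpr`, `ProjectionSplit`, `Plan`, the `FileSource` trait with its 
`split_projection` method, and `push_projection` are hypothetical names for 
this example and not DataFusion's actual API. The Parquet-like source accepts 
the whole projection, while the CSV-like source only promises to read raw 
columns and hands field access back as a remainder.

```rust
/// Toy expression type (an assumption for this sketch, not `PhysicalExpr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// Plain column reference by index into the input schema.
    Column(usize),
    /// Field access on a struct column: (column index, field name).
    GetField(usize, String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct ProjectionExpr {
    pub expr: Expr,
    pub alias: String,
}

/// What a source answers when offered a projection.
pub struct ProjectionSplit {
    /// The part the source promises to apply internally.
    pub pushed: Vec<ProjectionExpr>,
    /// The part the caller must still apply on top of the scan output.
    pub remainder: Option<Vec<ProjectionExpr>>,
}

/// Hypothetical trait shape; the real `FileSource` trait differs.
pub trait FileSource {
    fn split_projection(&self, projection: &[ProjectionExpr]) -> ProjectionSplit;
}

pub struct ParquetLike;
pub struct CsvLike;

impl FileSource for ParquetLike {
    // Accepts everything, including field access, so nothing remains.
    fn split_projection(&self, projection: &[ProjectionExpr]) -> ProjectionSplit {
        ProjectionSplit { pushed: projection.to_vec(), remainder: None }
    }
}

impl FileSource for CsvLike {
    // Reads one raw column per expression (no deduplication, to keep the
    // sketch short) and returns a remainder expressed against that output.
    fn split_projection(&self, projection: &[ProjectionExpr]) -> ProjectionSplit {
        let mut pushed = Vec::new();
        let mut remainder = Vec::new();
        let mut needs_remainder = false;
        for (i, p) in projection.iter().enumerate() {
            let (scan_expr, rest_expr) = match &p.expr {
                Expr::Column(c) => (Expr::Column(*c), Expr::Column(i)),
                Expr::GetField(c, field) => {
                    needs_remainder = true;
                    (Expr::Column(*c), Expr::GetField(i, field.clone()))
                }
            };
            pushed.push(ProjectionExpr { expr: scan_expr, alias: p.alias.clone() });
            remainder.push(ProjectionExpr { expr: rest_expr, alias: p.alias.clone() });
        }
        ProjectionSplit { pushed, remainder: needs_remainder.then_some(remainder) }
    }
}

/// Toy plan: a scan, optionally wrapped in a projection.
#[derive(Debug, PartialEq)]
pub enum Plan {
    Scan { projection: Vec<ProjectionExpr> },
    Projection { exprs: Vec<ProjectionExpr>, input: Box<Plan> },
}

/// Plays the role of `DataSourceExec`: wrap the scan only if a remainder exists.
pub fn push_projection(source: &dyn FileSource, projection: &[ProjectionExpr]) -> Plan {
    let split = source.split_projection(projection);
    let scan = Plan::Scan { projection: split.pushed };
    match split.remainder {
        Some(exprs) => Plan::Projection { exprs, input: Box::new(scan) },
        None => scan,
    }
}
```

   In this sketch the source never sees `Plan`; only `push_projection` knows 
about wrapping, which mirrors the idea of keeping `ProjectionExec` knowledge 
out of the inner traits.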
   
   Some details:
   - We can provide helper functions, e.g. 
`split_projection_into_simple_column_indices(projection: &[ProjectionExpr]) -> 
(Vec<usize>, Vec<ProjectionExpr>)` or similar, for formats like CSV and Avro 
that only want a `Vec<usize>`.
   - Partition columns can be handled the same way as we do in filters: we 
replace `col("part")` with the literal partition value for each file. This is 
based on a combination of https://github.com/apache/datafusion/pull/16461 and 
https://github.com/apache/datafusion/pull/16789. This means we can delete the 
entire `PartitionColumnProjector`.
   - By bubbling the "remainder" projection up to `DataSourceExec` we can 
centralize how it gets applied to the output stream, and we avoid polluting 
the domain of the inner traits (`FileSource`, `FileOpener`, `DataSource`) with 
knowledge of `ProjectionExec` / `ExecutionPlan`.
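
   To make the first two details more concrete, here is a small sketch of what 
such helpers could look like. It uses a toy `Expr` enum rather than 
DataFusion's physical expressions. Only the name and signature of 
`split_projection_into_simple_column_indices` come from the suggestion above; 
the body, the `remap` helper, and `replace_partition_columns` (including its 
assumption that partition columns are indexed after the file columns) are 
hypothetical.

```rust
/// Toy expression type (an assumption for this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(usize),
    Literal(String),
    Add(Box<Expr>, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
pub struct ProjectionExpr {
    pub expr: Expr,
    pub alias: String,
}

/// Rewrite `expr` so its column references point into `indices`, appending
/// any column index that has not been seen yet.
fn remap(expr: &Expr, indices: &mut Vec<usize>) -> Expr {
    match expr {
        Expr::Column(c) => {
            let pos = match indices.iter().position(|i| i == c) {
                Some(pos) => pos,
                None => {
                    indices.push(*c);
                    indices.len() - 1
                }
            };
            Expr::Column(pos)
        }
        Expr::Literal(v) => Expr::Literal(v.clone()),
        Expr::Add(l, r) => {
            let left = remap(l, indices);
            let right = remap(r, indices);
            Expr::Add(Box::new(left), Box::new(right))
        }
    }
}

/// Returns the file column indices to read (in first-seen order) and a
/// remainder projection expressed against that reduced set of columns.
pub fn split_projection_into_simple_column_indices(
    projection: &[ProjectionExpr],
) -> (Vec<usize>, Vec<ProjectionExpr>) {
    let mut indices = Vec::new();
    let remainder = projection
        .iter()
        .map(|p| ProjectionExpr {
            expr: remap(&p.expr, &mut indices),
            alias: p.alias.clone(),
        })
        .collect();
    (indices, remainder)
}

/// Replace references to partition columns with this file's literal values.
/// Assumes partition columns sit at indices >= `num_file_columns`; panics if
/// `partition_values` is too short.
pub fn replace_partition_columns(
    expr: &Expr,
    num_file_columns: usize,
    partition_values: &[&str],
) -> Expr {
    match expr {
        Expr::Column(c) if *c >= num_file_columns => {
            Expr::Literal(partition_values[*c - num_file_columns].to_string())
        }
        Expr::Column(c) => Expr::Column(*c),
        Expr::Literal(v) => Expr::Literal(v.clone()),
        Expr::Add(l, r) => Expr::Add(
            Box::new(replace_partition_columns(l, num_file_columns, partition_values)),
            Box::new(replace_partition_columns(r, num_file_columns, partition_values)),
        ),
    }
}
```

   In this sketch the partition replacement would have to run per file and 
before the split, since the split renumbers column references. A caller could 
also skip the wrapping projection when the remainder turns out to be a plain 
identity over the selected columns; that check is left out here.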



