adriangb commented on issue #14993: URL: https://github.com/apache/datafusion/issues/14993#issuecomment-3412932134
So basically we already have a projection pushdown physical optimizer rule:
https://github.com/apache/datafusion/blob/337378ab81f6c7dab7da9000124c554d3b7ee568/datafusion/physical-plan/src/execution_plan.rs#L515-L528
(Note: I think the rule should be pushing down a `Vec<ProjectionExpr>`, not a reference to the `ProjectionExec`.)

That makes it all the way down to `DataSourceExec`, which just delegates to its `DataSource`:
https://github.com/apache/datafusion/blob/337378ab81f6c7dab7da9000124c554d3b7ee568/datafusion/datasource/src/source.rs#L316-L329

For our case the `DataSource` is always a `FileScanConfig`, which is where things hit a dead end. `FileScanConfig` checks if the projection is "simple" column access and if so generates a `Vec<usize>`:
https://github.com/apache/datafusion/blob/337378ab81f6c7dab7da9000124c554d3b7ee568/datafusion/datasource/src/file_scan_config.rs#L626-L640
(Note that I did change the signature here so that it operates on `&[ProjectionExpr]` and not `&ProjectionExec`.)

The issue is basically that each `FileSource` wants to do something different with the projection:

- CSV can't do much at all really except skip parsing some columns as it reads through the file. Ultimately it needs a `Vec<usize>` to know which columns to skip and which ones to read.
- Parquet can not only skip some columns, it can also interpret expressions like struct field access or shredded variant field access and create a specialized `ProjectionMask` to read only those leaf nodes.

The way I propose we handle this is:

- Push the responsibility of deciding if we can accept the projection or not into the `FileSource` implementation (e.g. `ParquetSource`).
- Have the `FileSource` split the projection into two parts:
  - The part that the `FileSource` promises to apply internally (it keeps a reference to that).
  - A "remainder" projection that gets bubbled back up to `DataSourceExec`, which wraps itself in a `ProjectionExec` to apply this remainder projection.
- Now `ParquetSource` has a `Vec<ProjectionExpr>` that it can do fancy things with.

Some details:

- We can make helper functions, e.g. `split_projection_into_simple_column_indices(projection: &[ProjectionExpr]) -> (Vec<usize>, Vec<ProjectionExpr>)` or something like that, which formats like CSV and Avro that just want a `Vec<usize>` can use.
- Partition columns can be handled the same way as we do in filters: we replace `col("part")` with the literal partition value for each file. This is based on a combination of https://github.com/apache/datafusion/pull/16461 and https://github.com/apache/datafusion/pull/16789. This means we can delete the entire `PartitionColumnProjector`.
- By bubbling the "remainder" projection up to `DataSourceExec` we can centralize how that gets applied to the output stream, and we avoid polluting the domain of the inner traits (`FileSource`, `FileOpener`, `DataSource`) with knowledge of `ProjectionExec` / `ExecutionPlan`.
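
To make the split concrete, here is a rough, self-contained sketch of what a helper like `split_projection_into_simple_column_indices` might do for a format that only understands column indices. The `Expr` and `ProjectionExpr` types below are toy stand-ins invented for this example, not DataFusion's real types, and the "empty remainder means nothing left to apply" convention is an assumption as well.

```rust
// Toy expression type (hypothetical; DataFusion uses `Arc<dyn PhysicalExpr>`).
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column { name: String, index: usize },
    GetField { input: Box<Expr>, field: String },
    Literal(i64),
}

// Toy stand-in for a projection expression with its output name.
#[derive(Debug, Clone, PartialEq)]
struct ProjectionExpr {
    expr: Expr,
    alias: String,
}

// Collect the file column indices an expression reads, in first-seen order.
fn collect_columns(expr: &Expr, out: &mut Vec<usize>) {
    match expr {
        Expr::Column { index, .. } => {
            if !out.contains(index) {
                out.push(*index);
            }
        }
        Expr::GetField { input, .. } => collect_columns(input, out),
        Expr::Literal(_) => {}
    }
}

// Rewrite column indices so they point into the scan output (the columns
// listed in `indices`) instead of into the full file schema.
fn remap(expr: &Expr, indices: &[usize]) -> Expr {
    match expr {
        Expr::Column { name, index } => Expr::Column {
            name: name.clone(),
            index: indices
                .iter()
                .position(|i| i == index)
                .expect("every referenced column was collected"),
        },
        Expr::GetField { input, field } => Expr::GetField {
            input: Box::new(remap(input, indices)),
            field: field.clone(),
        },
        Expr::Literal(v) => Expr::Literal(*v),
    }
}

// Returns (columns the source reads, remainder projection to apply on top).
// An empty remainder means the source's output already matches the projection.
fn split_projection_into_simple_column_indices(
    projection: &[ProjectionExpr],
) -> (Vec<usize>, Vec<ProjectionExpr>) {
    let mut indices = Vec::new();
    for p in projection {
        collect_columns(&p.expr, &mut indices);
    }
    // "Simple" here: every entry is a bare, un-renamed column and no column repeats.
    let simple = indices.len() == projection.len()
        && projection
            .iter()
            .all(|p| matches!(&p.expr, Expr::Column { name, .. } if *name == p.alias));
    if simple {
        return (indices, Vec::new());
    }
    let remainder = projection
        .iter()
        .map(|p| ProjectionExpr {
            expr: remap(&p.expr, &indices),
            alias: p.alias.clone(),
        })
        .collect();
    (indices, remainder)
}
```

In this sketch a CSV-like source would keep only the `Vec<usize>`, and `DataSourceExec` would wrap itself in a `ProjectionExec` whenever the returned remainder is non-empty. A Parquet-like source would skip the helper and inspect the expressions directly.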
