friendlymatthew opened a new issue, #17095: URL: https://github.com/apache/datafusion/issues/17095
Related to https://github.com/apache/datafusion/pull/17076 Hi, I'm curious about the circular relationship between `FileScanConfig` and `FileSource`. Specifically, how various file sources apply projection when creating a file opener. When create a file stream in `FileScanConfig`, we do the following: https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource/src/file_scan_config.rs#L507-L514 `FileScanConfig` has an inner field of `FileSource`, and a `FileSource` creates a file opener. In both of these steps, we pass in `FileScanConfig`, which feels a bit weird to me. (1. `with_projection(self)`, and 2. `create_file_opener(..., self, ...)`. Looking through the various implementation of `with_projection`, we see some file sources do nothing other than do a deep clone: `ParquetSource` https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-parquet/src/source.rs#L575-L577 `JsonSource` https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-json/src/source.rs#L135-L137 It's not that they don't need to project columns, but the actual projection for these file sources occur in `create_file_opener`, hence why we pass in the `FileScanConfig` again. `ParquetSource` https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-parquet/src/source.rs#L461-L470 `JsonSource` https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-json/src/source.rs#L99-L114 For the other file sources, the projection occurs within `with_projection`: `CsvSource` https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-csv/src/source.rs#L260-L264 `AvroSource` https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-avro/src/source.rs#L98-L102 It would be nice to unify this logic, and make the design/relationship between `FileSource` and `FileScanConfig` much simpler. One idea I had was to move the explicit projection into the `create_file_opener` impls, removing the need to call `.with_projection`. WIth https://github.com/apache/datafusion/pull/17076, we'd just directly call `file_source.create_file_opener`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org