friendlymatthew opened a new issue, #17095:
URL: https://github.com/apache/datafusion/issues/17095

   Related to https://github.com/apache/datafusion/pull/17076
   
   Hi, I'm curious about the circular relationship between `FileScanConfig` and 
`FileSource`. Specifically, how various file sources apply projection when 
creating a file opener.
   
   When create a file stream in `FileScanConfig`, we do the following: 
https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource/src/file_scan_config.rs#L507-L514
   
   `FileScanConfig` has an inner field of `FileSource`, and a `FileSource` 
creates a file opener. In both of these steps, we pass in `FileScanConfig`, 
which feels a bit weird to me. (1. `with_projection(self)`, and 2. 
`create_file_opener(..., self, ...)`. 
   
   Looking through the various implementation of `with_projection`, we see some 
file sources do nothing other than do a deep clone:
   
   `ParquetSource`
   
https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-parquet/src/source.rs#L575-L577
   
   `JsonSource`
   
https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-json/src/source.rs#L135-L137
   
   It's not that they don't need to project columns, but the actual projection 
for these file sources occur in `create_file_opener`, hence why we pass in the 
`FileScanConfig` again. 
   
   `ParquetSource`
   
https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-parquet/src/source.rs#L461-L470
   
   `JsonSource`
   
https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-json/src/source.rs#L99-L114
   
   
   For the other file sources, the projection occurs within `with_projection`: 
   
   `CsvSource`
   
https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-csv/src/source.rs#L260-L264
   
   `AvroSource`
   
https://github.com/apache/datafusion/blob/173989cc2fb55c30cd174b520754812ea408e00b/datafusion/datasource-avro/src/source.rs#L98-L102
   
   
   It would be nice to unify this logic, and make the design/relationship 
between `FileSource` and `FileScanConfig` much simpler. One idea I had was to 
move the explicit projection into the `create_file_opener` impls, removing the 
need to call `.with_projection`. WIth 
https://github.com/apache/datafusion/pull/17076, we'd just directly call 
`file_source.create_file_opener`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to