friendlymatthew commented on issue #15952:
URL: https://github.com/apache/datafusion/issues/15952#issuecomment-3181185291

   Hi, I've been staring at `FileScanConfig` for the past day now and had some 
thoughts about the redesign. 
   
   ## One approach: create a new struct `FileScan` over the config
   
   One thing I tried was adding a wrapper around `FileScanConfig` called 
`FileScan`. `FileScan` would impl `DataSource` and look something like: 
   
   ```rs
   struct FileScan {
       file_source: Arc<dyn FileSource>,
       config: FileScanConfig
   }
   ```
   
   This way, rather than passing the parent struct (`FileScanConfig`) into the 
child struct (`FileSource`) when doing file operations, we'd pass a sibling 
struct. And _technically_ by extracting `file_source` out of `FileScanConfig`, 
the data execution flow shouldn't touch the config. 
   
   However, I'm dubious whether this refactor will do anything meaningful. 
We're introducing another struct to be explicit about what implements 
`DataSource`, but we haven't really done anything major to `FileScanConfig`.
   
   <br>
   
   ## Another approach: let `ParquetSource`, `JsonSource`, etc directly impl 
`DataSource`
   
   From what I understand, prior versions had file-specific execution plans 
(`ParquetExec`, `JsonExec`, etc..) and the existing version was written to 
bridge similar functionality.
   
   However, it seems like each file source has pretty different 
characteristics/implementations. Now, we have to bear this abstraction cost by 
maintaining a format-agnostic config (`FileScanConfig`) with format-specific 
fields like `new_lines_in_values` or `expr_adapter_factory` that is only 
specific to 1 file source. We also introduce various call sites throughout the 
`FileScanConfig` -> `FileSource` relationship that does the same thing like 
setting the batch size/statistics/projection. 
   
   I propose we directly implement `DataSource` for all structs that impl 
`FileSource`. Each format already has most of what they need to implement 
`DataSource` already! This way, we'll at least narrow down which fields are 
shared throughout all file sources. Note, the refactor will still allow us to 
do things like https://github.com/apache/datafusion/issues/17095
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to