friendlymatthew commented on issue #15952: URL: https://github.com/apache/datafusion/issues/15952#issuecomment-3181185291
Hi, I've been staring at `FileScanConfig` for the past day now and had some thoughts about the redesign. ## One approach: create a new struct `FileScan` over the config One thing I tried was adding a wrapper around `FileScanConfig` called `FileScan`. `FileScan` would impl `DataSource` and look something like: ```rs struct FileScan { file_source: Arc<dyn FileSource>, config: FileScanConfig } ``` This way, rather than passing the parent struct (`FileScanConfig`) into the child struct (`FileSource`) when doing file operations, we'd pass a sibling struct. And _technically_ by extracting `file_source` out of `FileScanConfig`, the data execution flow shouldn't touch the config. However, I'm dubious whether this refactor will do anything meaningful. We're introducing another struct to be explicit about what implements `DataSource`, but we haven't really done anything major to `FileScanConfig`. <br> ## Another approach: let `ParquetSource`, `JsonSource`, etc directly impl `DataSource` From what I understand, prior versions had file-specific execution plans (`ParquetExec`, `JsonExec`, etc..) and the existing version was written to bridge similar functionality. However, it seems like each file source has pretty different characteristics/implementations. Now, we have to bear this abstraction cost by maintaining a format-agnostic config (`FileScanConfig`) with format-specific fields like `new_lines_in_values` or `expr_adapter_factory` that is only specific to 1 file source. We also introduce various call sites throughout the `FileScanConfig` -> `FileSource` relationship that does the same thing like setting the batch size/statistics/projection. I propose we directly implement `DataSource` for all structs that impl `FileSource`. Each format already has most of what they need to implement `DataSource` already! This way, we'll at least narrow down which fields are shared throughout all file sources. Note, the refactor will still allow us to do things like https://github.com/apache/datafusion/issues/17095 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org