friendlymatthew commented on issue #15952:
URL: https://github.com/apache/datafusion/issues/15952#issuecomment-3181185291
Hi, I've been staring at `FileScanConfig` for the past day now and had some
thoughts about the redesign.
## One approach: create a new struct `FileScan` over the config
One thing I tried was adding a wrapper around `FileScanConfig` called
`FileScan`. `FileScan` would impl `DataSource` and look something like:
```rs
struct FileScan {
file_source: Arc<dyn FileSource>,
config: FileScanConfig
}
```
This way, rather than passing the parent struct (`FileScanConfig`) into the
child struct (`FileSource`) when doing file operations, we'd pass a sibling
struct. And _technically_ by extracting `file_source` out of `FileScanConfig`,
the data execution flow shouldn't touch the config.
However, I'm dubious whether this refactor will do anything meaningful.
We're introducing another struct to be explicit about what implements
`DataSource`, but we haven't really done anything major to `FileScanConfig`.
<br>
## Another approach: let `ParquetSource`, `JsonSource`, etc directly impl
`DataSource`
From what I understand, prior versions had file-specific execution plans
(`ParquetExec`, `JsonExec`, etc..) and the existing version was written to
bridge similar functionality.
However, it seems like each file source has pretty different
characteristics/implementations. Now, we have to bear this abstraction cost by
maintaining a format-agnostic config (`FileScanConfig`) with format-specific
fields like `new_lines_in_values` or `expr_adapter_factory` that is only
specific to 1 file source. We also introduce various call sites throughout the
`FileScanConfig` -> `FileSource` relationship that does the same thing like
setting the batch size/statistics/projection.
I propose we directly implement `DataSource` for all structs that impl
`FileSource`. Each format already has most of what they need to implement
`DataSource` already! This way, we'll at least narrow down which fields are
shared throughout all file sources. Note, the refactor will still allow us to
do things like https://github.com/apache/datafusion/issues/17095
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]