adriangb commented on PR #17076:
URL: https://github.com/apache/datafusion/pull/17076#issuecomment-3166507414

   I think this is a positive change. The fact that there is an `Option<T>` 
that _always_ gets `expect()`ed seems like a smell that something is wrong. The 
diff is also (currently) +35 / -72, another sign that this is a good change.
   
   Generally we would like to disentangle this area of code a bit starting by 
refactoring places like this. There are a couple other ways in which 
`FileSource` get's initialized in multiple places. Off the top of my head 
`file_schema`:
   
   
https://github.com/pydantic/datafusion/blob/9cb592c70f9926b81102ee2270875fb4e19e4e1e/datafusion/datasource-parquet/src/source.rs#L553-L558
   
   Which _also_ gets pulled in from `FileScanConfig`:
   
   
https://github.com/pydantic/datafusion/blob/9cb592c70f9926b81102ee2270875fb4e19e4e1e/datafusion/datasource-parquet/src/source.rs#L532
   
   *I think* where some of this is heading is splitting `FileScanConfig` into:
   - `FileScanConfig`: holds configurations / is a builder for data shared 
across multiple data sources (only the _very_ shared things)
   - `FileScan`: does the actual execution
   By doing this refactor we can end up with something like:
   
   ```rust
   let config = FileScanConfigBuilder::new(file_schema).build();
   let source = ParquetSource::new(config.clone());  // needs access to file 
schema, for filter pushdown evaluation
   let scan = FileScan::new(source, config);  // needs access to the file 
schema for projection stuff? needs access the the files to be scanned?
   let exec = DataSourceExec::new(scan);
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to