friendlymatthew commented on PR #17076:
URL: https://github.com/apache/datafusion/pull/17076#issuecomment-3168929125

   > Thanks @friendlymatthew
   > 
   > While I can see the rationale behind this change, I'd like to understand 
more about why this approach is definitively better than the existing 
implementation. Given that this modifies the `FileSource` it would be a 
breaking change for downstream users?
   > 
   > Please let me know the specific example where builder wouldn't work for 
you?
   
   Hi, the `FileConfigBuilder` initializing batch size as an `Option` just for 
every subsequent call site to require it as `Some` felt awkward. The 
`with_batch_size` signature reinforces this, since it only accepts a `usize` 
and immediately wraps it in `Some`. 
   
   From what I understand, batch size is only used when opening/reading files. 
Instead of requiring the user to set it in advance (and risking a panic if 
omitted), I think it's cleaner to pass it as a required parameter at the point 
where it's actually needed: 
   
   ```rs
   let config = CsvSource::new(true, b',', b'"')
       .with_comment(Some(b'#'))
       .with_schema(schema)
   //  .with_batch_size(8192) --> If I omit this, we'll panic when we create a 
file opener
       .with_projection(&scan_config);
   
   let opener = config.create_file_opener(object_store, &scan_config, 0); // 
let's just pass it in here?
   ```
   
   > The current approach using with_batch_size() follows a builder pattern 
which is flexible and composable. Given that this modifies the FileSource trait 
and affects multiple DataSource structs, what's the migration path for 
downstream users?
   
   One option would be to deprecate `with_batch_size` and point users toward 
the new parameter-passing approach. (I'm considering doing something similar 
for `with_projection`: https://github.com/apache/datafusion/issues/17095).
   
   Stepping back, my main goal is to break the tight coupling between 
`FileSourceConfig` and `FileSource`. Right now, the config struct both stores 
configuration values _and_ the underlying file source, which in turn needs to 
call back into the config-- that feels like a design smell


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to