connec opened a new pull request, #11533: URL: https://github.com/apache/datafusion/pull/11533
## Which issue does this PR close?

Closes #11472.

## Rationale for this change

This significantly simplifies the UX when dealing with large CSV files that must support newlines in (quoted) values.

By default, large CSV files are repartitioned into multiple parallel range scans. This is great for performance in the common case, but when a large CSV contains newlines in values, the parallel scan fails because it splits on newlines within quotes rather than on actual line terminators. With the current implementation, this behaviour can only be controlled by the session-level `datafusion.optimizer.repartition_file_scans` and `datafusion.optimizer.repartition_file_min_size` settings.

## What changes are included in this PR?

This commit introduces a `newlines_in_values` option to `CsvOptions` and plumbs it through to `CsvExec`, which includes it in the test for whether parallel execution is supported. This provides a convenient and searchable way to disable file scan repartitioning on a per-CSV basis.

I've added `newlines_in_values` using similar conventions to `has_header`, with `CsvOptions` using an `Option<bool>` and a default value coming from `datafusion::common::config::CatalogOptions`.

For now, in the interests of being surgical, I've just added a new argument to [`CsvExec::new`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.CsvExec.html#method.new), which now triggers `clippy::too_many_arguments`. Before going any further I wanted to see whether this is a good approach overall, but I'm happy to refactor this into an options struct or similar.

## Are these changes tested?

Yes – a new test has been added alongside the existing tests for file scan repartitioning in `datafusion/core/src/datasource/file_format/csv.rs`.

## Are there any user-facing changes?

- *Breaking:* Add public `datafusion::common::config::CatalogOptions::newlines_in_values: bool` field, default: `false`.
- *Breaking:* Add public `datafusion::common::config::CsvOptions::newlines_in_values: Option<bool>` field, default: `None`.
- *Breaking:* Add public `datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values: bool` field, default: `false`.
- *Breaking:* Add `newlines_in_values: bool` argument to `datafusion::datasource::physical_plan::CsvExec::new`.
- Add public `datafusion::common::config::CsvOptions::with_newlines_in_values` method.
- Add public `datafusion::datasource::file_format::csv::CsvFormat::with_newlines_in_values` method.
- Add public `datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values` method.
- Add public `datafusion::datasource::physical_plan::CsvExec::newlines_in_values` method.
- Add `newlines_in_values` to relevant proto files.
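
For illustration only, here is a minimal, dependency-free sketch of the two ideas described above: why treating every newline as a record boundary miscounts records when quoted values contain newlines, and how an `Option<bool>` per-file setting can fall back to a session-level default before gating repartitioning. It does not use DataFusion. The names `CsvOpts`, `resolve_newlines_in_values`, `can_repartition`, `count_records_naive` and `count_records_quoted` are hypothetical stand-ins, and the real check in `CsvExec` may consider more conditions than shown here.

```rust
/// Hypothetical stand-in for the per-file option (`None` = use the session default).
#[derive(Default)]
struct CsvOpts {
    newlines_in_values: Option<bool>,
}

/// Resolve the per-file option against a session-level default,
/// mirroring the `Option<bool>` + default convention described for `has_header`.
fn resolve_newlines_in_values(opts: &CsvOpts, session_default: bool) -> bool {
    opts.newlines_in_values.unwrap_or(session_default)
}

/// Simplified gate (an assumption, not the real `CsvExec` logic):
/// byte-range parallel scans are only allowed when values cannot contain newlines.
fn can_repartition(opts: &CsvOpts, session_default: bool) -> bool {
    !resolve_newlines_in_values(opts, session_default)
}

/// Count records by treating every '\n' as a terminator, which is what
/// picking split points on raw newlines effectively assumes.
fn count_records_naive(data: &str) -> usize {
    data.lines().count()
}

/// Count records while ignoring newlines inside double-quoted values.
/// An escaped quote (`""`) toggles the flag twice, leaving the state unchanged.
fn count_records_quoted(data: &str) -> usize {
    let mut in_quotes = false;
    let mut records = 0;
    let mut pending = false; // true once the current record has any characters
    for ch in data.chars() {
        match ch {
            '"' => {
                in_quotes = !in_quotes;
                pending = true;
            }
            '\n' if !in_quotes => {
                records += 1;
                pending = false;
            }
            _ => pending = true,
        }
    }
    // A final record without a trailing newline still counts.
    if pending {
        records += 1;
    }
    records
}
```

For an input with a header and two rows where one quoted value spans two lines, the naive count sees one record too many, while the quote-aware count matches the real number of records. That mismatch is the failure mode that disabling repartitioning for such files avoids.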