connec opened a new pull request, #11533:
URL: https://github.com/apache/datafusion/pull/11533

   ## Which issue does this PR close?
   
   Closes #11472.
   
   ## Rationale for this change
   
   This change significantly simplifies the UX when dealing with large CSV 
files that must support newlines in (quoted) values. By default, large CSV 
files are repartitioned into multiple parallel range scans. This is great for 
performance in the common case, but when a large CSV contains newlines in 
values, the parallel scan will fail because it splits on newlines within 
quotes rather than on actual line terminators.
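
To illustrate the failure mode, here is a minimal, standard-library-only sketch. It is not DataFusion code, and the helper names are hypothetical. It assumes a range scan that starts at an arbitrary byte offset aligns itself to the next newline byte, and shows that this newline can sit inside a quoted value:

```rust
/// Hypothetical helper: position of the first '\n' at or after `start`,
/// mimicking how a range scan might align itself to a line boundary.
fn first_newline_at_or_after(data: &str, start: usize) -> Option<usize> {
    data.as_bytes()[start..]
        .iter()
        .position(|&b| b == b'\n')
        .map(|i| start + i)
}

/// Hypothetical helper: true if `offset` falls inside a double-quoted value.
/// Note that this has to count quotes from the very start of the data,
/// which a scan starting in the middle of a file has not seen.
fn inside_quotes(data: &str, offset: usize) -> bool {
    let quotes = data.as_bytes()[..offset]
        .iter()
        .filter(|&&b| b == b'"')
        .count();
    quotes % 2 == 1
}

fn main() {
    // Three records: a header, one row with an embedded newline, one plain row.
    let csv = "id,comment\n1,\"first line\nsecond line\"\n2,plain\n";

    // Pretend a partition begins at byte 15, in the middle of the quoted value.
    let boundary = first_newline_at_or_after(csv, 15).expect("newline exists");
    println!(
        "boundary at byte {boundary}, inside quotes: {}",
        inside_quotes(csv, boundary)
    );
}
```

In this sketch the chosen boundary lands inside the quoted value, so a scan starting there would treat `second line"` as the beginning of a record.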
   
   With the current implementation, this behaviour can only be controlled by 
the session-level `datafusion.optimizer.repartition_file_scans` and 
`datafusion.optimizer.repartition_file_min_size` settings.
   
   ## What changes are included in this PR?
   
   This commit introduces a `newlines_in_values` option to `CsvOptions` and 
plumbs it through to `CsvExec`, which includes it in the test for whether 
parallel execution is supported. This provides a convenient and searchable way 
to disable file scan repartitioning on a per-CSV basis.
   
   I've added `newlines_in_values` using similar conventions to `has_header`, 
with `CsvOptions` using an `Option<bool>` and a default value coming from 
`datafusion::common::config::CatalogOptions`.
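
The `Option<bool>` plus session default convention described above can be sketched roughly as follows. The types and functions here are simplified, hypothetical stand-ins rather than the actual DataFusion structs, and `other_checks_pass` is a placeholder for whatever other conditions the real check applies:

```rust
/// Hypothetical stand-in for the session-level default
/// (the role played by `CatalogOptions` in the PR).
struct CatalogDefaults {
    newlines_in_values: bool,
}

/// Hypothetical stand-in for per-CSV options.
/// `None` means "fall back to the session default".
#[derive(Default)]
struct CsvOpts {
    newlines_in_values: Option<bool>,
}

impl CsvOpts {
    /// Builder-style setter, mirroring the `with_newlines_in_values` naming.
    fn with_newlines_in_values(mut self, value: bool) -> Self {
        self.newlines_in_values = Some(value);
        self
    }
}

/// Resolve the effective setting: an explicit per-CSV value wins,
/// otherwise the session default applies.
fn resolve_newlines_in_values(opts: &CsvOpts, defaults: &CatalogDefaults) -> bool {
    opts.newlines_in_values.unwrap_or(defaults.newlines_in_values)
}

/// Parallel range scans are only considered safe when values cannot
/// contain newlines. `other_checks_pass` is an assumed placeholder.
fn supports_parallel_scan(other_checks_pass: bool, newlines_in_values: bool) -> bool {
    other_checks_pass && !newlines_in_values
}

fn main() {
    let defaults = CatalogDefaults { newlines_in_values: false };

    let unset = CsvOpts::default();
    let explicit = CsvOpts::default().with_newlines_in_values(true);

    for (name, opts) in [("unset", &unset), ("explicit", &explicit)] {
        let effective = resolve_newlines_in_values(opts, &defaults);
        println!(
            "{name}: newlines_in_values={effective}, parallel={}",
            supports_parallel_scan(true, effective)
        );
    }
}
```

Keeping the per-CSV value optional lets a table override the session default in either direction without changing behaviour for tables that never set it.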
   
   For now, in the interests of being surgical, I've just added a new argument 
to 
[`CsvExec::new`](https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/struct.CsvExec.html#method.new),
 which is now triggering `clippy::too_many_arguments`. Before going any further 
I wanted to see if this was overall a good approach, but I'm happy to refactor 
this into an options struct or similar.
   
   ## Are these changes tested?
   
   Yes – a new test has been added alongside the existing tests for file scan 
repartitioning in `datafusion/core/src/datasource/file_format/csv.rs`.
   
   ## Are there any user-facing changes?
   
   - *Breaking:* Add public 
`datafusion::common::config::CatalogOptions::newlines_in_values: bool` field, 
default: `false`.
   - *Breaking:* Add public 
`datafusion::common::config::CsvOptions::newlines_in_values: Option<bool>` 
field, default: `None`.
   - *Breaking:* Add public 
`datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values:
 bool` field, default: `false`.
   - *Breaking:* Add `newlines_in_values: bool` argument to 
`datafusion::datasource::physical_plan::CsvExec::new`.
   - Add public 
`datafusion::common::config::CsvOptions::with_newlines_in_values` method.
   - Add public 
`datafusion::datasource::file_format::csv::CsvFormat::with_newlines_in_values` 
method.
   - Add public 
`datafusion::datasource::file_format::options::CsvReadOptions::newlines_in_values`
 method.
   - Add public 
`datafusion::datasource::physical_plan::CsvExec::newlines_in_values` method.
   - Add `newlines_in_values` to relevant proto files.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

