GitHub user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22503#discussion_r219495737
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala
---
@@ -212,6 +212,7 @@ class CSVOptions(
settings.setEmptyValue(emptyValueInRead)
settings.setMaxCharsPerColumn(maxCharsPerColumn)
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
+ settings.setLineSeparatorDetectionEnabled(true)
--- End diff --
Line separator auto-detection is enabled here for both multi-line and
per-line mode. I guess detecting new lines adds some overhead, and that
overhead is not needed in per-line mode. I would benchmark it in both modes
(see `CSVBenchmarks`), and if the overhead in per-line mode is significant, I
would not enable the option when `multiLine` is set to `false`.
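
If the benchmark does show a noticeable cost, the conditional version could look roughly like the sketch below. The `LineSeparatorDetectionSketch` object and its `parserSettings` helper are hypothetical and exist only to keep the example self-contained. Inside `CSVOptions` the same idea would presumably reduce to passing the existing `multiLine` flag to `setLineSeparatorDetectionEnabled`.

```scala
import com.univocity.parsers.csv.CsvParserSettings

// Hypothetical wrapper, not part of the PR: shows the toggle in isolation.
object LineSeparatorDetectionSketch {

  // Builds univocity parser settings with line separator auto-detection
  // switched on only when the input is read in multi-line mode.
  def parserSettings(multiLine: Boolean): CsvParserSettings = {
    val settings = new CsvParserSettings()
    // In per-line mode the detection is left disabled, on the assumption
    // (still to be confirmed by the benchmark) that it only adds overhead.
    settings.setLineSeparatorDetectionEnabled(multiLine)
    settings
  }
}
```

This keeps the current behaviour for `multiLine = true` and leaves per-line parsing unchanged from before the PR.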
---