[GitHub] spark pull request #22503: [SPARK-25493] [SQL] Fix multiline crlf

MaxGekk Fri, 21 Sep 2018 06:27:18 -0700

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22503#discussion_r219495737
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala ---
    @@ -212,6 +212,7 @@ class CSVOptions(
         settings.setEmptyValue(emptyValueInRead)
         settings.setMaxCharsPerColumn(maxCharsPerColumn)
         settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER)
    +    settings.setLineSeparatorDetectionEnabled(true)
    --- End diff --
    
    With this change, line separator auto-detection is enabled in both
    multi-line and per-line mode. I guess detecting new lines adds some
    overhead, and that detection is not needed in per-line mode. I would
    benchmark it in both modes (see `CSVBenchmarks`), and if the overhead in
    per-line mode is significant, I would not enable the option when
    `multiLine` is set to `false`.
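
For illustration, the conditional variant suggested above could look roughly like the sketch below. It is not the actual `CSVOptions` code: the object `CsvSettingsSketch` and the helper `parserSettings` are hypothetical names, and only the univocity `CsvParserSettings` calls are real API. Whether the condition is worth adding still depends on the benchmark results.

```scala
import com.univocity.parsers.csv.CsvParserSettings

// Hypothetical sketch, not the real CSVOptions implementation.
object CsvSettingsSketch {
  // Builds univocity parser settings and turns on line separator
  // auto-detection only when multiLine is true.
  def parserSettings(multiLine: Boolean): CsvParserSettings = {
    val settings = new CsvParserSettings()
    // Assumption from the review comment: per-line mode does not need
    // detection, so it is skipped there to avoid possible overhead.
    settings.setLineSeparatorDetectionEnabled(multiLine)
    settings
  }
}
```

Calling `CsvSettingsSketch.parserSettings(multiLine = false)` would leave detection disabled, so per-line reads would pay nothing for a feature they do not use.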


