Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20727#discussion_r173633462
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
@@ -42,7 +52,12 @@ class HadoopFileLinesReader(
       Array.empty)
     val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
     val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
-    val reader = new LineRecordReader()
+    val reader = if (lineSeparator != "\n") {
+      new LineRecordReader(lineSeparator.getBytes("UTF-8"))
--- End diff ---
Why should this class be responsible for converting the string separator
into an array of bytes? In particular, the restriction to a single charset is
not clear. The purpose of the class is to provide an Iterator interface over
records/lines to datasources, and it doesn't need to know about a datasource's
charset. I would not hard-code a particular charset here; instead, I would
expose the separator parameter as `Option[Array[Byte]]`, just as the
LineReader provides a constructor with `byte[] recordDelimiter`.
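
For illustration, a rough sketch of the shape I have in mind (assuming the
surrounding code stays as in this file; `PartitionedFile` and
`RecordReaderIterator` come from the same package, and the wiring here is
only a sketch, not the final implementation):

```scala
import java.io.Closeable
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, LineRecordReader}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

// Sketch: take the delimiter as raw bytes, or None to keep
// LineRecordReader's default delimiters ('\n', '\r', '\r\n').
// The charset decision then stays with the datasource.
class HadoopFileLinesReader(
    file: PartitionedFile,
    lineSeparator: Option[Array[Byte]],
    conf: Configuration) extends Iterator[Text] with Closeable {

  private val iterator = {
    val fileSplit = new FileSplit(
      new Path(new URI(file.filePath)),
      file.start,
      file.length,
      // Locality is irrelevant here; it is decided elsewhere.
      Array.empty)
    val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
    val reader = lineSeparator match {
      // LineReader already exposes a byte[] recordDelimiter constructor.
      case Some(sep) => new LineRecordReader(sep)
      // No separator given: fall back to Hadoop's default behavior.
      case None => new LineRecordReader()
    }
    reader.initialize(fileSplit, hadoopAttemptContext)
    new RecordReaderIterator(reader)
  }

  override def hasNext: Boolean = iterator.hasNext
  override def next(): Text = iterator.next()
  override def close(): Unit = iterator.close()
}
```

With that shape, a datasource that knows its charset can pass
`Some(sep.getBytes(charset))`, and this reader stays charset-agnostic.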
---