Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20727#discussion_r173633462

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
@@ -42,7 +52,12 @@ class HadoopFileLinesReader(
         Array.empty)
     val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
     val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
-    val reader = new LineRecordReader()
+    val reader = if (lineSeparator != "\n") {
+      new LineRecordReader(lineSeparator.getBytes("UTF-8"))
--- End diff --

Why should this class be responsible for converting the string separator to an array of bytes? In particular, the restriction to a single charset is not clear. The purpose of the class is to provide an Iterator interface over records/lines to datasources, and this class doesn't have to know about the datasource's charset. I would not hard-code a particular charset here; instead, expose the separator parameter as `Option[Array[Byte]]`, in the same way that the LineReader provides a constructor taking `byte[] recordDelimiter`.
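The shape the reviewer is suggesting can be sketched as follows. This is a hypothetical illustration, not code from the PR: `delimiterOrDefault` stands in for the proposed `Option[Array[Byte]]`-style parameter, and the `main` method shows why baking in UTF-8 is limiting — the same `"\n"` separator encodes to different bytes under different charsets, so only the datasource (which knows its charset) can do the conversion correctly.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

public class LineSeparatorSketch {
    // Hypothetical parameter shape per the review suggestion: the caller
    // passes the record delimiter as raw bytes, or empty to keep the
    // reader's default newline handling. No charset assumption is baked in.
    static byte[] delimiterOrDefault(Optional<byte[]> recordDelimiter) {
        return recordDelimiter.orElse(new byte[] {'\n'});
    }

    public static void main(String[] args) {
        // A datasource that knows its data is UTF-16LE encodes the
        // separator itself before handing it to the reader.
        byte[] utf16Sep = "\n".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(utf16Sep.length);                              // 2 bytes, not 1
        System.out.println(delimiterOrDefault(Optional.empty()).length); // default "\n": 1 byte
    }
}
```

Under this scheme the reader stays charset-agnostic, matching the existing Hadoop `LineRecordReader(byte[] recordDelimiter)` constructor that the comment points to.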