Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20727#discussion_r173633462
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
@@ -42,7 +52,12 @@ class HadoopFileLinesReader(
       Array.empty)
     val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
     val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
-    val reader = new LineRecordReader()
+    val reader = if (lineSeparator != "\n") {
+      new LineRecordReader(lineSeparator.getBytes("UTF-8"))
--- End diff ---
Why should this class be responsible for converting the string separator
into an array of bytes? In particular, the restriction to a single charset is
not clear. The purpose of the class is to provide an Iterator interface over
records/lines to datasources, and it doesn't need to know about a datasource's
charset. I would not hard-code a particular charset here; instead, I would
expose the separator parameter as `Option[Array[Byte]]`, just as the
LineReader provides a constructor with `byte[] recordDelimiter`.
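
For illustration, a rough sketch of the shape I have in mind (assuming the
surrounding code stays as in this file; `PartitionedFile` and
`RecordReaderIterator` come from the same package, and the wiring here is
only a sketch, not the final implementation):

```scala
import java.io.Closeable
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
import org.apache.hadoop.mapreduce.lib.input.{FileSplit, LineRecordReader}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

// Sketch: take the delimiter as raw bytes, or None to keep
// LineRecordReader's default delimiters ('\n', '\r', '\r\n').
// The charset decision then stays with the datasource.
class HadoopFileLinesReader(
    file: PartitionedFile,
    lineSeparator: Option[Array[Byte]],
    conf: Configuration) extends Iterator[Text] with Closeable {

  private val iterator = {
    val fileSplit = new FileSplit(
      new Path(new URI(file.filePath)),
      file.start,
      file.length,
      // Locality is irrelevant here; it is decided elsewhere.
      Array.empty)
    val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
    val reader = lineSeparator match {
      // LineReader already exposes a byte[] recordDelimiter constructor.
      case Some(sep) => new LineRecordReader(sep)
      // No separator given: fall back to Hadoop's default behavior.
      case None => new LineRecordReader()
    }
    reader.initialize(fileSplit, hadoopAttemptContext)
    new RecordReaderIterator(reader)
  }

  override def hasNext: Boolean = iterator.hasNext
  override def next(): Text = iterator.next()
  override def close(): Unit = iterator.close()
}
```

With that shape, a datasource that knows its charset can pass
`Some(sep.getBytes(charset))`, and this reader stays charset-agnostic.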
---