Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20727#discussion_r172656702
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
@@ -42,7 +52,12 @@ class HadoopFileLinesReader(
        Array.empty)
    val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
    val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
-   val reader = new LineRecordReader()
+   val reader = if (lineSeparator != "\n") {
+     new LineRecordReader(lineSeparator.getBytes("UTF-8"))
--- End diff ---
My suggestion is to pass an Array[Byte] into the class. If charsets other than
UTF-8 are ever supported, this place will have to change for sure, so you can
make the class tolerant of input charsets right now. For example, the JSON
reader (the Jackson JSON parser) can read JSON in any standard charset; to fix
its per-line mode, we need to support lineSep in any charset and convert it to
an array of bytes before using this class. If you restrict the charset of
lineSep to UTF-8, you just put up a wall for other datasources.
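
Roughly what I have in mind, as a sketch only (the Option[Array[Byte]]
signature is my suggestion, not what is in the PR):

    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader

    // Sketch: if HadoopFileLinesReader took lineSeparator: Option[Array[Byte]],
    // the reader construction stays charset-agnostic.
    val reader = lineSeparator match {
      case Some(sep) => new LineRecordReader(sep) // delimiter already encoded by the caller
      case None => new LineRecordReader()         // Hadoop's default \n / \r\n handling
    }

    // Caller side (hypothetical): each datasource encodes lineSep with the
    // charset it actually knows about, e.g.:
    // val sepBytes = options.lineSeparator.map(_.getBytes(charset))
    // new HadoopFileLinesReader(file, sepBytes, conf)

That way the String-to-bytes conversion happens where the charset is known,
not hardcoded to UTF-8 inside this class.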