Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20727#discussion_r172682591
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
@@ -42,7 +52,12 @@ class HadoopFileLinesReader(
       Array.empty)
     val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
     val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
-    val reader = new LineRecordReader()
+    val reader = if (lineSeparator != "\n") {
+      new LineRecordReader(lineSeparator.getBytes("UTF-8"))
--- End diff ---
I mean, it's initially a Unicode string coming in via the datasource interface, and we need to convert it to bytes at some point since `LineRecordReader` takes bytes. Do you mean adding another option for specifying the charset, or did I maybe miss something?
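For illustration, here is a minimal, self-contained sketch of the point above (the U+2028 separator value and the `SeparatorBytesDemo` name are my own, not from the PR): the same separator String encodes to different delimiter bytes depending on the charset, which is exactly what a hypothetical charset option would have to control before the bytes reach `LineRecordReader`:

```scala
import java.nio.charset.StandardCharsets

object SeparatorBytesDemo extends App {
  // The separator arrives from the datasource option as a Unicode String;
  // U+2028 (LINE SEPARATOR) here is just an illustrative value.
  val lineSeparator = "\u2028"

  // LineRecordReader's constructor takes a byte[] delimiter, so the String
  // must be encoded exactly once, and the resulting bytes depend on the charset:
  def hex(bytes: Array[Byte]): String = bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  println(hex(lineSeparator.getBytes(StandardCharsets.UTF_8)))     // e2 80 a8
  println(hex(lineSeparator.getBytes(StandardCharsets.UTF_16LE)))  // 28 20
}
```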