Github user MaxGekk commented on the issue:
https://github.com/apache/spark/pull/20849
@HyukjinKwon I did an experiment in
https://github.com/MaxGekk/spark-1/pull/2 and modified [the
test](https://github.com/MaxGekk/spark-1/blob/f94d846b39ade89da24ef3e85f9721fb34e48154/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonSuite.scala#L2072-L2083).
If `UTF-16LE` is set explicitly:
```
val jsonDF = spark.read.schema(schema)
.option("lineSep", "x0d 00 0a 00")
.option("encoding", "UTF-16LE")
.json(testFile(fileName))
```
only the second line is returned correctly:
```
+---------+--------+
|firstName|lastName|
+---------+--------+
| null| null|
| Doug| Rood|
+---------+--------+
```
In the case of `UTF-16`, only the first row is returned correctly from the CSV file:
```
+---------+--------+
|firstName|lastName|
+---------+--------+
| Chris| Baird|
| null| null|
+---------+--------+
```
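This matches the JVM's charset behavior (a standalone sketch, not Spark code, assuming a tiny hand-made UTF-16LE byte sequence): the `UTF-16` decoder consumes a leading BOM and picks the byte order from it, while `UTF-16LE` keeps the BOM as a stray `U+FEFF` in the decoded text. Since only the first chunk of the file carries the BOM, decoding each line independently with `UTF-16` can work for line 1 and fail for the rest:
```
// A UTF-16LE encoding of "a" preceded by the little-endian BOM 0xFF 0xFE
val withBom = Array(0xFF, 0xFE, 'a'.toInt, 0x00).map(_.toByte)

// "UTF-16" detects the byte order from the BOM and strips it
val auto = new String(withBom, "UTF-16")   // "a"
// "UTF-16LE" assumes the order and leaves the BOM in the string
val le = new String(withBom, "UTF-16LE")   // "\uFEFFa"
```
Without a BOM, the `UTF-16` decoder defaults to big-endian, so a BOM-less little-endian line decodes to garbage, which would explain the nulls for the second row.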
And you are right: in the case where the encoding is `UTF-16`, the BOM is added to the
delimiter:
```
val jsonDF = spark.read.schema(schema)
.option("lineSep", "\r\n")
.option("encoding", "UTF-16")
.json(testFile(fileName))
```
The `lineSeparator` parameter of `HadoopFileLinesReader` is `0xFE 0xFF 0x00
0x0D 0x00 0x0A` - BOM + `\r\n` in UTF-16BE (in the CSV file it is BOM + UTF-16LE). Even if we cut
the BOM from `lineSep`, it would still not be correct.
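Those bytes can be reproduced outside Spark (a standalone sketch, the `hex` helper is just for illustration): the JVM's `UTF-16` charset encoder prepends a big-endian BOM, unlike the endianness-specific charsets:
```
// How "\r\n" is encoded by the three UTF-16 charset names on the JVM
def hex(s: String, cs: String): String =
  s.getBytes(cs).map(b => f"0x${b & 0xFF}%02X").mkString(" ")

println(hex("\r\n", "UTF-16"))   // 0xFE 0xFF 0x00 0x0D 0x00 0x0A  (BOM + big-endian)
println(hex("\r\n", "UTF-16BE")) // 0x00 0x0D 0x00 0x0A
println(hex("\r\n", "UTF-16LE")) // 0x0D 0x00 0x0A 0x00
```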
So, there are actually 2 (or 3) problems.
Just in case, here is the same test with `lineSep` `\r\n` and explicit `UTF-16LE`:
```
val jsonDF = spark.read.schema(schema)
.option("lineSep", "\r\n")
.option("encoding", "UTF-16LE")
.json(testFile(fileName))
```
```
+---------+--------+
|firstName|lastName|
+---------+--------+
| null| null|
| Doug| Rood|
+---------+--------+
```