Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/21247#discussion_r187780271
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
---
@@ -138,3 +121,40 @@ private[sql] class JSONOptions(
factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
}
}
+
+private[sql] class JSONOptionsInRead(
+ @transient override val parameters: CaseInsensitiveMap[String],
+ defaultTimeZoneId: String,
+ defaultColumnNameOfCorruptRecord: String)
+ extends JSONOptions(parameters, defaultTimeZoneId, defaultColumnNameOfCorruptRecord) {
+
+ def this(
+ parameters: Map[String, String],
+ defaultTimeZoneId: String,
+ defaultColumnNameOfCorruptRecord: String = "") = {
+ this(
+ CaseInsensitiveMap(parameters),
+ defaultTimeZoneId,
+ defaultColumnNameOfCorruptRecord)
+ }
+
+ protected override def checkedEncoding(enc: String): String = {
+ // The following encodings are not supported in per-line mode (multiline is false)
+ // because they cause some problems in reading files with BOM which is supposed to
+ // present in the files with such encodings. After splitting input files by lines,
+ // only the first lines will have the BOM which leads to impossibility for reading
+ // the rest lines. Besides of that, the lineSep option must have the BOM in such
+ // encodings which can never present between lines.
+ val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
+ val isBlacklisted = blacklist.contains(Charset.forName(enc))
+ require(multiLine || !isBlacklisted,
--- End diff --
There is no reason to blacklist `UTF-16` and `UTF-32` on write. I have
checked the content of the JSON files written by @gatorsmile's
[test](https://github.com/apache/spark/pull/21247/commits/97c4af76addc78a85ceb503a5db16f3285f18a5f).
For example, for `UTF-16`:
```
$ hexdump -C ...c000.json
00000000  fe ff 00 7b 00 22 00 5f  00 31 00 22 00 3a 00 22  |...{."._.1.".:."|
00000010  00 61 00 22 00 2c 00 22  00 5f 00 32 00 22 00 3a  |.a.".,."._.2.".:|
00000020  00 31 00 7d 00 0a 00 7b  00 22 00 5f 00 31 00 22  |.1.}...{."._.1."|
00000030  00 3a 00 22 00 63 00 22  00 2c 00 22 00 5f 00 32  |.:.".c.".,."._.2|
00000040  00 22 00 3a 00 33 00 7d  00 0a                    |.".:.3.}..|
0000004a
```
The file contains the BOM `fe ff` at the beginning, as expected, and the
written line separator doesn't contain a BOM (look at positions 0x24-0x25):
`00 7d` **00 0a** `00 7b`.
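
The same bytes can be reproduced outside Spark with a small sketch. It assumes the writer encodes the whole output in one pass with the JVM's `UTF-16` charset (which emits a single big-endian BOM), and the object name `Utf16BomCheck` and the two sample records are only illustrative. It also shows why the read side is different: a separator encoded on its own does pick up a BOM.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical stand-alone check, not part of the PR.
object Utf16BomCheck {
  // Render bytes as hexdump -C does: two lowercase hex digits per byte.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // Two sample records matching the dump above, each terminated by "\n".
  val records: Seq[String] = Seq("""{"_1":"a","_2":1}""", """{"_1":"c","_2":3}""")
  val content: String = records.map(_ + "\n").mkString

  // Encoding the whole content at once writes the BOM only at the start.
  val fileBytes: Array[Byte] = content.getBytes(StandardCharsets.UTF_16)

  // Encoding the separator by itself prepends a BOM to it.
  val sepBytes: Array[Byte] = "\n".getBytes(StandardCharsets.UTF_16)

  def main(args: Array[String]): Unit = {
    println(hex(fileBytes.take(2)))           // fe ff
    println(hex(fileBytes.slice(0x22, 0x28))) // 00 7d 00 0a 00 7b
    println(hex(sepBytes))                    // fe ff 00 0a
  }
}
```

So the separator between records in the written file is a plain `00 0a`, while a `lineSep` encoded separately in `UTF-16` would be `fe ff 00 0a`, which never occurs between lines.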