[GitHub] spark pull request #21247: [SPARK-24190] Separating JSONOptions for read

MaxGekk Sat, 12 May 2018 11:48:07 -0700

Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21247#discussion_r187780271
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala ---
    @@ -138,3 +121,40 @@ private[sql] class JSONOptions(
         factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
       }
     }
    +
    +private[sql] class JSONOptionsInRead(
    +    @transient override val parameters: CaseInsensitiveMap[String],
    +    defaultTimeZoneId: String,
    +    defaultColumnNameOfCorruptRecord: String)
    +  extends JSONOptions(parameters, defaultTimeZoneId, defaultColumnNameOfCorruptRecord) {
    +
    +  def this(
    +    parameters: Map[String, String],
    +    defaultTimeZoneId: String,
    +    defaultColumnNameOfCorruptRecord: String = "") = {
    +    this(
    +      CaseInsensitiveMap(parameters),
    +      defaultTimeZoneId,
    +      defaultColumnNameOfCorruptRecord)
    +  }
    +
    +  protected override def checkedEncoding(enc: String): String = {
    +    // The following encodings are not supported in per-line mode (multiline is false)
    +    // because they cause some problems in reading files with BOM which is supposed to
    +    // present in the files with such encodings. After splitting input files by lines,
    +    // only the first lines will have the BOM which leads to impossibility for reading
    +    // the rest lines. Besides of that, the lineSep option must have the BOM in such
    +    // encodings which can never present between lines.
    +    val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
    +    val isBlacklisted = blacklist.contains(Charset.forName(enc))
    +    require(multiLine || !isBlacklisted,
    --- End diff --
    
    There is no reason to blacklist `UTF-16` and `UTF-32` on write. I have checked the content of the JSON files written by @gatorsmile's [test](https://github.com/apache/spark/pull/21247/commits/97c4af76addc78a85ceb503a5db16f3285f18a5f). For example, for `UTF-16`:
    ```
    $ hexdump -C ...c000.json
    00000000  fe ff 00 7b 00 22 00 5f  00 31 00 22 00 3a 00 22  |...{."._.1.".:."|
    00000010  00 61 00 22 00 2c 00 22  00 5f 00 32 00 22 00 3a  |.a.".,."._.2.".:|
    00000020  00 31 00 7d 00 0a 00 7b  00 22 00 5f 00 31 00 22  |.1.}...{."._.1."|
    00000030  00 3a 00 22 00 63 00 22  00 2c 00 22 00 5f 00 32  |.:.".c.".,."._.2|
    00000040  00 22 00 3a 00 33 00 7d  00 0a                    |.".:.3.}..|
    0000004a
    ```
    The file contains the BOM `fe ff` at the beginning, as expected, and the written line separator doesn't contain a BOM (look at offsets 0x24-0x25): `00 7d` **00 0a** `00 7b`.
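
The same byte layout can be reproduced outside Spark. The following is a minimal sketch that uses only the JDK's `Charset` API. It does not call Spark's JSON writer, and the object name `BomCheck` and the helper `hex` are hypothetical names introduced for illustration. It assumes the content is encoded in one pass, as a stream writer would do, and compares that with encoding the line separator on its own.

```scala
import java.nio.charset.Charset

object BomCheck {
  // Hypothetical helper: encode `text` with the named charset and
  // return each byte as a two-digit lowercase hex string.
  def hex(text: String, charsetName: String): Seq[String] =
    text.getBytes(Charset.forName(charsetName)).toSeq.map(b => f"${b & 0xff}%02x")

  def main(args: Array[String]): Unit = {
    // Two records encoded in one pass: the BOM appears once, at the
    // start, and the separator between the records is just 00 0a.
    println(hex("{}\n{}", "UTF-16").mkString(" "))
    // prints: fe ff 00 7b 00 7d 00 0a 00 7b 00 7d

    // The separator encoded on its own picks up a BOM of its own,
    // a byte sequence that never occurs between lines of the file.
    println(hex("\n", "UTF-16").mkString(" "))
    // prints: fe ff 00 0a
  }
}
```

The second result illustrates the read-side concern described in the code comment of the diff: a separator encoded by itself in `UTF-16` carries a BOM, while the separator bytes inside a written file do not.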


