[GitHub] spark pull request #20937: [SPARK-23094][SPARK-23723][SPARK-23724][SQL] Supp...

cloud-fan Sun, 22 Apr 2018 20:43:10 -0700

Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20937#discussion_r183271101
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala ---
    @@ -86,14 +86,41 @@ private[sql] class JSONOptions(
     
       val multiLine = parameters.get("multiLine").map(_.toBoolean).getOrElse(false)
     
    +  /**
    +   * A string between two consecutive JSON records.
    +   */
       val lineSeparator: Option[String] = parameters.get("lineSep").map { sep =>
         require(sep.nonEmpty, "'lineSep' cannot be an empty string.")
         sep
       }
    -  // Note that the option 'lineSep' uses a different default value in read and write.
    -  val lineSeparatorInRead: Option[Array[Byte]] =
    -    lineSeparator.map(_.getBytes(StandardCharsets.UTF_8))
    -  // Note that JSON uses writer with UTF-8 charset. This string will be written out as UTF-8.
    +
    +  /**
    +   * Standard encoding (charset) name. For example UTF-8, UTF-16LE and UTF-32BE.
    +   * If the encoding is not specified (None), it will be detected automatically
    +   * when the multiLine option is set to `true`.
    +   */
    +  val encoding: Option[String] = parameters.get("encoding")
    +    .orElse(parameters.get("charset")).map { enc =>
    +      // The following encodings are not supported in per-line mode (multiline is false)
    +      // because they cause some problems in reading files with BOM which is supposed to
    +      // present in the files with such encodings. After splitting input files by lines,
    +      // only the first lines will have the BOM which leads to impossibility for reading
    +      // the rest lines. Besides of that, the lineSep option must have the BOM in such
    +      // encodings which can never present between lines.
    +      val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
    +      val isBlacklisted = blacklist.contains(Charset.forName(enc))
    +      require(multiLine || !isBlacklisted,
    +        s"""The ${enc} encoding must not be included in the blacklist when multiLine is disabled:
    +           | ${blacklist.mkString(", ")}""".stripMargin)
    +
    +      val forcingLineSep = !(multiLine == false && enc != "UTF-8" && lineSeparator.isEmpty)
    --- End diff --
    
    `enc != "UTF-8"`: we should not compare the strings directly, but convert them to `Charset` first.
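
A minimal sketch of what this could look like, assuming the goal is only to make the UTF-8 check insensitive to case and aliases. The `CharsetCompare` object and `isUtf8` helper are hypothetical names for illustration and are not part of the PR:

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper for illustration; not code from the PR.
object CharsetCompare {
  // Resolve the user-supplied name to a Charset before comparing.
  // Charset.forName accepts any case and registered aliases, and throws
  // for illegal or unsupported names.
  def isUtf8(enc: String): Boolean =
    Charset.forName(enc) == StandardCharsets.UTF_8

  def main(args: Array[String]): Unit = {
    // A raw string comparison treats "utf-8" as a different encoding.
    println("utf-8" != "UTF-8")
    // Comparing Charset instances recognises both spellings as UTF-8.
    println(isUtf8("utf-8"))
    println(isUtf8("UTF8"))
    println(isUtf8("UTF-16"))
  }
}
```

With a helper like this, the condition in the diff could use `!isUtf8(enc)` in place of `enc != "UTF-8"`, so that an option value such as `utf-8` is not mistaken for a non-UTF-8 encoding.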



