Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21247#discussion_r186283555
  
    --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala ---
    @@ -137,3 +121,40 @@ private[sql] class JSONOptions(
        factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
       }
     }
    +
    +private[sql] class JSONOptionsInRead(
    +    @transient private val parameters: CaseInsensitiveMap[String],
    +    defaultTimeZoneId: String,
    +    defaultColumnNameOfCorruptRecord: String)
    +  extends JSONOptions(parameters, defaultTimeZoneId, defaultColumnNameOfCorruptRecord) {
    +
    +  def this(
    +    parameters: Map[String, String],
    +    defaultTimeZoneId: String,
    +    defaultColumnNameOfCorruptRecord: String = "") = {
    +    this(
    +      CaseInsensitiveMap(parameters),
    +      defaultTimeZoneId,
    +      defaultColumnNameOfCorruptRecord)
    +  }
    +
    +  protected override def checkedEncoding(enc: String): String = {
    +    // The following encodings are not supported in per-line mode (multiLine is false)
    +    // because they cause problems when reading files with a BOM, which is expected to
    +    // be present in files with such encodings. After an input file is split into lines,
    +    // only the first line carries the BOM, which makes the remaining lines unreadable.
    +    // Besides that, a lineSep encoded in these charsets would contain the BOM, which
    +    // can never appear between lines.
    +    val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
    +    val isBlacklisted = blacklist.contains(Charset.forName(enc))
    +    require(multiLine || !isBlacklisted,
    +      s"""The $enc encoding must not be used when multiLine is disabled.
    +         |Blacklisted encodings: ${blacklist.mkString(", ")}""".stripMargin)
    +
    +    // In per-line mode, lineSep is mandatory for any encoding other than UTF-8.
    +    val isLineSepSatisfied = multiLine ||
    +      Charset.forName(enc) == StandardCharsets.UTF_8 || lineSeparator.isDefined
    +    require(isLineSepSatisfied, s"The lineSep option must be specified for the $enc encoding")
    --- End diff --
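
    As context for the comment in the diff above: the BOM issue can be seen directly in how the JVM encodes a separator. A minimal sketch using plain java.nio charset APIs (not Spark code):

    ```scala
    import java.nio.charset.{Charset, StandardCharsets}

    // The auto-detecting "UTF-16" charset prepends a BOM on encoding, so an
    // encoded lineSep could never match the bytes that appear between lines.
    val sep = "\n".getBytes(Charset.forName("UTF-16"))
    println(sep.map(b => f"$b%02X").mkString(" "))   // FE FF 00 0A: BOM + '\n'

    // A fixed-endianness charset adds no BOM, so an explicit lineSep is workable.
    val sepLE = "\n".getBytes(StandardCharsets.UTF_16LE)
    println(sepLE.map(b => f"$b%02X").mkString(" ")) // 0A 00
    ```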
    
    Do you mean rewriting Hadoop's LineReader to detect `\n`, `\r`, and `\r\n` for any encoding? If so, I am working on it, but I think we shouldn't restrict the writer until we remove the restriction for the reader.
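
    A rough sketch of that detection idea (a hypothetical helper, not Hadoop's actual LineReader API): pre-encode each candidate separator in the file's charset, then match the resulting byte sequences while scanning the stream:

    ```scala
    import java.nio.charset.Charset

    // Hypothetical helper: the default separators, encoded in the file's charset.
    // A reader would compare these byte sequences against the input stream.
    def encodedSeparators(charsetName: String): Seq[Array[Byte]] = {
      val charset = Charset.forName(charsetName)
      // Ordered longest first so "\r\n" is matched before a bare "\r".
      Seq("\r\n", "\r", "\n").map(_.getBytes(charset))
    }

    // e.g. for UTF-16LE: 0D 00 0A 00, then 0D 00, then 0A 00
    encodedSeparators("UTF-16LE").foreach(s => println(s.map(b => f"$b%02X").mkString(" ")))
    ```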

