Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21247#discussion_r186284065
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
---
@@ -137,3 +121,40 @@ private[sql] class JSONOptions(
factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS,
allowUnquotedControlChars)
}
}
+
+private[sql] class JSONOptionsInRead(
+ @transient private val parameters: CaseInsensitiveMap[String],
+ defaultTimeZoneId: String,
+ defaultColumnNameOfCorruptRecord: String)
+  extends JSONOptions(parameters, defaultTimeZoneId, defaultColumnNameOfCorruptRecord) {
+
+ def this(
+ parameters: Map[String, String],
+ defaultTimeZoneId: String,
+ defaultColumnNameOfCorruptRecord: String = "") = {
+ this(
+ CaseInsensitiveMap(parameters),
+ defaultTimeZoneId,
+ defaultColumnNameOfCorruptRecord)
+ }
+
+  protected override def checkedEncoding(enc: String): String = {
+    // The following encodings are not supported in per-line mode (multiLine is false)
+    // because they cause problems when reading files with a BOM, which is supposed
+    // to be present in files with such encodings. After splitting input files by
+    // lines, only the first line has the BOM, which makes the remaining lines
+    // unreadable. Besides that, the lineSep option would have to contain the BOM
+    // in such encodings, and the BOM can never appear between lines.
+    val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
+    val isBlacklisted = blacklist.contains(Charset.forName(enc))
+    require(multiLine || !isBlacklisted,
+      s"""The ${enc} encoding must not be included in the blacklist when multiLine is disabled:
+         | ${blacklist.mkString(", ")}""".stripMargin)
+
+    val isLineSepRequired = !(multiLine == false &&
+      Charset.forName(enc) != StandardCharsets.UTF_8 && lineSeparator.isEmpty)
+    require(isLineSepRequired, s"The lineSep option must be specified for the $enc encoding")
--- End diff --
Yea, and also I thought you were working on getting rid of the blacklisted
encodings too. Roughly, this PR makes sense as an intermediate state, since we
have different requirements in the write and read paths; however, I think we
should first try to remove the restrictions until the release becomes close,
and make the current change at the last minute only if we fail to get rid of
the restrictions.
I am a bit cautious about the current change since it's a pretty new approach
for datasources.
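
For reference, the BOM problem the diff comment describes can be demonstrated
outside Spark. The sketch below (plain Python, assuming a little-endian
platform, hence the `b"\n\x00"` separator) shows that after splitting a
UTF-16 file into lines, only the first line carries the BOM, and that a
lineSep encoded with "UTF-16" gets its own BOM baked in, so it can never
match the bytes actually separating lines in the file:

```python
text = '{"a": 1}\n{"b": 2}'
data = text.encode("utf-16")      # the codec prepends a single BOM at the start
lines = data.split(b"\n\x00")     # UTF-16LE bytes of '\n' (little-endian assumed)
boms = (b"\xff\xfe", b"\xfe\xff")

print(lines[0].startswith(boms))        # True: only the first line has the BOM
print(lines[1].startswith(boms))        # False: byte order is now ambiguous
print(lines[0].decode("utf-16"))        # {"a": 1} - decodes fine thanks to the BOM

# A separator encoded with "UTF-16" carries its own BOM, which never
# appears between lines in the file:
print("\n".encode("utf-16")[:2] in boms)  # True
```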
---