Github user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/21247#discussion_r186284065
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala
---
@@ -137,3 +121,40 @@ private[sql] class JSONOptions(
factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS,
allowUnquotedControlChars)
}
}
+
+private[sql] class JSONOptionsInRead(
+ @transient private val parameters: CaseInsensitiveMap[String],
+ defaultTimeZoneId: String,
+ defaultColumnNameOfCorruptRecord: String)
+  extends JSONOptions(parameters, defaultTimeZoneId, defaultColumnNameOfCorruptRecord) {
+
+ def this(
+ parameters: Map[String, String],
+ defaultTimeZoneId: String,
+ defaultColumnNameOfCorruptRecord: String = "") = {
+ this(
+ CaseInsensitiveMap(parameters),
+ defaultTimeZoneId,
+ defaultColumnNameOfCorruptRecord)
+ }
+
+  protected override def checkedEncoding(enc: String): String = {
+    // The following encodings are not supported in per-line mode (multiLine is false)
+    // because they cause problems when reading files with a BOM, which is supposed
+    // to be present in files with such encodings. After splitting input files by
+    // lines, only the first line has the BOM, which makes the remaining lines
+    // unreadable. Besides that, the lineSep option would have to contain the BOM
+    // in such encodings, and the BOM can never appear between lines.
+    val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
+    val isBlacklisted = blacklist.contains(Charset.forName(enc))
+    require(multiLine || !isBlacklisted,
+      s"""The ${enc} encoding must not be included in the blacklist when multiLine is disabled:
+         | ${blacklist.mkString(", ")}""".stripMargin)
+
+    val isLineSepRequired = !(multiLine == false &&
+      Charset.forName(enc) != StandardCharsets.UTF_8 && lineSeparator.isEmpty)
+    require(isLineSepRequired, s"The lineSep option must be specified for the $enc encoding")
--- End diff --
Yea, and also I thought you were working on getting rid of the blacklisted
encodings too. Roughly, this PR makes sense as an intermediate state, since we
have different requirements in the write and read paths; however, I think we
should first try to remove the restrictions until the release becomes close,
and make the current change at the last minute only if we fail to get rid of
the restrictions.
I am a bit cautious about the current change since it's a pretty new approach
for datasources.
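
For reference, the BOM problem the diff comment describes can be demonstrated
outside Spark. The sketch below (plain Python, assuming a little-endian
platform, hence the `b"\n\x00"` separator) shows that after splitting a
UTF-16 file into lines, only the first line carries the BOM, and that a
lineSep encoded with "UTF-16" gets its own BOM baked in, so it can never
match the bytes actually separating lines in the file:

```python
text = '{"a": 1}\n{"b": 2}'
data = text.encode("utf-16")      # the codec prepends a single BOM at the start
lines = data.split(b"\n\x00")     # UTF-16LE bytes of '\n' (little-endian assumed)
boms = (b"\xff\xfe", b"\xfe\xff")

print(lines[0].startswith(boms))        # True: only the first line has the BOM
print(lines[1].startswith(boms))        # False: byte order is now ambiguous
print(lines[0].decode("utf-16"))        # {"a": 1} - decodes fine thanks to the BOM

# A separator encoded with "UTF-16" carries its own BOM, which never
# appears between lines in the file:
print("\n".encode("utf-16")[:2] in boms)  # True
```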
---