GitHub user HyukjinKwon commented on a diff in the pull request:
https://github.com/apache/spark/pull/20937#discussion_r180014167
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -366,6 +366,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
    * <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
    * per file</li>
+   * <li>`encoding` (by default it is not set): allows to forcibly set one of standard basic
+   * or extended charsets for input jsons. For example UTF-8, UTF-16BE, UTF-32. If the encoding
+   * is not specified (by default), it will be detected automatically.</li>
--- End diff --
> If encoding is not set, it will be detected by Jackson independently from multiline.
Jackson detects the encoding, but Spark does not handle it correctly when
`multiLine` is disabled, even with this PR, as we discussed. We found many
holes. Why are you bringing this up again?
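
To illustrate how a hole of this kind can arise, here is a minimal sketch in plain Scala. It is not Spark's code and it does not claim to reproduce any of the specific cases discussed in this PR. It assumes that the non-`multiLine` path splits the input into records on the single byte `0x0A` before the JSON parser sees it; `splitOnNewlineByte` is a hypothetical helper written only for this example.

```scala
import java.nio.charset.StandardCharsets
import scala.collection.mutable.ArrayBuffer

object Utf16LineSplitSketch {
  // Hypothetical helper (not a Spark or Hadoop API): split raw bytes on the
  // single byte 0x0A, as a reader that assumes an ASCII-compatible encoding
  // would do.
  def splitOnNewlineByte(bytes: Array[Byte]): Seq[Array[Byte]] = {
    val parts = ArrayBuffer.empty[Array[Byte]]
    var start = 0
    for (i <- bytes.indices if bytes(i) == 0x0A.toByte) {
      parts += bytes.slice(start, i)
      start = i + 1
    }
    parts += bytes.slice(start, bytes.length)
    parts.toSeq
  }

  def main(args: Array[String]): Unit = {
    // Two JSON records separated by a newline, encoded as UTF-16BE.
    // In UTF-16BE the newline is the two bytes 0x00 0x0A.
    val records = "{\"a\":1}\n{\"a\":2}"
    val utf16 = records.getBytes(StandardCharsets.UTF_16BE)

    // Splitting on 0x0A alone leaves the 0x00 half of the newline attached
    // to the first chunk, so that chunk is no longer a whole number of
    // 2-byte code units.
    splitOnNewlineByte(utf16).zipWithIndex.foreach { case (chunk, idx) =>
      println(s"chunk $idx: ${chunk.length} bytes, whole code units: ${chunk.length % 2 == 0}")
    }
  }
}
```

Under that assumption, a per-record charset detector would receive a chunk that is not valid UTF-16BE on its own, even though detection over the whole file would succeed.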