Github user MaxGekk commented on a diff in the pull request:
https://github.com/apache/spark/pull/20937#discussion_r180000138
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ---
@@ -361,6 +361,15 @@ class JacksonParser(
         // For such records, all fields other than the field configured by
         // `columnNameOfCorruptRecord` are set to `null`.
         throw BadRecordException(() => recordLiteral(record), () => None, e)
+      case e: CharConversionException if options.encoding.isEmpty =>
+        val msg =
+          """Failed to parse a character. Encoding was detected automatically.
--- End diff --
Ok, speaking about this concrete exception handling: the exception with this
message is thrown ONLY when `options.encoding.isEmpty` is `true`, which means
`encoding` is not set and the actual encoding of the file was autodetected.
That is exactly what the `msg` says: `Encoding was detected automatically`.
Maybe `encoding` was detected correctly but the file contains a wrong char.
That case is covered by the first sentence: `Failed to parse a character`. The
same could happen if you set `encoding` explicitly, because you cannot
guarantee that inputs are always correct.
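To make the distinction concrete, here is a minimal, self-contained sketch of the logic I mean (the helper name and the exact message text are made up, it is not the code from this PR): the autodetection hint is produced only when the user did not set `encoding`.

```scala
import java.io.CharConversionException

// Hypothetical helper, for illustration only: it builds the error message for a
// character conversion failure depending on whether the user set `encoding`.
def describeCharError(e: CharConversionException, userEncoding: Option[String]): String =
  userEncoding match {
    case None =>
      // `encoding` is not set, so it was autodetected; suggest setting it explicitly.
      """Failed to parse a character. Encoding was detected automatically.
        |You might want to set the `encoding` option explicitly.
        |""".stripMargin + e.getMessage
    case Some(enc) =>
      // The encoding was fixed by the user, so autodetection is not involved.
      s"Failed to parse a character in input decoded as $enc: ${e.getMessage}"
  }
```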
> I think automatic detection is true only when multiline is enabled.
A wrong char can appear both in a UTF-8 file read with `multiline = false` and
in a UTF-16LE file read with `multiline = true`.
My point is that mentioning the `multiline` option in the error message
doesn't help the user solve the issue. A possible solution is to set
`encoding` explicitly, which is exactly what the message already suggests.
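For reference, a hedged sketch of what that looks like on the user side; the paths are hypothetical and the `encoding` option is the one proposed in this PR:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Encoding is autodetected; a corrupt character in the input surfaces as the
// CharConversionException handled in the diff above.
val autoDetected = spark.read
  .option("multiLine", true)
  .json("/path/to/file.json")      // hypothetical path

// Encoding set explicitly, which is what the error message recommends:
val explicit = spark.read
  .option("encoding", "UTF-16LE")
  .option("multiLine", true)
  .json("/path/to/file.json")
```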