[
https://issues.apache.org/jira/browse/FLINK-19254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871456#comment-17871456
]
Shankar Krishna commented on FLINK-19254:
-----------------------------------------
I see the issue with JSON parsing caused in the underlying table libraries:
org.apache.flink.table.data.binary.BinaryStringData, the implementation of
StringData that is used to convert row data from the underlying JSON. It only
supports UTF-8 conversion and is used by the JsonParserToRowDataConverters
class. BinaryStringData lives in the flink-table-common module under the
flink-table project and would need to be enhanced to support UTF-16. Input in
other encodings is very common because data is often sourced from web pages
(copy/paste); for example the en dash (U+2013), the non-breaking space
(U+00A0), and other accented/umlaut Latin characters in European languages
routinely end up in description fields. This should be supported, and most
query engines, such as Trino, handle it without issues.
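For illustration, here is a minimal non-Flink sketch of the failure mode. It feeds Latin-1 encoded bytes to a plain Jackson ObjectMapper (the unshaded com.fasterxml.jackson artifact is assumed here, not Flink's shaded copy). Latin-1 is chosen because the 0xA0 byte of a non-breaking space can never start a UTF-8 character, which produces exactly the error reported in this issue:
{code:java}
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.charset.StandardCharsets;

public class InvalidUtf8Repro {
    public static void main(String[] args) {
        // JSON containing a non-breaking space (U+00A0), encoded as Latin-1.
        // Byte 0xA0 is a continuation byte in UTF-8 and is invalid as a
        // start byte, so a UTF-8 parser rejects it.
        byte[] latin1 = "{\"desc\":\"a\u00A0b\"}".getBytes(StandardCharsets.ISO_8859_1);
        try {
            new ObjectMapper().readTree(latin1);
        } catch (Exception e) {
            // e.g. "Invalid UTF-8 start byte 0xa0 ..."
            System.out.println(e.getMessage());
        }
    }
}
{code}
Until the format itself supports other encodings, one possible workaround at the DataStream level is to transcode each record to UTF-8 before it reaches the JSON deserializer. This is only a sketch, not an official Flink facility; TranscodingDeserializationSchema is a hypothetical name, and only the standard DeserializationSchema interface is assumed:
{code:java}
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper: decode each record with the known source charset,
// re-encode it as UTF-8, and delegate to a UTF-8-only inner schema.
public class TranscodingDeserializationSchema<T> implements DeserializationSchema<T> {

    private final DeserializationSchema<T> inner;
    private final String sourceCharset; // Charset is not Serializable, so keep the name

    public TranscodingDeserializationSchema(DeserializationSchema<T> inner, Charset sourceCharset) {
        this.inner = inner;
        this.sourceCharset = sourceCharset.name();
    }

    @Override
    public void open(InitializationContext context) throws Exception {
        inner.open(context); // let the inner (e.g. JSON) schema initialize
    }

    @Override
    public T deserialize(byte[] message) throws IOException {
        String decoded = new String(message, Charset.forName(sourceCharset));
        return inner.deserialize(decoded.getBytes(StandardCharsets.UTF_8));
    }

    @Override
    public boolean isEndOfStream(T nextElement) {
        return inner.isEndOfStream(nextElement);
    }

    @Override
    public TypeInformation<T> getProducedType() {
        return inner.getProducedType();
    }
}
{code}
In SQL / Table API pipelines, where only the connector's format option is exposed, such a wrapper cannot be plugged in, which is why fixing this in the format itself matters.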
> Invalid UTF-8 start byte exception
> -----------------------------------
>
> Key: FLINK-19254
> URL: https://issues.apache.org/jira/browse/FLINK-19254
> Project: Flink
> Issue Type: Bug
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
> Affects Versions: 1.11.0
> Reporter: Jun Zhang
> Priority: Minor
> Labels: auto-deprioritized-major
> Fix For: 2.0.0
>
>
> When reading non-UTF-8 data, JsonRowDeserializationSchema throws an exception.
> {code:java}
> Caused by:
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException:
> Invalid UTF-8 start byte xxx
> {code}