[ https://issues.apache.org/jira/browse/FLINK-19254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17871456#comment-17871456 ]

Shankar Krishna commented on FLINK-19254:
-----------------------------------------

I see the issue with JSON parsing in the underlying table libraries:
org.apache.flink.table.data.binary.BinaryStringData, which implements the
StringData interface and is used to convert row data from the underlying JSON.
It only supports UTF-8 conversion and is used by the
JsonParserToRowDataConverters class. BinaryStringData lives in the
flink-table-common module of the flink-table project and would need to be
enhanced to support UTF-16. Characters outside the ASCII range are very common
because data is often sourced from web pages (copy/paste): for example the
em-dash (U+2013), the non-breaking space (U+00A0), and other accented/umlaut
Latin characters in European languages that end up in description fields. This
should be supported; most query engines, such as Trino, handle it without issue.
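
A possible interim workaround, until the format itself handles other encodings,
is to transcode incoming bytes to UTF-8 before they reach the JSON deserializer.
A minimal sketch, assuming the source charset is known up front (the
CharsetTranscoder class name is hypothetical, not part of Flink):

{code:java}
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: re-encode message bytes as UTF-8. */
public final class CharsetTranscoder {

    private CharsetTranscoder() {}

    /** Decode with the known source charset, then re-encode as UTF-8. */
    public static byte[] toUtf8(byte[] message, Charset sourceCharset) {
        if (StandardCharsets.UTF_8.equals(sourceCharset)) {
            return message; // already valid UTF-8, nothing to do
        }
        String decoded = new String(message, sourceCharset);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }
}
{code}

Such a helper could be called inside a custom DeserializationSchema before
delegating to the JSON format, so the parser only ever sees UTF-8 bytes.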

> Invalid UTF-8 start byte exception 
> -----------------------------------
>
>                 Key: FLINK-19254
>                 URL: https://issues.apache.org/jira/browse/FLINK-19254
>             Project: Flink
>          Issue Type: Bug
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.11.0
>            Reporter: Jun Zhang
>            Priority: Minor
>              Labels: auto-deprioritized-major
>             Fix For: 2.0.0
>
>
> When reading non-UTF-8 data, JsonRowDeserializationSchema throws an exception.
> {code:java}
> Caused by: 
> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.core.JsonParseException:
>  Invalid UTF-8 start byte xxx 
> {code}
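
For reference, the quoted failure mode is reproducible with plain Jackson
whenever single-byte-encoded text (e.g. ISO-8859-1) is parsed as UTF-8; a
minimal sketch (the class name is illustrative):

{code:java}
import com.fasterxml.jackson.databind.ObjectMapper;
import java.nio.charset.StandardCharsets;

public class InvalidUtf8Repro {
    public static void main(String[] args) throws Exception {
        // U+00A0 (non-breaking space) encoded as ISO-8859-1 is the single
        // byte 0xA0, which can never start a UTF-8 sequence.
        byte[] latin1Json =
                "{\"desc\":\"a\u00A0b\"}".getBytes(StandardCharsets.ISO_8859_1);
        // Jackson parses byte input without a BOM as UTF-8 and throws:
        //   JsonParseException: Invalid UTF-8 start byte 0xa0
        new ObjectMapper().readTree(latin1Json);
    }
}
{code}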


