[
https://issues.apache.org/jira/browse/SPARK-56654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56654:
-----------------------------------
Labels: pull-request-available (was: )
> Enforce strict Unicode validation in JSON parsing
> -------------------------------------------------
>
> Key: SPARK-56654
> URL: https://issues.apache.org/jira/browse/SPARK-56654
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 4.0.0
> Reporter: jahnavi
> Priority: Major
> Labels: pull-request-available
> Original Estimate: 96h
> Remaining Estimate: 96h
>
> The JVM execution path uses a Jackson-based parser (e.g.,
> ReaderBasedJsonParser), which is permissive. It does not strictly validate
> Unicode surrogate pairs and will accept invalid sequences such as a lone high
> surrogate (\uD835). When this happens, it silently replaces the invalid
> character with a placeholder (?) and returns a parsed result.
>
> The alternative execution path uses a stricter JSON parser such as simdjson
> (the parser used by Photon in Databricks). It explicitly checks for correct
> surrogate pairing and rejects malformed sequences. As a result, it returns
> NULL for try_parse_json or throws an error for parse_json.
>
> Because of this difference, the same input JSON can be rejected in one
> execution path and accepted (with data corruption) in another.
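
As a rough illustration of the strict behavior described above, the following
Python sketch rejects JSON whose strings contain unpaired surrogates. It is not
Spark, Jackson, or simdjson code. The names has_unpaired_surrogate,
strict_parse, and try_strict_parse are hypothetical, and Python's standard json
module stands in for a permissive parser only because it also accepts lone
surrogate escapes.

```python
import json


def has_unpaired_surrogate(value):
    """Return True if any string in a decoded JSON value holds a lone surrogate.

    json.loads combines a valid \\uD8xx\\uDCxx pair into one code point, so any
    code point left in the range U+D800..U+DFFF must have been unpaired.
    """
    if isinstance(value, str):
        return any(0xD800 <= ord(ch) <= 0xDFFF for ch in value)
    if isinstance(value, list):
        return any(has_unpaired_surrogate(item) for item in value)
    if isinstance(value, dict):
        return any(
            has_unpaired_surrogate(key) or has_unpaired_surrogate(item)
            for key, item in value.items()
        )
    return False  # numbers, booleans, and null carry no text


def strict_parse(text):
    """Hypothetical analogue of parse_json: raise on unpaired surrogates."""
    value = json.loads(text)
    if has_unpaired_surrogate(value):
        raise ValueError("unpaired UTF-16 surrogate in JSON string")
    return value


def try_strict_parse(text):
    """Hypothetical analogue of try_parse_json: return None instead of raising."""
    try:
        return strict_parse(text)
    except ValueError:  # also covers json.JSONDecodeError
        return None


if __name__ == "__main__":
    lone = '"\\uD835"'           # lone high surrogate, as in the report
    paired = '"\\uD835\\uDC00"'  # valid pair encoding U+1D400

    print(try_strict_parse(lone) is None)   # strict path rejects the input
    print(len(strict_parse(paired)))        # pair decodes to one code point
    # Permissive path: the lone surrogate is accepted, then lost on encoding.
    print(json.loads(lone).encode("utf-8", errors="replace"))
```

Validating after decoding keeps the sketch short. A real parser would more
likely reject the escape while scanning the string, before any value is built.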
--
This message was sent by Atlassian Jira
(v8.20.10#820010)