[ 
https://issues.apache.org/jira/browse/SPARK-56654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-56654:
-----------------------------------
    Labels: pull-request-available  (was: )

> Enforce strict Unicode validation in JSON parsing
> -------------------------------------------------
>
>                 Key: SPARK-56654
>                 URL: https://issues.apache.org/jira/browse/SPARK-56654
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 4.0.0
>            Reporter: jahnavi
>            Priority: Major
>              Labels: pull-request-available
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> The JVM execution path uses a Jackson-based parser (e.g., 
> ReaderBasedJsonParser), which is permissive. It does not strictly validate 
> Unicode surrogate pairs and will accept invalid sequences such as a lone high 
> surrogate (\uD835). When this happens, it silently replaces the invalid 
> character with a placeholder (?) and returns a parsed result.
>  
> The alternative execution path, used by Photon in Databricks, relies on a 
> stricter JSON parser such as simdjson. It explicitly checks for correct 
> surrogate pairing and rejects malformed sequences. As a result, 
> try_parse_json returns NULL and parse_json throws an error.
>  
> Because of this difference, the same input JSON can be rejected in one 
> execution path and accepted (with data corruption) in the other.
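
The difference between a permissive and a strict parser can be sketched in a few lines of Python. This is only an illustration of the surrogate-pairing check described above, not Spark, Jackson, or simdjson code. The helper names (find_invalid_surrogate, parse_json_strict, try_parse_json_strict) are hypothetical, and the sketch inspects only \uXXXX escapes. Python's own json module stands in for the permissive parser here: it accepts a lone surrogate escape but keeps it as-is instead of substituting a placeholder.

```python
import json
import re

_HEX4 = re.compile(r"[0-9a-fA-F]{4}")


def _hex4(text, start):
    """Return the code unit encoded by 4 hex digits at `start`, or None."""
    chunk = text[start:start + 4]
    return int(chunk, 16) if _HEX4.fullmatch(chunk) else None


def find_invalid_surrogate(text):
    """Return the offset of the first unpaired \\uXXXX surrogate escape, or -1."""
    i, n = 0, len(text)
    while i < n:
        if text[i] != "\\":
            i += 1
            continue
        if text[i + 1:i + 2] != "u":
            i += 2  # two-character escape such as \\ or \"; skip both characters
            continue
        code = _hex4(text, i + 2)
        if code is None:
            i += 2  # malformed \u escape; left for the JSON parser to reject
            continue
        if 0xDC00 <= code <= 0xDFFF:
            return i  # low surrogate without a preceding high surrogate
        if 0xD800 <= code <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            low = _hex4(text, i + 8) if text[i + 6:i + 8] == "\\u" else None
            if low is None or not (0xDC00 <= low <= 0xDFFF):
                return i
            i += 12  # skip the whole valid pair
            continue
        i += 6  # ordinary BMP escape
    return -1


def parse_json_strict(text):
    """Hypothetical strict variant: raise on unpaired surrogate escapes."""
    pos = find_invalid_surrogate(text)
    if pos >= 0:
        raise ValueError(f"unpaired surrogate escape at offset {pos}")
    return json.loads(text)


def try_parse_json_strict(text):
    """Hypothetical NULL-returning variant: None instead of an exception."""
    try:
        return parse_json_strict(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
```

With the input {"a": "\ud835"}, json.loads returns a result containing the lone surrogate, while try_parse_json_strict returns None and parse_json_strict raises. That mirrors the accept-versus-reject split between the two execution paths.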



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
