Yuan Yuan created SPARK-54102:
---------------------------------

             Summary: Spark 4.0.1 still throws "String length (20054016) 
exceeds the maximum length (20000000)" and "from_json" fails on a very large 
JSON with a jackson-core parse error
                 Key: SPARK-54102
                 URL: https://issues.apache.org/jira/browse/SPARK-54102
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 4.0.1
         Environment: pyspark 4.0.1
            Reporter: Yuan Yuan


According to JIRA *SPARK-49872* and the implementation in 
[{{JsonProtocol.scala}}|https://github.com/apache/spark/blob/29434ea766b0fc3c3bf6eaadb43a8f931133649e/core/src/main/scala/org/apache/spark/util/JsonProtocol.scala#L71],
 the Jackson maximum-string-length limit was removed in {*}4.0.1{*}. However, in our 
environment we can still reliably reproduce the following two failures:
 # When generating/processing a very large string:

{code:java}
Caused by: com.fasterxml.jackson.core.exc.StreamConstraintsException: String 
value length (20040525) exceeds the maximum allowed (20000000, from 
`StreamReadConstraints.getMaxStringLength()`){code}
 # When using {{from_json}} on a _valid_ very large single-line JSON (no 
missing comma), Jackson throws at around column {*}20,271,838{*}:

{code:java}
Caused by: com.fasterxml.jackson.core.JsonParseException: Unexpected character 
('1' (code 49)): was expecting comma to separate Object entries
 at [Source: UNKNOWN; line: 1, column: 20271838]{code}
I'm confident this is not a formatting issue in the JSON itself: if I truncate the 
payload to below column {*}20,271,838{*}, it parses successfully.
Here is my parsing code:

{code:python}
import pyspark.sql.functions as f

parsed_df = raw_df.withColumn("parsed_item", f.from_json(f.col("item"), my_schema)){code}
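
For completeness, here is a minimal self-contained sketch along the same lines. The schema, column names, and synthetic payload below are illustrative, not our production data; on 4.0.1 the oversized string value is expected to trigger the first exception above:
{code:python}
# Illustrative reproduction sketch: build a valid single-line JSON payload whose
# string value exceeds Jackson's default 20,000,000-character read limit, then
# parse it with from_json. The schema and column names are synthetic examples.
from pyspark.sql import SparkSession, functions as f
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

my_schema = StructType([StructField("payload", StringType())])

# One JSON string field of ~21 million characters (above the 20,000,000 limit)
big_json = '{"payload": "%s"}' % ("x" * 21_000_000)

raw_df = spark.createDataFrame([(big_json,)], ["item"])
parsed_df = raw_df.withColumn("parsed_item", f.from_json(f.col("item"), my_schema))
parsed_df.select(f.length(f.col("parsed_item.payload"))).show(){code}
With the generated value shrunk to below the 20,000,000-character limit, the same pipeline should complete normally, consistent with the truncation behaviour described above.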


