Yi Zhang created SPARK-41650:
--------------------------------

             Summary: json expressions much slower in optimized mode
                 Key: SPARK-41650
                 URL: https://issues.apache.org/jira/browse/SPARK-41650
             Project: Spark
          Issue Type: Bug
          Components: Spark Core, Structured Streaming
    Affects Versions: 3.2.2
            Reporter: Yi Zhang


I noticed that a Spark Structured Streaming job that reads JSON strings from 
Kafka and parses them into a struct type is much slower in Spark 3.1+ than in 
Spark 3.0. Profiling shows that in Spark 3.0 the JSON expressions spend most of 
their time evaluating subExpr, while in Spark 3.1/3.2 they spend a lot of time 
in writeField. 

I suspected this might be related to SPARK-32948, so I tried adding a bogus 
option:

from_json($"value", mySchema, Map("bogus_key" -> "bogus_value"))

This turns off the optimization, and performance is much better. For 
reference, a task processing 500k records takes about 30 seconds with the 
bogus option versus about 3 minutes without it, roughly a 6x difference. That 
is a big difference for a streaming job.
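
For anyone trying to reproduce this, the comparison described above can be 
sketched roughly as follows. This is a minimal sketch, not the actual job: the 
schema fields (id, name), the object name, and the in-memory DataFrame 
standing in for the Kafka value column are all assumptions made for 
illustration. It only prints the two query plans so they can be compared; it 
does not measure the timings reported here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Hypothetical reproduction harness; names and schema are illustrative only.
object FromJsonWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-json-workaround")
      .getOrCreate()
    import spark.implicits._

    // Assumed schema; the real mySchema from the job is not shown in this report.
    val mySchema = new StructType()
      .add("id", IntegerType)
      .add("name", StringType)

    // Static stand-in for the Kafka "value" column (already cast to string).
    val df = Seq("""{"id": 1, "name": "a"}""").toDF("value")

    // Default call: no options are passed to from_json.
    val default = df
      .select(from_json($"value", mySchema).as("data"))
      .select($"data.id", $"data.name")

    // Workaround from this report: pass a non-empty, meaningless options map.
    val workaround = df
      .select(from_json($"value", mySchema, Map("bogus_key" -> "bogus_value")).as("data"))
      .select($"data.id", $"data.name")

    // Print the parsed, analyzed, optimized, and physical plans for comparison.
    default.explain(true)
    workaround.explain(true)

    spark.stop()
  }
}
```

Running both variants against the same input and comparing task times in the 
Spark UI would be one way to check whether the difference reported here shows 
up in another environment.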



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
