Neven Jovic created SPARK-38329:
-----------------------------------

             Summary: High I/O wait when Spark Structured Streaming checkpoint 
changed to EFS
                 Key: SPARK-38329
                 URL: https://issues.apache.org/jira/browse/SPARK-38329
             Project: Spark
          Issue Type: Question
          Components: EC2, Input/Output, PySpark, Structured Streaming
    Affects Versions: 2.4.6
            Reporter: Neven Jovic
         Attachments: Screenshot from 2022-02-25 14-16-11.png

I'm currently running a Spark Structured Streaming application written in 
Python (PySpark), where the source is a Kafka topic and the sink is MongoDB. I 
changed my checkpoint location to Amazon EFS, which is shared across all Spark 
workers, and after that I/O wait increased, averaging 8%.

 

!image-2022-02-25-14-42-31-904.png!

Currently about 6000 messages arrive in Kafka every second, and every once in 
a while I get a WARN message:
{quote}22/02/25 13:12:31 WARN HDFSBackedStateStoreProvider: Error cleaning up 
files for HDFSStateStoreProvider[id = (op=0,part=90),dir = 
file:/mnt/efs_max_io/spark/state/0/90] java.lang.NumberFormatException: For 
input string: ""
{quote}
I'm not sure whether that message has anything to do with the high I/O wait. 
Is this behavior expected, or is it something to be concerned about?
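
One way to narrow down the warning would be to look for stray files in the 
state directory. The sketch below is only a diagnostic idea, not Spark's actual 
cleanup code. It assumes, based on the NumberFormatException text, that the 
cleanup treats two-part file names as <version>.<kind> and parses the part 
before the dot as a number, so a name with an empty or non-numeric prefix (for 
example a hidden file left behind by the NFS client) could trigger the error. 
The function name find_unparseable_state_files is hypothetical.

```python
import os
import tempfile


def find_unparseable_state_files(state_dir):
    """List two-part file names whose prefix is not a plain integer.

    Assumption (not confirmed against the Spark source): names that split
    into exactly two parts on "." are read as "<version>.<kind>", so an
    empty or non-numeric prefix cannot be parsed as a version.
    """
    suspicious = []
    for name in sorted(os.listdir(state_dir)):
        parts = name.split(".")
        if len(parts) == 2 and not parts[0].isdigit():
            suspicious.append(name)
    return suspicious


if __name__ == "__main__":
    # Self-contained demo on a temporary directory with made-up file names.
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in ("1.delta", "2.snapshot", ".nfs0001"):
            open(os.path.join(demo_dir, name), "w").close()
        print(find_unparseable_state_files(demo_dir))
```

On a worker, the same function could be pointed at the directory named in the 
warning, e.g. /mnt/efs_max_io/spark/state/0/90, to see whether any such files 
are present there.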
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

