[ https://issues.apache.org/jira/browse/SPARK-34063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282081#comment-17282081 ]
Calvin Pietersen commented on SPARK-34063:
------------------------------------------
This happened consistently every 6 days, about 8 times in a row. After
downgrading to Spark 2.4.4 and EMR 6.0.0 the issue was resolved. I haven't
compared memory dumps between Spark 2.4.4 and 3.0.0, but I'd be willing to if
someone wanted to investigate.
> Major slowdown in spark streaming after 6 days
> ----------------------------------------------
>
> Key: SPARK-34063
> URL: https://issues.apache.org/jira/browse/SPARK-34063
> Project: Spark
> Issue Type: Bug
> Components: Scheduler, Spark Core
> Affects Versions: 3.0.0
> Environment: AWS EMR 6.1.0
> Spark 3.0.0
> Kinesis
> Reporter: Calvin Pietersen
> Priority: Major
> Attachments: 2020-12-29.pdf, normal-job, slow-job
>
>
> The Spark Streaming application runs at 60s batch intervals.
> The application runs fine, processing batches in around 40s. After ~8600 batches
> (around 6 days), the application suddenly hits a wall: processing time jumps to
> 2-2.4 minutes, and it eventually dies with exit code 137. This happens
> consistently every 6 days, regardless of the data.
> Looking at the application logs, it seems that when the issue begins, tasks
> are being completed by executors, but the driver takes a while to acknowledge
> them. I have taken numerous memory dumps of the driver (before it hits the
> 6 day wall) using *jcmd* and can see that
> *org.apache.spark.scheduler.AsyncEventQueue* is growing in size even though
> the application is keeping up with batches. I have yet to take a snapshot of
> the application in the broken state.
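For anyone wanting to poke at this before a heap snapshot of the broken state is
available, below is a minimal sketch, not a fix, of how the listener bus queue
capacity could be raised and the queue growth sampled on the driver. The config key
is a standard Spark setting; the capacity value, the app name, and the grep filter
are illustrative assumptions only.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Illustrative only: raise the listener bus event queue capacity (default 10000)
    // so a slow listener drops events later; this delays, but does not fix, a backlog.
    val conf = new SparkConf()
      .setAppName("kinesis-stream") // hypothetical app name
      .set("spark.scheduler.listenerbus.eventqueue.capacity", "30000")

    val ssc = new StreamingContext(conf, Seconds(60)) // 60s batch interval, as in the report
    // ... define the Kinesis DStream, transformations, and an output operation here ...
    ssc.start()
    ssc.awaitTermination()

    // While the job runs, instance counts of queued scheduler events can be sampled with
    //   jcmd <driver-pid> GC.class_histogram | grep org.apache.spark.scheduler
    // A steadily growing count of SparkListener*Event instances on the driver would be
    // consistent with the AsyncEventQueue backlog described above.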