[
https://issues.apache.org/jira/browse/SPARK-25837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667646#comment-16667646
]
Patrick Brown edited comment on SPARK-25837 at 10/29/18 7:40 PM:
-----------------------------------------------------------------
The fundamental problem seems to be in AppStatusListener, in the cleanupStages
method.
Using the repro code above it appears that sometimes (not always) stages and
tasks get slightly backed up. When this occurs the iteration through tasks
starts taking longer and longer:
{code:java}
val tasks = kvstore.view(classOf[TaskDataWrapper])
  .index("stage")
  .first(key)
  .last(key)
  .asScala
{code}
This seems to be because for each stage we then iterate through all the
tasks (of which there can be ~400k in this repro code). That iteration can go
from taking ~10ms before the backup to ~300ms afterwards due to the large
number of tasks. This causes a feedback loop in which the `cleanupStages`
method cannot keep up.
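
To make the feedback loop concrete, here is a toy model in plain Scala. It is not Spark's code: the object name, the `simulate` function, and every number in it are hypothetical, and it only assumes what is described above, namely that cleaning one stage costs a pass over every task record still stored. Under that assumption a small backlog is cleared every round, while a backlog past a threshold is never cleared and keeps growing.

```scala
object CleanupFeedbackSketch {
  // Hypothetical toy model, not Spark's implementation.
  // Each round, `arrivals` stages finish, each leaving `tasksPerStage` task
  // records behind. Cleanup may visit at most `budget` records per round.
  // Assumption being modeled: removing one stage requires visiting every
  // record that is still stored.
  def simulate(rounds: Int, arrivals: Int, tasksPerStage: Int,
               budget: Long, initialBacklog: Int): Int = {
    var backlog = initialBacklog // stages still waiting to be cleaned
    for (_ <- 0 until rounds) {
      backlog += arrivals
      var spent = 0L
      // Keep cleaning while the next full scan still fits in this round's budget.
      while (backlog > 0 && spent + backlog.toLong * tasksPerStage <= budget) {
        spent += backlog.toLong * tasksPerStage
        backlog -= 1
      }
    }
    backlog
  }
}

// Starting empty, cleanup keeps up with arrivals.
val steady = CleanupFeedbackSketch.simulate(100, 2, 10, 100L, 0)
// Starting slightly backed up, a single scan already exceeds the budget.
val stuck = CleanupFeedbackSketch.simulate(100, 2, 10, 100L, 20)
```

With these made-up parameters, the only difference between the two runs is the initial backlog, which mirrors the "sometimes (not always)" nature of the problem: once a scan costs more than the time available, nothing is removed and each later scan is more expensive still.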
> Web UI does not respect spark.ui.retainedJobs in some instances
> ---------------------------------------------------------------
>
> Key: SPARK-25837
> URL: https://issues.apache.org/jira/browse/SPARK-25837
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.3.1
> Environment: Reproduction Environment:
> Spark 2.3.1
> Dataproc 1.3-deb9
> 1x master 4 vCPUs, 15 GB
> 2x workers 4 vCPUs, 15 GB
>
> Reporter: Patrick Brown
> Priority: Minor
> Attachments: Screen Shot 2018-10-23 at 4.40.51 PM (1).png
>
>
> Expected Behavior: Web UI only displays 1 completed job and remains
> responsive.
> Actual Behavior: Both during job execution and for a considerable time after
> all jobs have completed, the UI retains many completed jobs, which limits
> its responsiveness.
>
> To reproduce:
>
> > spark-shell --conf spark.ui.retainedJobs=1
>
> scala> import scala.concurrent._
> scala> import scala.concurrent.ExecutionContext.Implicits.global
> scala> for (i <- 0 until 50000) {
>   Future { println(sc.parallelize(0 until i).collect.length) }
> }
>
>
>
> The attached screenshot shows the state of the web UI after running the repro
> code. The UI is displaying some 43k completed jobs and takes a long time to
> load. After a few minutes of inactivity this clears out; however, in an
> application which continues to submit jobs every once in a while, the issue
> persists.
>
> The issue seems to appear when running multiple jobs at once, as well as in
> sequence for a while, and may also have something to do with high master
> CPU usage (thus the collect in the repro code). My rough guess is that
> whatever manages clearing out completed jobs gets overwhelmed (on the master,
> htop reported almost full CPU usage across all 4 cores during the repro).