[
https://issues.apache.org/jira/browse/SPARK-26329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715520#comment-16715520
]
Imran Rashid commented on SPARK-26329:
--------------------------------------
Oh, good point. I totally forgot about that complication we had to work through
on the driver side.

Yes, I think we should try to track the peak for each stage. Even without
thinking about overlapping stages, this would also be a problem for stages run
sequentially, both within a single heartbeat interval, with very different
memory profiles:

t=10: stage 1 (high memory use) starts
t=11: polled value = 1e6
t=12: stage 1 ends
t=13: stage 2 (low memory use) starts
t=14: polled value = 1
t=15: heartbeat

We don't want to make it look like stage 2 was using a ton of memory.
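To make the attribution concrete, here is a minimal sketch of tracking a separate peak per running stage, so the heartbeat can report the peak observed while each stage was active rather than one executor-wide max. The class and method names are hypothetical, not Spark's actual internals:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: keep one peak per active stage, so a sample taken while stage 1
// was running is never attributed to stage 2. Hypothetical names, not
// Spark's actual implementation.
class PerStagePeakTracker {
    // stageId -> peak metric value observed while that stage was active
    private final Map<Integer, Long> peaks = new ConcurrentHashMap<>();

    void stageStarted(int stageId) {
        peaks.put(stageId, Long.MIN_VALUE);
    }

    // Called on every poll: attribute the sample to all currently active stages.
    void recordSample(long value) {
        peaks.replaceAll((stage, peak) -> Math.max(peak, value));
    }

    // At stage end (or heartbeat), read out and clear that stage's peak.
    long stageEnded(int stageId) {
        return peaks.remove(stageId);
    }
}
```

With this shape, the t=11 sample (1e6) lands only in stage 1's entry, and stage 2 reports only what was sampled after t=13.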
> ExecutorMetrics should poll faster than heartbeats
> --------------------------------------------------
>
> Key: SPARK-26329
> URL: https://issues.apache.org/jira/browse/SPARK-26329
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Web UI
> Affects Versions: 3.0.0
> Reporter: Imran Rashid
> Priority: Major
>
> We should allow faster polling of the executor memory metrics (SPARK-23429 /
> SPARK-23206) without requiring a faster heartbeat rate. We've seen the
> memory usage of executors spike over 1 GB in less than a second, but
> heartbeats are only every 10 seconds (by default). Spark needs to enable
> fast polling to capture these peaks, without causing too much strain on the
> system.
> In the current implementation, the metrics are polled along with the
> heartbeat, but this leads to a slow rate of polling metrics by default. If
> users were to increase the rate of the heartbeat, they risk overloading the
> driver on a large cluster, with too many messages and too much work to
> aggregate the metrics. But, the executor could poll the metrics more
> frequently, and still only send the *max* since the last heartbeat for each
> metric. This keeps the load on the driver the same, and only introduces a
> small overhead on the executor to grab the metrics and keep the max.
> The downside of this approach is that we still need to wait for the next
> heartbeat for the driver to be aware of the new peak. If the executor dies
> or is killed before then, then we won't find out. A potential future
> enhancement would be to send an update *anytime* there is an increase by some
> percentage, but we'll leave that out for now.
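The poll-frequently-and-keep-the-max idea above can be sketched as follows; the names and structure are illustrative, not the actual Spark implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a fast polling thread records samples, but only the max since the
// last heartbeat is kept; the heartbeat drains it. Illustrative names only.
class MaxSinceHeartbeat {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Called by the polling thread, e.g. every 100 ms.
    void poll(long sample) {
        max.accumulateAndGet(sample, Math::max);
    }

    // Called at each heartbeat: report the window's peak and reset it.
    long drain() {
        return max.getAndSet(Long.MIN_VALUE);
    }
}
```

A `ScheduledExecutorService` could drive `poll()` at a sub-second interval while `drain()` runs at the (default 10 s) heartbeat interval; the driver still receives one value per metric per heartbeat, so its load is unchanged.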
> Another possibility would be to change the metrics themselves to track peaks
> for us, so we don't have to fine-tune the polling rate. For example, some
> jvm metrics provide a usage threshold, and notification:
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold
> But, that is not available on all metrics. This proposal gives us a generic
> way to get a more accurate peak memory usage for *all* metrics.
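For reference, the JVM-side alternative mentioned above looks roughly like this: memory pools that support it can be given a usage threshold, and each pool also tracks its own peak via getPeakUsage(). This only covers JVM memory pools, which is why it doesn't generalize to all metrics:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Sketch: enumerate JVM memory pools; where supported, set a usage
// threshold and read the pool-tracked peak. The 100 MB threshold is an
// arbitrary example value.
public class PoolThresholdDemo {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.isUsageThresholdSupported()) {
                // Ask the JVM to flag when usage crosses 100 MB.
                pool.setUsageThreshold(100L * 1024 * 1024);
                System.out.printf("%s: peak=%d bytes%n",
                        pool.getName(), pool.getPeakUsage().getUsed());
            }
        }
    }
}
```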
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)