[
https://issues.apache.org/jira/browse/SPARK-26329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715520#comment-16715520
]
Imran Rashid commented on SPARK-26329:
--------------------------------------
Oh, good point. I totally forgot about that complication we had to work through
on the driver side.

Yes, I think we should try to track the peak for each stage. Even without
thinking about overlapping stages, this would also be a problem for stages run
sequentially, both within a single heartbeat interval, with very different
memory profiles:

t=10: stage 1 (high memory use) starts
t=11: polled value = 1e6
t=12: stage 1 ends
t=13: stage 2 (low memory use) starts
t=14: polled value = 1
t=15: heartbeat

We don't want to make it look like stage 2 was using a ton of memory.
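To make the attribution concrete, here is a minimal sketch of tracking a separate peak per running stage, so the heartbeat can report the peak observed while each stage was active rather than one executor-wide max. The class and method names are hypothetical, not Spark's actual internals:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: keep one peak per active stage, so a sample taken while stage 1
// was running is never attributed to stage 2. Hypothetical names, not
// Spark's actual implementation.
class PerStagePeakTracker {
    // stageId -> peak metric value observed while that stage was active
    private final Map<Integer, Long> peaks = new ConcurrentHashMap<>();

    void stageStarted(int stageId) {
        peaks.put(stageId, Long.MIN_VALUE);
    }

    // Called on every poll: attribute the sample to all currently active stages.
    void recordSample(long value) {
        peaks.replaceAll((stage, peak) -> Math.max(peak, value));
    }

    // At stage end (or heartbeat), read out and clear that stage's peak.
    long stageEnded(int stageId) {
        return peaks.remove(stageId);
    }
}
```

With this shape, the t=11 sample (1e6) lands only in stage 1's entry, and stage 2 reports only what was sampled after t=13.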
> ExecutorMetrics should poll faster than heartbeats
> --------------------------------------------------
>
> Key: SPARK-26329
> URL: https://issues.apache.org/jira/browse/SPARK-26329
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Web UI
> Affects Versions: 3.0.0
> Reporter: Imran Rashid
> Priority: Major
>
> We should allow faster polling of the executor memory metrics (SPARK-23429 /
> SPARK-23206) without requiring a faster heartbeat rate. We've seen the
> memory usage of executors spike over 1 GB in less than a second, but
> heartbeats are only every 10 seconds (by default). Spark needs to enable
> fast polling to capture these peaks, without causing too much strain on the
> system.
> In the current implementation, the metrics are polled along with the
> heartbeat, but this leads to a slow rate of polling metrics by default. If
> users were to increase the rate of the heartbeat, they risk overloading the
> driver on a large cluster, with too many messages and too much work to
> aggregate the metrics. But, the executor could poll the metrics more
> frequently, and still only send the *max* since the last heartbeat for each
> metric. This keeps the load on the driver the same, and only introduces a
> small overhead on the executor to grab the metrics and keep the max.
> The downside of this approach is that we still need to wait for the next
> heartbeat for the driver to be aware of the new peak. If the executor dies
> or is killed before then, then we won't find out. A potential future
> enhancement would be to send an update *anytime* there is an increase by some
> percentage, but we'll leave that out for now.
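The poll-frequently-and-keep-the-max idea above can be sketched as follows; the names and structure are illustrative, not the actual Spark implementation:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a fast polling thread records samples, but only the max since the
// last heartbeat is kept; the heartbeat drains it. Illustrative names only.
class MaxSinceHeartbeat {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Called by the polling thread, e.g. every 100 ms.
    void poll(long sample) {
        max.accumulateAndGet(sample, Math::max);
    }

    // Called at each heartbeat: report the window's peak and reset it.
    long drain() {
        return max.getAndSet(Long.MIN_VALUE);
    }
}
```

A `ScheduledExecutorService` could drive `poll()` at a sub-second interval while `drain()` runs at the (default 10 s) heartbeat interval; the driver still receives one value per metric per heartbeat, so its load is unchanged.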
> Another possibility would be to change the metrics themselves to track peaks
> for us, so we don't have to fine-tune the polling rate. For example, some
> jvm metrics provide a usage threshold, and notification:
> https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold
> But, that is not available on all metrics. This proposal gives us a generic
> way to get a more accurate peak memory usage for *all* metrics.
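For reference, the JVM-side alternative mentioned above looks roughly like this: memory pools that support it can be given a usage threshold, and each pool also tracks its own peak via getPeakUsage(). This only covers JVM memory pools, which is why it doesn't generalize to all metrics:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Sketch: enumerate JVM memory pools; where supported, set a usage
// threshold and read the pool-tracked peak. The 100 MB threshold is an
// arbitrary example value.
public class PoolThresholdDemo {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.isUsageThresholdSupported()) {
                // Ask the JVM to flag when usage crosses 100 MB.
                pool.setUsageThreshold(100L * 1024 * 1024);
                System.out.printf("%s: peak=%d bytes%n",
                        pool.getName(), pool.getPeakUsage().getUsed());
            }
        }
    }
}
```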
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)