Imran Rashid created SPARK-26329:
------------------------------------
Summary: ExecutorMetrics should poll faster than heartbeats
Key: SPARK-26329
URL: https://issues.apache.org/jira/browse/SPARK-26329
Project: Spark
Issue Type: Improvement
Components: Spark Core, Web UI
Affects Versions: 3.0.0
Reporter: Imran Rashid
We should allow faster polling of the executor memory metrics (SPARK-23429 /
SPARK-23206) without requiring a faster heartbeat rate. We've seen the memory
usage of executors pike over 1 GB in less than a second, but heartbeats are
only every 10 seconds (by default). Spark needs to enable fast polling to
capture these peaks, without causing too much strain on the system.
In the current implementation, the metrics are polled along with the heartbeat,
but this leads to a slow rate of polling metrics by default. If users were to
increase the rate of the heartbeat, they risk overloading the driver on a large
cluster, with too many messages and too much work to aggregate the metrics.
But, the executor could poll the metrics more frequently, and still only send
the *max* since the last heartbeat for each metric. This keeps the load on the
driver the same, and only introduces a small overhead on the executor to grab
the metrics and keep the max.
The downside of this approach is that we still need to wait for the next
heartbeat for the driver to be aware of the new peak. If the executor dies or
is killed before then, then we won't find out. A potential future enhancement
would be to send an update *anytime* there is an increase by some percentage,
but we'll leave that out for now.
Another possibility would be to change the metrics themselves to track peaks
for us, so we don't have to fine-tune the polling rate. For example, some jvm
metrics provide a usage threshold, and notification:
https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold
But, that is not available on all metrics. This proposal gives us a generic
way to get a more accurate peak memory usage for *all* metrics.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]