Imran Rashid created SPARK-26329:
------------------------------------

             Summary: ExecutorMetrics should poll faster than heartbeats
                 Key: SPARK-26329
                 URL: https://issues.apache.org/jira/browse/SPARK-26329
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, Web UI
    Affects Versions: 3.0.0
            Reporter: Imran Rashid


We should allow faster polling of the executor memory metrics (SPARK-23429 / 
SPARK-23206) without requiring a faster heartbeat rate.  We've seen the memory 
usage of executors pike over 1 GB in less than a second, but heartbeats are 
only every 10 seconds (by default).  Spark needs to enable fast polling to 
capture these peaks, without causing too much strain on the system.

In the current implementation, the metrics are polled along with the heartbeat, 
but this leads to a slow rate of polling metrics by default.  If users were to 
increase the rate of the heartbeat, they risk overloading the driver on a large 
cluster, with too many messages and too much work to aggregate the metrics.  
But, the executor could poll the metrics more frequently, and still only send 
the *max* since the last heartbeat for each metric.  This keeps the load on the 
driver the same, and only introduces a small overhead on the executor to grab 
the metrics and keep the max.

The downside of this approach is that we still need to wait for the next 
heartbeat for the driver to be aware of the new peak.   If the executor dies or 
is killed before then, then we won't find out.  A potential future enhancement 
would be to send an update *anytime* there is an increase by some percentage, 
but we'll leave that out for now.

Another possibility would be to change the metrics themselves to track peaks 
for us, so we don't have to fine-tune the polling rate.  For example, some jvm 
metrics provide a usage threshold, and notification: 
https://docs.oracle.com/javase/7/docs/api/java/lang/management/MemoryPoolMXBean.html#UsageThreshold

But, that is not available on all metrics.  This proposal gives us a generic 
way to get a more accurate peak memory usage for *all* metrics.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to