squito commented on issue #24132: [SPARK-27189][CORE] Add executor-level memory usage metrics to the metrics system
URL: https://github.com/apache/spark/pull/24132#issuecomment-477654258

@LucaCanali thanks for sharing that info. This is a bit of a sidetrack here -- but can you explain why you want to use the metrics system instead of just using Spark's own tracking in event logs? I always thought the major advantage was that if you had a distributed metrics collection mechanism, you could collect metrics at a much higher rate. But from the blog post you linked to, it looks like you're only pulling in data every 10 seconds -- the same rate as Spark's heartbeat. I suppose there are other advantages: (1) there is a bigger ecosystem around graphite for plotting etc., and (2) even though the driver gets updates every 10 seconds, we don't put them all into the event log (to avoid huge event logs); we only record per-stage peaks.

The major problem I had in the past when I tried to use the metrics system (which was a *long* time ago) is that it didn't know anything about tasks & scheduling. So I'd wonder why one executor had a peak value much higher than the others at time X, and only after a lot more digging would I discover it was because it was the only one actually running a task at that time. The Spark-specific reporting, in the UI and in the event logs, handles that complication.

I'm just trying to make sure we have a good explanation for why these different systems exist, the pros and cons of each, and what the end goal here should be. I do like that this approach lets us take all the work for ExecutorMetrics and expose it in the other system too, with minimal extra work. I actually wonder if we should go the other way -- should there be a way to take everything from the MetricsSystem and put it into ExecutorMetrics as well? Maybe not, just something to consider.

Anyway, for this specific change -- I'm currently thinking the right thing to do is to wait for https://github.com/apache/spark/pull/23767, and then combine this and https://github.com/apache/spark/pull/23306 into one change. We should use Gauges, we should be able to expose metrics at a higher frequency than heartbeats, and we should avoid excessive polling by using our own version of `CachedGauge` which just references a shared metric snapshot. Something like:

```scala
import com.codahale.metrics.{Gauge, MetricRegistry}

import org.apache.spark.metrics.ExecutorMetricType

class ExecutorMetricGauges {
  // or maybe this needs to be an AtomicLongArray or an AtomicReference[Array[Long]],
  // need to think about this ...
  @volatile private var snapshot: Array[Long] = Array.fill(ExecutorMetricType.numMetrics)(0L)

  // called by ExecutorMetricsPoller whenever it takes a fresh sample
  def updateSnapshot(newSnapshot: Array[Long]): Unit = {
    snapshot = newSnapshot
  }

  private class ExecutorMetricGauge(idx: Int) extends Gauge[Long] {
    def getValue: Long = snapshot(idx)
  }

  // This looks like a bunch of independent gauges as far as the metrics system
  // is concerned, but actually they're all reading from one shared snapshot.
  private val gauges = (0 until ExecutorMetricType.numMetrics).map { idx =>
    new ExecutorMetricGauge(idx)
  }.toIndexedSeq

  def register(registry: MetricRegistry): Unit = {
    ExecutorMetricType.metricToOffset.foreach { case (name, idx) =>
      registry.register(MetricRegistry.name(name), gauges(idx))
    }
  }
}
```
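A minimal sketch of how this might be wired up, under stated assumptions: `collectSnapshot()` is a hypothetical stand-in for however the executor actually samples its metric values (e.g. the poller proposed in https://github.com/apache/spark/pull/23767), and the 10-second schedule is just an example cadence, not part of this PR. The point it illustrates is the decoupling: sinks read cheap gauges on their own schedule, while a single poller thread refreshes the shared snapshot.

```scala
import java.util.concurrent.{Executors, TimeUnit}

import com.codahale.metrics.MetricRegistry

object ExecutorMetricGaugesSketch {
  // Hypothetical: however the executor actually gathers the current metric values.
  def collectSnapshot(): Array[Long] = ???

  def wireUp(registry: MetricRegistry): ExecutorMetricGauges = {
    val gauges = new ExecutorMetricGauges
    // Gauges are registered once; sinks read them lazily on each report,
    // without triggering any extra polling of the JVM / procfs.
    gauges.register(registry)

    // One scheduled task owns the refresh cadence; every registered gauge just
    // indexes into the latest shared snapshot, so reporting adds no polling cost.
    val scheduler = Executors.newSingleThreadScheduledExecutor()
    scheduler.scheduleAtFixedRate(new Runnable {
      def run(): Unit = gauges.updateSnapshot(collectSnapshot())
    }, 0, 10, TimeUnit.SECONDS)
    gauges
  }
}
```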
