squito commented on issue #24132: [SPARK-27189][CORE] Add executor-level memory 
usage metrics to the metrics system
URL: https://github.com/apache/spark/pull/24132#issuecomment-477654258
 
 
   @LucaCanali thanks for sharing that info.  this is a bit of a sidetrack here 
-- but can you explain why you want to use the metrics system instead of just 
using spark's own tracking in event logs?  I always thought the major advantage 
was that if you had a distributed metrics collection mechanism, you could 
collect metrics at a much higher rate.  But from the blog post you linked to, 
it looks like you're only pulling data every 10 seconds -- the same rate as 
spark's heartbeat.
   
   I suppose there are other advantages -- (1) there is a bigger ecosystem 
around graphite for plotting etc., and (2) even though the driver gets updates 
every 10 seconds, we don't put them all into the event log (to avoid huge 
event logs); we only record per-stage peaks.
   
   The major problem I had in the past when I tried to use the metrics 
system (which was a *long* time ago) is that it didn't know anything about 
tasks & scheduling.  So I'd wonder why one executor had a peak value much 
higher than others at time X, and only after a lot more digging would I 
discover it was because it was the only one actually running a task at that 
time.  The spark-specific reporting, in the UI and in the event logs, handles 
that complication.
   
   I'm just trying to make sure we have a good explanation for why there are 
these different systems, the pros and cons of each, and what the end goal here 
should be.
   
   I do like that this approach lets us take all the work for ExecutorMetrics 
and put it into the other system too, with minimal extra work.  I actually 
wonder if we should go the other way -- should there be a way to take 
everything from the MetricsSystem and put it into ExecutorMetrics as well?  
Maybe not, just something to consider.
   
   Anyway, for this specific change -- I'm currently thinking the right thing 
to do is to wait for https://github.com/apache/spark/pull/23767, and then 
combine this and https://github.com/apache/spark/pull/23306 into one change.  
We should use Gauges, we should be able to expose metrics at a higher frequency 
than heartbeats, and we should avoid excessive polling by using our own version 
of `CachedGauge` which just references a shared metric snapshot.  Something like
   
   
   ```scala
   import com.codahale.metrics.{Gauge, MetricRegistry}
   import org.apache.spark.metrics.ExecutorMetricType

   class ExecutorMetricGauges {
     // or maybe this needs to be an AtomicLongArray or an AtomicReference[Array[Long]],
     // need to think about this ...
     // starts as all zeros until the first poll
     @volatile var snapshot: Array[Long] = new Array[Long](ExecutorMetricType.numMetrics)

     // called by ExecutorMetricsPoller whenever it takes a fresh poll
     def updateSnapshot(newSnapshot: Array[Long]): Unit = {
       snapshot = newSnapshot
     }

     private class ExecutorMetricGauge(idx: Int) extends Gauge[Long] {
       override def getValue: Long = snapshot(idx)
     }

     // This looks like a bunch of independent gauges as far as the metrics
     // system is concerned, but actually they're all reading the one shared
     // snapshot.
     val gauges: IndexedSeq[Gauge[Long]] = (0 until ExecutorMetricType.numMetrics).map { idx =>
       new ExecutorMetricGauge(idx)
     }.toIndexedSeq

     def register(registry: MetricRegistry): Unit = {
       ExecutorMetricType.metricToOffset.foreach { case (name, idx) =>
         registry.register(MetricRegistry.name(name), gauges(idx))
       }
     }
   }
   ```
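   
   And just to sketch how that could be wired up (the standalone 
`MetricRegistry` and the all-zeros snapshot below are only for illustration -- 
on a real executor this would hook into the existing MetricsSystem source 
registration and the poller thread):
   
   ```scala
   import com.codahale.metrics.MetricRegistry
   import org.apache.spark.metrics.ExecutorMetricType
   import scala.collection.JavaConverters._

   object ExecutorMetricGaugesExample {
     def main(args: Array[String]): Unit = {
       val registry = new MetricRegistry
       val gauges = new ExecutorMetricGauges
       gauges.register(registry)

       // Whatever polls memory metrics just swaps in the latest snapshot; the
       // registered gauges read it lazily when a sink scrapes the registry,
       // so there is no extra polling of the underlying metrics.
       gauges.updateSnapshot(new Array[Long](ExecutorMetricType.numMetrics))

       registry.getGauges.asScala.foreach { case (name, g) =>
         println(s"$name = ${g.getValue}")
       }
     }
   }
   ```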
