LucaCanali commented on issue #24132: [SPARK-27189][CORE] Add executor-level 
memory usage metrics to the metrics system
URL: https://github.com/apache/spark/pull/24132#issuecomment-477763522
 
 
   @squito I like your proposed solution for implementing this change/improvement.
   
   As for your question on why I find the metrics system useful, here are some 
comments:
   
   I like both the fine-grained drill-down capabilities of the event log, with all the details of the task metrics, *and* the Dropwizard metrics system, which I find easy to use for building performance dashboards.
   SPARK-22190 has bridged part of the gap between the two by exposing executor task metrics via the Dropwizard metrics system. I also find SPARK-25228 (executor CPU time instrumentation) useful.
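   
   To make the Dropwizard side concrete, here is a minimal standalone sketch of how a value gets exposed as a gauge and periodically shipped to a Graphite endpoint. The gauge (JVM heap in use) and the host name are just illustrative placeholders, not Spark's internal code:
   
   ```scala
   import java.net.InetSocketAddress
   import java.util.concurrent.TimeUnit
   
   import com.codahale.metrics.{Gauge, MetricRegistry}
   import com.codahale.metrics.graphite.{Graphite, GraphiteReporter}
   
   object GaugeSketch extends App {
     val registry = new MetricRegistry
   
     // Hypothetical gauge: JVM heap currently in use, in MB. Spark's own
     // metric sources register their gauges against a registry the same way.
     registry.register("executor.usedHeapMB", new Gauge[Long] {
       override def getValue: Long =
         (Runtime.getRuntime.totalMemory - Runtime.getRuntime.freeMemory) / (1024L * 1024L)
     })
   
     // Report the registry's metrics to a Graphite-compatible endpoint every 10s.
     val graphite = new Graphite(new InetSocketAddress("graphite-host", 2003))
     GraphiteReporter.forRegistry(registry)
       .convertRatesTo(TimeUnit.SECONDS)
       .convertDurationsTo(TimeUnit.MILLISECONDS)
       .build(graphite)
       .start(10, TimeUnit.SECONDS)
   
     Thread.sleep(60000)  // keep the JVM alive long enough to report a few samples
   }
   ```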
   
   One of the advantages of the performance dashboard is that it provides a unified view of what the system is doing at a given point in time, in terms of system resource utilization and key workload metrics. For example, it allows answering key questions like: how many active tasks are there now, and how does that compare with the number of available cores? What fraction of the available task time is spent on CPU, garbage collection, shuffle, and other activities? How much data are we reading/writing? How much memory is being used? BTW, not all time-based metrics are instrumented yet for a full time-based performance analysis, but what is there is already reasonably useful.
   Importantly, the performance dashboard naturally displays data as graphs over time, allowing one to study performance and system utilization as they evolve.
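   
   As a back-of-the-envelope illustration of the "fraction of task time spent on CPU" question above (all numbers are made up; executorCpuTime is reported in nanoseconds):
   
   ```scala
   // Hypothetical sample: cumulative executorCpuTime measured at two points in
   // time, 60 seconds apart, on a cluster with 40 executor cores in total.
   val cpuTimeDeltaNs   = 1.8e12                     // CPU time accrued over the interval (ns)
   val intervalSec      = 60.0
   val totalCores       = 40
   val availableCoreSec = intervalSec * totalCores   // 2400 core-seconds on offer
   val cpuFraction      = (cpuTimeDeltaNs / 1e9) / availableCoreSec
   // cpuFraction == 0.75: 75% of the available task time was spent on CPU; the
   // rest went to GC, shuffle, other waits, or idle cores (e.g. stragglers).
   ```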
   
   Notably, in Spark (and distributed systems in general) there can be significant periods of low system activity (such as the number of active tasks dropping to very low values despite the number of available cores), due to stragglers, data skew, or several other possible reasons, and the dashboard surfaces these naturally. To drill down further into the root causes, though, you may need task-level metrics data, such as the event log provides.
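   
   For that kind of drill-down, a minimal listener sketch could look like the following (illustrative only: the 30-second threshold and the choice of printed fields are arbitrary):
   
   ```scala
   import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
   
   // Print tasks whose duration exceeds a (hypothetical) threshold, together with
   // a couple of task metrics that often explain stragglers: GC time and shuffle read.
   class StragglerListener(thresholdMs: Long) extends SparkListener {
     override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
       val info = taskEnd.taskInfo
       val metrics = taskEnd.taskMetrics  // can be null for failed tasks
       if (metrics != null && info.duration > thresholdMs) {
         println(s"Slow task ${info.taskId} (stage ${taskEnd.stageId}): " +
           s"duration=${info.duration} ms, gc=${metrics.jvmGCTime} ms, " +
           s"shuffleReadBytes=${metrics.shuffleReadMetrics.totalBytesRead}")
       }
     }
   }
   
   // Attach with: sc.addSparkListener(new StragglerListener(thresholdMs = 30000))
   ```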
   
   In terms of architecture, I like that the Dropwizard metrics system sends the metrics directly from the executors to the backend DB (a Graphite endpoint/InfluxDB in my case). Systems based on Spark listeners, such as the event log, have to go via the driver, and this can be a bit of a bottleneck in some cases (for example with many executors and many short tasks).
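   
   For reference, a sketch of that wiring (host/port are placeholders; the same graphite sink keys can equivalently go in conf/metrics.properties without the spark.metrics.conf. prefix). With this in place, each executor's sink reports straight to the endpoint, without a driver hop:
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder()
     .appName("metrics-to-graphite")
     // Route every metrics instance (driver, executors, ...) to a
     // Graphite-compatible endpoint, e.g. InfluxDB's Graphite listener.
     .config("spark.metrics.conf.*.sink.graphite.class",
             "org.apache.spark.metrics.sink.GraphiteSink")
     .config("spark.metrics.conf.*.sink.graphite.host", "graphite-host")
     .config("spark.metrics.conf.*.sink.graphite.port", "2003")
     .config("spark.metrics.conf.*.sink.graphite.period", "10")
     .config("spark.metrics.conf.*.sink.graphite.unit", "seconds")
     .getOrCreate()
   ```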
   
   I have tried to summarize some of the main points I have come across on this topic so far. I guess there is more out there, and there is room to write a longer comment/blog/presentation at some point, maybe also to see if more people have opinions on these topics.
   
