LucaCanali commented on issue #24132: [SPARK-27189][CORE] Add executor-level memory usage metrics to the metrics system URL: https://github.com/apache/spark/pull/24132#issuecomment-477763522

@squito I like your proposed solution for implementing this change/improvement. As for your question on why I find the metrics system useful, here are some comments:

I like both the fine-grained drill-down capabilities of the event log, with all the details of task metrics, *and* the Dropwizard metrics system, which I find easy to use for building performance dashboards. SPARK-22190 has bridged part of the gap between the two by exposing executor task metrics via the Dropwizard metrics system. I also find SPARK-25228 (executor CPU time instrumentation) useful.

One of the advantages of the performance dashboard is that it provides a unified view of what the system is doing at a particular point in time, in terms of system resource utilization and key workload metrics. For example, it allows answering key questions like: how many active tasks are there now, and how does that compare with the number of available cores? What fraction of the available task time is spent on CPU, garbage collection, shuffle, or other activities? How much data are we reading/writing? How much memory is being used? BTW, not all time-based metrics are instrumented yet for a full time-based performance analysis, but what is there is already reasonably useful.

Importantly, the performance dashboard naturally displays data as graphs over time, allowing one to study performance and system utilization over time. Notably, in Spark (and distributed systems in general) there can be significant periods of low system activity (such as the number of active tasks dropping to very low values despite the number of available cores) due to stragglers, data skew, or several other possible reasons, and the dashboard identifies these naturally. To drill down further on the root causes, though, you may still need task-level metrics data, such as the event log.

In terms of architecture, I like that the Dropwizard metrics system sends the metrics directly from the executors to the backend DB (a Graphite endpoint/InfluxDB in my case); see the configuration sketch below. Systems based on Spark listeners, such as the event log, have to go via the driver, and this can be a bit of a bottleneck in some cases (for example with many executors and many short tasks); see the listener sketch after this comment.

I have tried to summarize some of the main points I have come across on this topic so far. I guess there is more out there, and there is room to write a longer comment/blog/presentation at some point, maybe also to see if more people have opinions on this topic.
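For reference, this is roughly how the executor-to-backend path gets wired up: a minimal metrics.properties sketch enabling Spark's Graphite sink (which an InfluxDB Graphite input can also receive). The host, port, and prefix values are placeholders for my environment, not defaults:

```
# metrics.properties - enable the Graphite sink for all Spark instances
# (master, worker, driver, executor). Host/port/prefix are placeholders.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.mydomain.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark_workload
```

Point Spark at the file with `--conf spark.metrics.conf=metrics.properties` (or place it in `$SPARK_CONF_DIR`); each executor then pushes its metrics to the backend on its own, without going through the driver.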
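For contrast, the listener-based path can be illustrated with a minimal Scala sketch (class and output format are mine, for illustration): every task-end event from every executor is shipped to the driver and processed there, which is where the bottleneck mentioned above can appear.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// A minimal listener reading per-task metrics. All these events are
// delivered to the driver, so with many executors and many short
// tasks the driver can become a bottleneck.
class TaskMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) { // metrics can be missing, e.g. for failed tasks
      println(s"stage=${taskEnd.stageId} " +
        s"cpuTime(ns)=${m.executorCpuTime} " +
        s"runTime(ms)=${m.executorRunTime} " +
        s"gcTime(ms)=${m.jvmGCTime}")
    }
  }
}

// Register on the driver, e.g.:
//   sc.addSparkListener(new TaskMetricsListener)
```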
