Luca Canali created SPARK-28091:
-----------------------------------
Summary: Extend Spark metrics system with executor plugin metrics
Key: SPARK-28091
URL: https://issues.apache.org/jira/browse/SPARK-28091
Project: Spark
Issue Type: Improvement
Components: Spark Core
Affects Versions: 3.0.0
Reporter: Luca Canali
This proposes to improve Spark instrumentation by adding a hook for Spark
executor plugin metrics to the Spark metrics system, which is implemented with
the Dropwizard/Codahale library.
Context: The Spark metrics system provides a large variety of metrics (see also
SPARK-26890) that are useful for monitoring and troubleshooting Spark
workloads. A typical workflow is to sink the metrics to a storage system and
build dashboards on top of that.
Improvement: The original goal of this work was to add instrumentation for S3
filesystem access metrics by Spark jobs. Currently, [[ExecutorSource]]
instruments HDFS and local filesystem metrics. Rather than extending the code
there, we propose to add a metrics plugin system, which is more flexible and of
more general use.
Advantages:
* The metrics plugin system makes it easy to implement instrumentation for S3
access by Spark jobs.
* The metrics plugin system allows for easy extensions of how Spark collects
HDFS-related workload metrics. This is currently done using the Hadoop
Filesystem GetAllStatistics method, which is deprecated in recent versions of
Hadoop. Recent versions of Hadoop Filesystem recommend using the
GetGlobalStorageStatistics method instead, which also provides several
additional metrics. GetGlobalStorageStatistics is not available in Hadoop 2.7
(it was introduced in Hadoop 2.8). A metrics plugin would give those deploying
suitable Hadoop versions an easy way to “opt in” to the newer API calls.
* We also have the use case of adding Hadoop filesystem monitoring for a
custom Hadoop compliant filesystem in use in our organization (EOS using the
XRootD protocol). The metrics plugin infrastructure makes this easy to do.
Others may have similar use cases.
* More generally, this method makes it straightforward to plug filesystem
and other metrics into the Spark monitoring system. Future work on plugin
implementations can extend monitoring to measure usage of external
resources (OS, filesystem, network, accelerator cards, etc.) that might not
normally be considered general enough for inclusion in Apache Spark code,
but that can nevertheless be useful for specialized use cases, tests or
troubleshooting.
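As a rough illustration of the idea only (this is not the proposed
implementation), the sketch below shows how a plugin could contribute its own
gauges to a registry that a sink then reads periodically. Everything in it is
hypothetical: GaugeRegistry is a JDK-only stand-in for a Dropwizard-style
metric registry, and ExecutorMetricsPlugin, S3CountersPlugin and the metric
names are invented for the example and do not exist in Spark.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

// Hypothetical sketch: none of these types exist in Spark or Dropwizard.
class PluginMetricsSketch {

    // Stand-in for a Dropwizard-style registry: named gauges read on demand.
    static final class GaugeRegistry {
        private final Map<String, Supplier<Long>> gauges = new TreeMap<>();

        void register(String name, Supplier<Long> gauge) {
            if (gauges.putIfAbsent(name, gauge) != null) {
                throw new IllegalArgumentException("Duplicate metric name: " + name);
            }
        }

        // Reads every gauge once, as a sink would on each reporting interval.
        Map<String, Long> snapshot() {
            Map<String, Long> values = new TreeMap<>();
            gauges.forEach((name, gauge) -> values.put(name, gauge.get()));
            return values;
        }
    }

    // Hypothetical hook: the executor would call this once per configured plugin.
    interface ExecutorMetricsPlugin {
        void registerMetrics(GaugeRegistry registry);
    }

    // Example plugin exposing read counters for an S3-like filesystem.
    static final class S3CountersPlugin implements ExecutorMetricsPlugin {
        private final AtomicLong bytesRead = new AtomicLong();
        private final AtomicLong readOps = new AtomicLong();

        @Override
        public void registerMetrics(GaugeRegistry registry) {
            // Illustrative metric names, not names defined by Spark.
            registry.register("filesystem.s3a.read_bytes", bytesRead::get);
            registry.register("filesystem.s3a.read_ops", readOps::get);
        }

        // Called by the (imaginary) instrumented read path.
        void recordRead(long bytes) {
            bytesRead.addAndGet(bytes);
            readOps.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        GaugeRegistry registry = new GaugeRegistry();
        S3CountersPlugin plugin = new S3CountersPlugin();
        plugin.registerMetrics(registry);

        plugin.recordRead(4096);
        plugin.recordRead(1024);

        // A real sink would ship these values to a storage system instead.
        System.out.println(registry.snapshot());
    }
}
```

The point of the design is that the core code only needs to know the hook
interface; which counters exist, and how they are collected, stays entirely
inside the plugin.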
Implementation:
The proposed implementation is currently a WIP, open for comments and
improvements. It is based on the Executor Plugin work of SPARK-24918 and
builds on recent work on extending Spark executor metrics, such as SPARK-25228.
Tests and examples:
So far this has been tested manually by running Spark on YARN and K8S clusters,
in particular for monitoring S3 and for extending HDFS instrumentation with the
Hadoop Filesystem “GetGlobalStorageStatistics” metrics. An executor metrics
plugin example and the code used for testing are available.