[ 
https://issues.apache.org/jira/browse/SPARK-28091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16872075#comment-16872075
 ] 

Luca Canali commented on SPARK-28091:
-------------------------------------

Thank you [[email protected]] for your comment and clarifications. Indeed, what 
I am trying to do is to collect executor-level metrics for S3A (and also for 
other Hadoop-compatible filesystems of interest). The goal is to bring these 
metrics into the Spark metrics system, so that they can be used, for example, 
in a performance dashboard and displayed together with the rest of the 
instrumentation metrics.

The original work for this started from the need to measure I/O metrics for a 
custom HDFS-compatible filesystem that we use (called ROOT:) and, more 
recently, also for S3A. Our first implementation was simple: a small change in 
[[ExecutorSource]], which already has code to collect metrics for the "hdfs" 
and "file" (local) filesystems at the executor level. That code is obviously 
easy to extend, but going that way feels like a short-term hack.
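
For context, the filesystem instrumentation in [[ExecutorSource]] boils down 
to registering Dropwizard gauges over Hadoop FileSystem statistics, with the 
schemes hardcoded; the "short-term hack" is just adding "s3a" (or "root") to 
that list. A simplified sketch of the pattern, not the exact Spark code:

{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Simplified sketch of the ExecutorSource pattern: one Dropwizard gauge per
// (scheme, statistic), backed by Hadoop's (now deprecated) getAllStatistics.
class FileSystemMetricsSketch(metricRegistry: MetricRegistry) {

  private def fileStats(scheme: String): Option[FileSystem.Statistics] =
    FileSystem.getAllStatistics.asScala.find(_.getScheme == scheme)

  private def registerFileSystemStat[T](scheme: String, name: String,
      f: FileSystem.Statistics => T, defaultValue: T): Unit = {
    metricRegistry.register(MetricRegistry.name("filesystem", scheme, name),
      new Gauge[T] {
        override def getValue: T = fileStats(scheme).map(f).getOrElse(defaultValue)
      })
  }

  // The "short-term hack": extending this hardcoded list with "s3a" or "root".
  for (scheme <- Seq("hdfs", "file")) {
    registerFileSystemStat(scheme, "read_bytes", _.getBytesRead, 0L)
    registerFileSystemStat(scheme, "write_bytes", _.getBytesWritten, 0L)
  }
}
{code}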

My thought with this PR is to provide a flexible method to add 
instrumentation: it profits from our current use case of I/O workload 
monitoring, but is also open to several other use cases. 
I am also quite interested to see developments in this area for CPU counters 
and possibly also GPU-related instrumentation.
I think the proposal to use executor plugins for this goes in the direction 
originally outlined by [~irashid] and collaborators in SPARK-24918.
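
To make the idea concrete, here is a minimal sketch of what such a plugin 
could look like. It assumes a hypothetical extension of the SPARK-24918 
ExecutorPlugin interface where the plugin is handed the executor's Dropwizard 
MetricRegistry at init time; the exact hook and signature are precisely what 
this proposal is about, so take this as an illustration only:

{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}
import org.apache.hadoop.fs.FileSystem
import org.apache.spark.ExecutorPlugin

// Hypothetical sketch: assumes the ExecutorPlugin API is extended so that
// plugins receive the executor's MetricRegistry (not in Spark today).
class S3AMetricsPlugin extends ExecutorPlugin {

  // Hypothetical init hook receiving the executor's metric registry.
  def init(metricRegistry: MetricRegistry): Unit = {
    metricRegistry.register(MetricRegistry.name("filesystem", "s3a", "bytesRead"),
      new Gauge[Long] {
        // Polled by the metrics sinks; reads the Hadoop 2.8+ storage
        // statistics. Statistic key names can vary by filesystem and
        // Hadoop version; "bytesRead" is the generic per-scheme counter.
        override def getValue: Long =
          Option(FileSystem.getGlobalStorageStatistics.get("s3a"))
            .flatMap(s => Option(s.getLong("bytesRead")))
            .map(_.longValue)
            .getOrElse(0L)
      })
  }
}
{code}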

Some links to reference and related material. The code of a few test executor 
metrics plugins that I am developing is at 
[https://github.com/cerndb/SparkExecutorPlugins]. 
The general idea of how to build a dashboard with Spark metrics is described in 
[https://db-blog.web.cern.ch/blog/luca-canali/2019-02-performance-dashboard-apache-spark]

> Extend Spark metrics system with executor plugin metrics
> --------------------------------------------------------
>
>                 Key: SPARK-28091
>                 URL: https://issues.apache.org/jira/browse/SPARK-28091
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Luca Canali
>            Priority: Minor
>
> This proposes to improve Spark instrumentation by adding a hook for Spark 
> executor plugin metrics to the Spark metrics system, which is implemented 
> with the Dropwizard/Codahale library.
> Context: The Spark metrics system provides a large variety of metrics (see 
> also SPARK-26890) useful for monitoring and troubleshooting Spark workloads. 
> A typical workflow is to sink the metrics to a storage system and build 
> dashboards on top of that.
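> For example, a typical metrics.properties that sinks all metrics to a 
> Graphite endpoint could look like this (host and port are placeholders):
> {code}
> *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
> *.sink.graphite.host=graphite.example.com
> *.sink.graphite.port=2003
> *.sink.graphite.period=10
> *.sink.graphite.unit=seconds
> {code}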
> Improvement: The original goal of this work was to add instrumentation for 
> S3 filesystem access metrics by Spark jobs. Currently, [[ExecutorSource]] 
> instruments HDFS and local filesystem metrics. Rather than extending the code 
> there, we propose to add a metrics plugin system, which is more flexible and 
> of more general use.
> Advantages:
>  * The metrics plugin system makes it easy to implement instrumentation for 
> S3 access by Spark jobs.
>  * The metrics plugin system allows for easy extension of how Spark collects 
> HDFS-related workload metrics. This is currently done using the Hadoop 
> FileSystem getAllStatistics method, which is deprecated in recent versions of 
> Hadoop. Recent versions of Hadoop FileSystem recommend the method 
> getGlobalStorageStatistics, which also provides several additional metrics. 
> getGlobalStorageStatistics is not available in Hadoop 2.7 (it was introduced 
> in Hadoop 2.8). Using a metrics plugin for Spark would provide an easy way to 
> “opt in” to such new API calls for those deploying suitable Hadoop versions 
> (see the sketch after this list).
>  * We also have the use case of adding Hadoop filesystem monitoring for a 
> custom Hadoop-compliant filesystem in use in our organization (EOS, using 
> the XRootD protocol). The metrics plugin infrastructure makes this easy to 
> do. Others may have similar use cases.
>  * More generally, this method makes it straightforward to plug filesystem 
> and other metrics into the Spark monitoring system. Future work on plugin 
> implementations can extend monitoring to measure the usage of external 
> resources (OS, filesystem, network, accelerator cards, etc.) that would 
> perhaps not be considered general enough for inclusion in Apache Spark code, 
> but that can nevertheless be useful for specialized use cases, tests, or 
> troubleshooting.
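> To illustrate the “opt in” point above, here is a minimal sketch that 
> enumerates the newer Hadoop API (assuming Hadoop 2.8+ on the classpath; the 
> available statistic names vary per filesystem):
> {code:scala}
> import org.apache.hadoop.fs.FileSystem
> import scala.collection.JavaConverters._
>
> // Enumerate all counters exposed by the Hadoop 2.8+ storage statistics API,
> // for every filesystem scheme this JVM has touched (hdfs, s3a, root, ...).
> for (fsStats <- FileSystem.getGlobalStorageStatistics.iterator().asScala;
>      stat    <- fsStats.getLongStatistics.asScala) {
>   println(s"${fsStats.getName}.${stat.getName} = ${stat.getValue}")
> }
> {code}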
> Implementation:
> The proposed implementation is currently a WIP, open for comments and 
> improvements. It is based on the Executor Plugin work of SPARK-24918 and 
> builds on recent work on extending Spark executor metrics, such as 
> SPARK-25228.
> Tests and examples:
> So far this has been manually tested running Spark on YARN and K8S clusters, 
> in particular for monitoring S3 and for extending HDFS instrumentation with 
> the Hadoop FileSystem getGlobalStorageStatistics metrics. An executor metrics 
> plugin example and the code used for testing are available.


