Hi Stavros, All,

Interesting topic. I'll add some thoughts and personal opinions on it: I too find
the metrics system quite useful for the use case of building Grafana
dashboards, as opposed to scraping logs and/or using the Event Listener
infrastructure, as you mentioned in your mail.
A few additional points in favour of Dropwizard metrics for me are:

- Regarding the metrics defined on the ExecutorSource, I believe they scale
better than the standard Task Metrics, as the Dropwizard metrics go directly
from the executors to the sink(s) rather than passing via the driver through
the ListenerBus.

- Another advantage I see is that Dropwizard metrics make it easy to expose
information that is not otherwise available from the EventLog/Listener events,
such as executor.jvmCpuTime (SPARK-25228); a rough sketch of such a Source
follows below.
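
To illustrate that last point, here is a minimal sketch of a Dropwizard Source
exposing the JVM process CPU time as a gauge. Note that
org.apache.spark.metrics.source.Source is a private[spark] trait and the class
and metric names below are made up for the example, so this only shows the
general shape of the instrumentation, not something to drop into user code as-is:

import java.lang.management.ManagementFactory

import com.codahale.metrics.{Gauge, MetricRegistry}
import com.sun.management.OperatingSystemMXBean

import org.apache.spark.metrics.source.Source

// Sketch of a Source exposing the JVM process CPU time as a gauge,
// similar in spirit to what SPARK-25228 adds to ExecutorSource.
class JvmCpuTimeExampleSource extends Source {
  override val sourceName: String = "jvmCpuTimeExample"
  override val metricRegistry: MetricRegistry = new MetricRegistry()

  private val osBean = ManagementFactory.getOperatingSystemMXBean
    .asInstanceOf[OperatingSystemMXBean]

  metricRegistry.register(MetricRegistry.name("jvmCpuTime"), new Gauge[Long] {
    // getProcessCpuTime returns the CPU time used by this JVM, in nanoseconds
    override def getValue: Long = osBean.getProcessCpuTime
  })
}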

I’d like to add some feedback and a few random thoughts based on recent work on
SPARK-25228, SPARK-22190, SPARK-25277 and SPARK-25285:

- The “Dropwizard metrics” space currently appears a bit “crowded”; we could
probably benefit from adding a few configuration parameters to turn some of the
metrics on/off as needed (I see that this point is also raised in the
discussion on your PR 22381). A small configuration sketch follows after this list.

- Another point is that the metrics instrumentation is a bit scattered around
the code; it would be nice to have a central place where the available metrics
are listed (maybe just in the documentation).

- Testing of new metrics seems to be a bit of a manual process at the moment
(at least it was for me), which could be improved. Related to that, I believe
some recent work on adding new metrics has ended up with a minor side
effect/issue; details are in SPARK-25277.
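
On the configuration point: today the metrics system is driven by
metrics.properties (or the equivalent spark.metrics.conf properties), which
mostly controls sinks and a couple of optional sources rather than offering
per-source or per-metric switches. A rough sketch of what I have in mind, where
the last, commented-out property is hypothetical and does not exist today:

# Report all metrics over JMX (from metrics.properties.template)
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# Console sink, reporting every 10 seconds
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds

# Hypothetical per-source switch of the kind discussed above (does not exist today)
# executor.source.ExecutorMetrics.enabled=false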

Best regards,
Luca

From: Stavros Kontopoulos <stavros.kontopou...@lightbend.com>
Sent: Wednesday, September 12, 2018 22:35
To: Dev <dev@spark.apache.org>
Subject: [DISCUSS][CORE] Exposing application status metrics via a source

Hi all,

I have a PR https://github.com/apache/spark/pull/22381 that exposes application 
status
metrics (related jira: SPARK-25394).

So far metrics tooling needs to scrape the metrics REST API to get metrics like
job delay, stages failed, stages completed, etc.
From a devops perspective it is good to standardize on a unified way of gathering
metrics.
The need came up on the K8s side, where the JMX Prometheus exporter is commonly
used to scrape metrics for several components such as Kafka and Cassandra, but
the need is not limited to that.
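
As a concrete illustration of the two approaches (the host, port, application
id, agent jar and config file below are all placeholders):

# Polling the status REST API
curl http://<driver-host>:4040/api/v1/applications/<app-id>/jobs

# Exposing metrics over JMX and scraping them with the Prometheus JMX exporter,
# attached to the driver as a java agent
spark-submit ... \
  --conf "spark.driver.extraJavaOptions=-javaagent:/path/to/jmx_prometheus_javaagent.jar=8090:/path/to/config.yaml"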

Check this comment: https://github.com/apache/spark/pull/22381#issuecomment-420029771
"The rest api is great for UI and consolidated analytics, but monitoring 
through it is not as straightforward as when the data emits directly from the 
source like this. There is all kinds of nice context that we get when the data 
from this spark node is collected directly from the node itself, and not 
proxied through another collector / reporter. It is easier to build a 
monitoring data model across the cluster when node, jmx, pod, resource 
manifests, and spark data all align by virtue of coming from the same 
collector. Building a similar view of the cluster just from the rest api, as a 
comparison, is simply harder and quite challenging to do in general purpose 
terms."

The PR is OK to be merged, but the major concern here is the mirroring of the
metrics. I think that mirroring is OK, since people may not want to check the
UI and just want to integrate with JMX only (my use case) and gather the
metrics in Grafana (a common case out there).

Do any of the committers or the community have an opinion on this?
Is there agreement on moving forward with this? Note that the addition does not
change much and can always be refactored if we come up with a new plan for the
metrics story in the future.

Thanks,
Stavros
