GitHub user steveloughran reopened a pull request:
https://github.com/apache/spark/pull/9571
[SPARK-11373] [CORE] Add metrics to the History Server and FsHistoryProvider
This adds metrics to the history server, with the `FsHistoryProvider`
metering its load, performance and reliability.
see [SPARK-11373](https://issues.apache.org/jira/browse/SPARK-11373)
The `HistoryServer` sets up the codahale metrics for the Web under metrics/
with metrics/metrics behind metrics, metrics/health any health probes and
metrics/threads a thread dump. There's currently no attempt to hook up JMX,
etc. The Web servlets are the ones tests can easily hit and don't need
infrastructure, so are the good initial first step.
It then passes the metrics and health registries down to the providers in a
`ApplicationHistoryBinding` case class, via a new method
def start(binding: ApplicationHistoryBinding): Unit
The base class has implementation so that all existing providers will still
link properly; the base implementation currently checks and fails
the use of a binding case class is also to ensure that if new binding
information were added in future, existing implementations would still link.
The `FsHistoryProvider` implements the `start()` method, registering two
counters and two timers.
1. Number of update attempts and number of failed updates âand the same
for app UI loads.
2. Time for updates and app UI loads.
Points of note
* Why not use Spark's `MetricsSystem`? I did start off with that, but it
needs a `SparkContext` to run off, which the server doesn't have. Ideally that
would be way to go, as it would support all the spark conf -based metrics
setup. Someone who understands the `MetricsSystem` would need to get involved
here as would make for a more complex patch. In `FsHistoryProvider` the
registry information is all kept in a `Source` subclass for ease of future
migration to `MetricsSystem`.
* Why the extra `HealthRegistry`? It's a nice way of allowing providers to
indicate (possibly transient) health problems for monitoring tools/clients to
hit. For the FS provider it could maybe flag when there hadn't been any
successful update for a specified time period. (that could also be indicated by
having a counter of "seconds since last update" and let monitoring tools
monitor the counter value and act on it). Access control problems to the
directory is something else which may be considered a liveness problem: it
won't get better without human intervention
* The `FsHistoryProvider.start()` method should really take the thread
start code from from class constructor's `initialize()` method. This would
ensure that incomplete classes don't get called by spawned threads, and makes
it possible for test-time subclasses to skip thread startup. I've not attempted
to do that in this patch.
* No tests for this yet. Hitting the three metrics servlets in the
HistoryServer is the obvious route; the JSON payload of the metrics can be
parsed and scanned for relevant counters too.
* Part of the patch for `HistoryServerSuite` removes the call to
`HistoryServer.initialize()` the `before` clause. That was a duplicate call,
one which hit the re-entrancy tests on the provider & registry. As well as
cutting it, `HistoryServer.initialize()` has been made idempotent. That should
not be needed -but it will eliminate the problem arising again.
Once the SPARK-1537 YARN timeline server history provider is committed,
then I'll add metrics support there too. The YARN timeline provider would:
1. Add timers of REST operations as well as playback load times, which can
count network delays as well as JSON deserialization overhead.
2. Add a health check for connectivity too: the timeline server would be
unhealthy if connections to the timeline server were either blocking or
failing. And again, if there were security/auth problems, they'd be considered
non-recoverable.
3. Move thread launch under the `start()` method, with some test subclasses
disabling thread launch.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/steveloughran/spark feature/SPARK-11373
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9571.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9571
----
commit 2ae734fd358c5d33e75f92837c2e414dc620454a
Author: Steve Loughran <[email protected]>
Date: 2015-11-09T16:32:33Z
[SPARK-11373] First pass at adding metrics to the history server, with the
FsHistoryProvider counting
1. Number of update attempts and number of failed updates
2. Time for updates and app UI loads
The HistoryServer sets up the codahale metrics for the Web under metrics/
with metrics/metrics behind metrics, metrics/health any health probes and
metrics/threads a thread dump.
commit 31777992e151b66bfe53836d9de1e00ccc743a1e
Author: Steve Loughran <[email protected]>
Date: 2015-11-09T18:18:51Z
[SPARK-11373] tests and review; found and fixed a re-entrancy in
HistoryServerSuite which was causing problems with counter registration
commit 2d3066fbb09638626a11e3f68f88c4f0ecca591a
Author: Steve Loughran <[email protected]>
Date: 2015-11-09T19:26:30Z
[SPARK-11373] scala-style javadoc tuning in FsHistoryProvider
commit 8c2f0e35202d8382313847c2432c4f8dd050e41f
Author: Steve Loughran <[email protected]>
Date: 2015-11-26T21:18:38Z
SPARK-11373: move to MetricsSystem, though retaining/expanding health
checks, which aren't part of it
commit 64fc88c054226fd419152bfa76f4877de2672134
Author: Steve Loughran <[email protected]>
Date: 2015-11-26T21:33:28Z
scalacheck
commit 69c334bca42d9d26b0b8d9576853ddd390965c85
Author: Steve Loughran <[email protected]>
Date: 2015-11-26T21:57:42Z
SPARK-11373: cut out the notion of binding information; simplify provider
launch
commit d9ba0d2a33e458d89f7430915208798abacfe75d
Author: Steve Loughran <[email protected]>
Date: 2015-12-09T18:06:53Z
SPARK-11373 cut the health checks out
commit 8324cccc371ceb63918daa2fc64c7c665204deeb
Author: Steve Loughran <[email protected]>
Date: 2016-02-05T14:43:00Z
[SPARK-11373] tail end of rebase operation
commit b1d3baa2674d210a64820574be8955769bf50ee3
Author: Steve Loughran <[email protected]>
Date: 2016-02-08T12:37:40Z
[SPARK-11373] scalastyle and import ordering
commit c68823e787184675bae42cf83b0b794ca251ca7e
Author: Steve Loughran <[email protected]>
Date: 2016-02-12T13:18:40Z
SPARK-11373 finish review of merge with trunk; add new Timestamp gauge and
set to time of update/update success. Makes update time visible for
tests/users/ops
commit d16a3e9f757caf88c4e0595807b7b6e452646cf4
Author: Steve Loughran <[email protected]>
Date: 2016-04-26T17:50:44Z
[SPARK-11373] finish rebasing to master, correct tightened style checks
commit ecd0ceed776e0264cd3e846dbaaf6151c12640b6
Author: Steve Loughran <[email protected]>
Date: 2016-06-10T14:10:46Z
[SPARK-11373] Address review comments and add a new counter of events
played back during merge and U load
commit f086242e4c9ac75b828c34cae650415196119e10
Author: Steve Loughran <[email protected]>
Date: 2016-07-14T13:29:38Z
[SPARK-11373] more metrics, all sources have prefixes, common features
pulled up into HistoryMetricSource, including registration and prefix support.
The existing test case "incomplete apps get refreshed" has been extended to
check for metrics in the listings, alongside some specific probes for values in
the fs history provider (loads, load time average > 0)
commit 0cf532ec34e328e71b3495c64e0e15ed13a254b2
Author: Steve Loughran <[email protected]>
Date: 2016-07-14T13:35:20Z
[SPARK-11373] add a check that before there's been any events loaded, the
average load time is "0" and not a division by zero error. This validates the
logic in the relevant lambda-expression
commit c8b4912dccd7bbff1c3457592227b07de8c7e9da
Author: Steve Loughran <[email protected]>
Date: 2016-07-14T13:48:50Z
[SPARK-11373] style check in the javadocs
commit aa1aada8c898e773dc8af88f580a906113e95da1
Author: Steve Loughran <[email protected]>
Date: 2016-07-14T13:57:18Z
[SPARK-11373] IDE had mysteriously re-ordered imports in the same line.
Interesting, and not in a good way
commit 745d532b67b64149e86d2fdf0464cab616023eac
Author: Steve Loughran <[email protected]>
Date: 2016-08-16T18:50:19Z
[SPARK-11373] address latest comments
-nits
-recommended minor changes
+pull out lambda gauge and HistoryMetricSource into their own file,
HistoryMetricSource.scala. Added tests to go with this, and made the toString
call robust against failing gauges
+fixed a scaladoc warning in HistoryServer
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]