GitHub user steveloughran opened a pull request:
https://github.com/apache/spark/pull/9571
[SPARK-11373] [CORE] WiP Add metrics to the History Server and providers
This adds metrics to the history server, with the `FsHistoryProvider`
metering its load, performance and reliability.
The HistoryServer exposes the codahale metrics servlets over the Web under
metrics/: metrics/metrics serves the metrics, metrics/health any health probes
and metrics/threads a thread dump. There's currently no attempt to hook up JMX,
etc. The Web servlets are the ones tests can easily hit and don't need
infrastructure, so they are a good first step.
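
For illustration, mounting the three servlets could look roughly like the sketch below. It assumes the codahale `metrics-servlets` and Jetty servlet artifacts are on the classpath; the object name `MetricsServletsSketch` and the exact path layout are assumptions made for this example, not code from the patch.

```scala
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.health.HealthCheckRegistry
import com.codahale.metrics.servlets.{HealthCheckServlet, MetricsServlet, ThreadDumpServlet}
import org.eclipse.jetty.servlet.{ServletContextHandler, ServletHolder}

// Hypothetical helper: the name and the paths are assumptions for illustration.
object MetricsServletsSketch {

  /** Builds a Jetty handler that serves the three servlets under /metrics. */
  def handler(
      metrics: MetricRegistry,
      health: HealthCheckRegistry): ServletContextHandler = {
    val context = new ServletContextHandler()
    context.setContextPath("/metrics")
    // /metrics/metrics: JSON dump of every registered metric
    context.addServlet(new ServletHolder(new MetricsServlet(metrics)), "/metrics")
    // /metrics/health: result of every registered health check
    context.addServlet(new ServletHolder(new HealthCheckServlet(health)), "/health")
    // /metrics/threads: plain-text thread dump
    context.addServlet(new ServletHolder(new ThreadDumpServlet()), "/threads")
    context
  }
}
```

The returned handler still has to be attached to whatever Jetty server the HistoryServer UI runs on; that wiring is left out here.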
It then passes the metrics and health registries down to the providers in a
`ApplicationHistoryBinding` case class, via a new method
def start(binding: ApplicationHistoryBinding): Unit
The base class has an implementation so that all existing providers will still
link properly; the base implementation currently checks and fails.
The use of a binding case class is also to ensure that if new binding
information were added in future, existing implementations would still link.
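
A minimal sketch of that shape, assuming the binding carries just the two codahale registries. The class name `HistoryProviderSketch`, the field names and the "reject a second call" check are assumptions for illustration; the actual check in the patch may differ.

```scala
import com.codahale.metrics.MetricRegistry
import com.codahale.metrics.health.HealthCheckRegistry

/** Binding information handed to a provider; further fields could be added later. */
case class ApplicationHistoryBinding(
    metrics: MetricRegistry,
    health: HealthCheckRegistry)

/** Hypothetical stand-in for the provider base class. */
abstract class HistoryProviderSketch {
  private var binding: Option[ApplicationHistoryBinding] = None

  /** Default implementation, so subclasses that never override it still link. */
  def start(binding: ApplicationHistoryBinding): Unit = {
    // Assumed check: starting the same provider twice is treated as an error.
    require(this.binding.isEmpty, "provider already started")
    this.binding = Some(binding)
  }

  def isStarted: Boolean = binding.isDefined
}

/** An existing provider that knows nothing about bindings needs no changes. */
class LegacyProviderSketch extends HistoryProviderSketch
```

Because the method takes one binding object rather than a list of registries, its signature stays the same if more binding information is added.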
The `FsHistoryProvider` implements the `start()` method, registering two
counters and two timers.
1. Number of update attempts and number of failed updates, and the same
for app UI loads.
2. Time for updates and app UI loads.
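
As a rough sketch of how such counters and timers can be registered and driven with the codahale API: the class name `ProviderMetricsSketch`, the metric names and the `timeUpdate` helper below are all hypothetical, chosen for this example rather than taken from the patch.

```scala
import scala.util.control.NonFatal

import com.codahale.metrics.{Counter, MetricRegistry, Timer}

/** Hypothetical holder for a provider's metrics; all metric names are made up. */
class ProviderMetricsSketch(registry: MetricRegistry, prefix: String) {
  val updateCount: Counter =
    registry.counter(MetricRegistry.name(prefix, "update.count"))
  val updateFailureCount: Counter =
    registry.counter(MetricRegistry.name(prefix, "update.failure.count"))
  val updateTimer: Timer =
    registry.timer(MetricRegistry.name(prefix, "update.timer"))
  val appUiLoadTimer: Timer =
    registry.timer(MetricRegistry.name(prefix, "appui.load.timer"))

  /** Counts the attempt, times the body, and counts a failure if it throws. */
  def timeUpdate[T](body: => T): T = {
    updateCount.inc()
    val context = updateTimer.time()
    try {
      body
    } catch {
      case NonFatal(e) =>
        updateFailureCount.inc()
        throw e
    } finally {
      context.stop()
    }
  }
}
```

A provider would wrap its scan in `timeUpdate { ... }`; the same pattern applies to app UI loads with the second timer.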
Points of note
* Why not use Spark's `MetricsSystem`? I did start off with that, but it
needs a `SparkContext` to run off, which the server doesn't have. Ideally that
would be the way to go, as it would support all the Spark conf-based metrics
setup. Someone who understands the `MetricsSystem` would need to get involved
here, as it would make for a more complex patch. In `FsHistoryProvider` the
registry information is all kept in a `Source` subclass for ease of future
migration to `MetricsSystem`.
* Why the extra `HealthRegistry`? It's a nice way of allowing providers to
indicate (possibly transient) health problems for monitoring tools/clients to
hit. For the FS provider it could maybe flag when there hadn't been any
successful update for a specified time period. (That could also be indicated by
having a counter of "seconds since last update" and letting monitoring tools
monitor the counter value and act on it.) Access control problems with the
directory are something else which may be considered a liveness problem: they
won't get better without human intervention.
* The `FsHistoryProvider.start()` method should really take the thread
start code from the class constructor's `initialize()` method. This would
ensure that incomplete classes don't get called by spawned threads, and makes
it possible for test-time subclasses to skip thread startup. I've not attempted
to do that in this patch.
* No tests for this yet. Hitting the three metrics servlets in the
HistoryServer is the obvious route; the JSON payload of the metrics can be
parsed and scanned for relevant counters too.
* Part of the patch for `HistoryServerSuite` removes the call to
`HistoryServer.initialize()` in the `before` clause. That was a duplicate call,
one which hit the re-entrancy tests on the provider & registry. As well as
cutting it, `HistoryServer.initialize()` has been made idempotent. That should
not be needed, but it will stop the problem arising again.
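
To make the health-check idea from the points above concrete, here is a minimal sketch of a codahale `HealthCheck` that turns unhealthy when no update has succeeded within a given period. The class name `UpdateFreshnessCheck`, the injectable clock and the registration name are assumptions for illustration; nothing like this is in the patch yet.

```scala
import java.util.concurrent.atomic.AtomicLong

import com.codahale.metrics.health.{HealthCheck, HealthCheckRegistry}
import com.codahale.metrics.health.HealthCheck.Result

/**
 * Hypothetical check: unhealthy once more than `maxAgeMillis` has passed
 * since the last recorded successful update. The clock is injectable so
 * tests can control time; construction counts as the first "success".
 */
class UpdateFreshnessCheck(
    maxAgeMillis: Long,
    clock: () => Long = () => System.currentTimeMillis())
  extends HealthCheck {

  private val lastSuccess = new AtomicLong(clock())

  /** To be called by the provider after every successful update. */
  def recordSuccess(): Unit = lastSuccess.set(clock())

  override protected def check(): Result = {
    val age = clock() - lastSuccess.get()
    if (age <= maxAgeMillis) {
      Result.healthy()
    } else {
      Result.unhealthy(s"no successful update for $age ms")
    }
  }
}

object UpdateFreshnessCheck {
  /** Registers a check under a made-up name and returns it to the caller. */
  def register(health: HealthCheckRegistry, maxAgeMillis: Long): UpdateFreshnessCheck = {
    val check = new UpdateFreshnessCheck(maxAgeMillis)
    health.register("history.update.freshness", check)
    check
  }
}
```

Once registered, the result shows up in the metrics/health servlet output alongside any other checks in the same registry.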
Once the SPARK-1537 YARN timeline server history provider is committed,
then I'll add metrics support there too. The YARN timeline provider would:
1. Add timers of REST operations as well as playback load times, which can
count network delays as well as JSON deserialization overhead.
2. Add a health check for connectivity too: the provider would be reported as
unhealthy if connections to the timeline server were either blocking or
failing. And again, if there were security/auth problems, they'd be considered
non-recoverable.
3. Move thread launch under the `start()` method, with some test subclasses
disabling thread launch.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/steveloughran/spark feature/SPARK-11373
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9571.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9571
----
commit 7ab9e3a3d2fdbcc11663176703874c573cfeafb8
Author: Steve Loughran <[email protected]>
Date: 2015-11-09T16:32:33Z
[SPARK-11373] First pass at adding metrics to the history server, with the
FsHistoryProvider counting
1. Number of update attempts and number of failed updates
2. Time for updates and app UI loads
The HistoryServer sets up the codahale metrics for the Web under metrics/
with metrics/metrics behind metrics, metrics/health any health probes and
metrics/threads a thread dump.
commit cb7cddb0443d4d8fac7d63e604fe303f5f57d5cd
Author: Steve Loughran <[email protected]>
Date: 2015-11-09T18:18:51Z
[SPARK-11373] tests and review; found and fixed a re-entrancy in
HistoryServerSuite which was causing problems with counter registration
----