GitHub user steveloughran opened a pull request:

    https://github.com/apache/spark/pull/9571

    [SPARK-11373] [CORE] WiP Add metrics to the History Server and providers

    
    This adds metrics to the history server, with the `FsHistoryProvider` 
metering its load, performance and reliability.
    
    The HistoryServer sets up the Codahale metrics for the web under metrics/: 
metrics/metrics serves the metrics, metrics/health any health probes, and 
metrics/threads a thread dump. There is currently no attempt to hook up JMX, 
etc. The web servlets are the ones tests can easily hit, and they need no 
extra infrastructure, so they are a good first step.
    
    It then passes the metrics and health registries down to the providers in 
an `ApplicationHistoryBinding` case class, via a new method:
    
        def start(binding: ApplicationHistoryBinding): Unit
    
    The base class has an implementation so that all existing providers will 
still link properly; the base implementation currently checks and fails. The 
use of a binding case class is also intended to ensure that if new binding 
information were added in future, existing implementations would still link.
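    A minimal sketch of how such a binding and default `start()` might fit 
together. The registry stubs stand in for Codahale's `MetricRegistry` and 
`HealthCheckRegistry` so that the example has no dependencies; the field 
names, the `LegacyProvider` class and the choice to reject a second `start()` 
call are assumptions for illustration, not taken from the patch.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.collection.concurrent.TrieMap

// Dependency-free stand-ins for the Codahale registries (hypothetical).
final class MetricRegistryStub { val counters = TrieMap.empty[String, Long] }
final class HealthRegistryStub { val checks = TrieMap.empty[String, () => Boolean] }

// Everything a provider receives from the server, bundled in one case class,
// so start() keeps the same signature if more binding information is added.
case class ApplicationHistoryBinding(
    metrics: MetricRegistryStub,
    health: HealthRegistryStub)

abstract class ApplicationHistoryProvider {
  private val started = new AtomicBoolean(false)

  // Default implementation, so providers that never override start() still link.
  // Assumption: the check performed here is a guard against a second start().
  def start(binding: ApplicationHistoryBinding): Unit = {
    require(binding != null, "null binding")
    if (!started.compareAndSet(false, true)) {
      throw new IllegalStateException("provider already started")
    }
  }

  def isStarted: Boolean = started.get()
}

// An existing provider that predates start() and does not override it.
class LegacyProvider extends ApplicationHistoryProvider
```

    With this shape, a provider written before `start()` existed simply 
inherits the default, while a provider that wants metrics overrides it and 
registers against `binding.metrics`.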
    
    The `FsHistoryProvider` implements the `start()` method, registering two 
counters and two timers.
    
    1. Number of update attempts and number of failed updates, and the same 
for app UI loads.
    2. Time for updates and app UI loads.
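    The attempt/failure counters and the timer for updates could be wired 
together roughly as follows. This is only a sketch: `SimpleCounter` and 
`SimpleTimer` are dependency-free stand-ins for the Codahale `Counter` and 
`Timer`, and the `HistoryProviderMetrics` holder and its metric names are 
hypothetical, not the names used in the patch.

```scala
import java.util.concurrent.atomic.AtomicLong

// Stand-in for a Codahale Counter.
final class SimpleCounter {
  private val value = new AtomicLong(0)
  def inc(): Unit = { value.incrementAndGet(); () }
  def count: Long = value.get()
}

// Stand-in for a Codahale Timer: records call count and total elapsed time.
final class SimpleTimer {
  private val calls = new AtomicLong(0)
  private val totalNanos = new AtomicLong(0)

  // Time the body, recording the duration whether it succeeds or throws.
  def time[T](body: => T): T = {
    val begin = System.nanoTime()
    try body finally {
      totalNanos.addAndGet(System.nanoTime() - begin)
      calls.incrementAndGet()
    }
  }

  def count: Long = calls.get()
  def totalTimeNanos: Long = totalNanos.get()
}

// Hypothetical holder grouping a provider's update metrics under one prefix.
final class HistoryProviderMetrics(prefix: String) {
  val updateCount = new SimpleCounter
  val updateFailureCount = new SimpleCounter
  val updateTimer = new SimpleTimer

  def names: Seq[String] =
    Seq("update.count", "update.failure.count", "update.timer").map(n => s"$prefix.$n")

  // Count the attempt, time it, and count a failure if the body throws.
  def meteredUpdate[T](body: => T): T = {
    updateCount.inc()
    updateTimer.time {
      try body catch {
        case e: Throwable =>
          updateFailureCount.inc()
          throw e
      }
    }
  }
}
```

    Wrapping the update in one helper keeps the three metrics consistent: 
every attempt is counted and timed, and failures are a subset of attempts. 
The same pattern would apply to app UI loads.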
    
    Points of note
    
    * Why not use Spark's `MetricsSystem`? I did start off with that, but it 
needs a `SparkContext` to run off, which the server doesn't have. Ideally that 
would be the way to go, as it would support all the Spark conf-based metrics 
setup. Someone who understands the `MetricsSystem` would need to get involved 
here, as it would make for a more complex patch. In `FsHistoryProvider` the 
registry information is all kept in a `Source` subclass for ease of future 
migration to `MetricsSystem`.
    * Why the extra `HealthRegistry`? It's a nice way of allowing providers to 
indicate (possibly transient) health problems for monitoring tools/clients to 
hit. For the FS provider it could perhaps flag when there has been no 
successful update for a specified time period. (That could also be indicated 
by a "seconds since last update" counter, with monitoring tools watching the 
counter value and acting on it.) Access control problems with the directory 
are something else which may be considered a liveness problem: they won't get 
better without human intervention.
    * The `FsHistoryProvider.start()` method should really take over the thread 
start code from the class constructor's `initialize()` method. This would 
ensure that incompletely constructed instances don't get called by spawned 
threads, and would make it possible for test-time subclasses to skip thread 
startup. I've not attempted to do that in this patch.
    * No tests for this yet. Hitting the three metrics servlets in the 
HistoryServer is the obvious route; the JSON payload of the metrics can be 
parsed and scanned for relevant counters too. 
    * Part of the patch for `HistoryServerSuite` removes the call to 
`HistoryServer.initialize()` in the `before` clause. That was a duplicate 
call, one which hit the re-entrancy tests on the provider and registry. As 
well as cutting it, `HistoryServer.initialize()` has been made idempotent. 
That should not be needed, but it will prevent the problem from arising again.
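    The staleness probe suggested in the `HealthRegistry` point above could be 
sketched like this. It is not part of the patch: the `Health` types stand in 
for a Codahale health check result, and `UpdateStalenessCheck`, its 
parameters and the injected clock are all hypothetical.

```scala
import java.util.concurrent.atomic.AtomicLong

// Outcome of a probe; a stand-in for a Codahale health check result.
sealed trait Health
case object Healthy extends Health
final case class Unhealthy(message: String) extends Health

// Hypothetical check: unhealthy once no update has succeeded within maxAgeMillis.
// The clock is injected so the behaviour can be tested without sleeping.
final class UpdateStalenessCheck(maxAgeMillis: Long, clock: () => Long) {
  private val lastSuccess = new AtomicLong(clock())

  // Called by the provider after every successful update.
  def updateSucceeded(): Unit = lastSuccess.set(clock())

  // The alternative approach: expose the age and let monitoring tools decide.
  def secondsSinceLastUpdate: Long = (clock() - lastSuccess.get()) / 1000

  def check(): Health = {
    val age = clock() - lastSuccess.get()
    if (age <= maxAgeMillis) Healthy
    else Unhealthy(s"no successful update for $age ms (limit $maxAgeMillis ms)")
  }
}
```

    A provider would call `updateSucceeded()` at the end of each successful 
scan and register `check()` as a health probe; the gauge-style 
`secondsSinceLastUpdate` covers the case where the threshold is better left 
to the monitoring tool.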
    
    Once the SPARK-1537 YARN timeline server history provider is committed, 
then I'll add metrics support there too. The YARN timeline provider would:
    
    1. Add timers of REST operations as well as playback load times, which can 
count network delays as well as JSON deserialization overhead. 
    2. Add a health check for connectivity too: the timeline server would be 
considered unhealthy if connections to it were either blocking or failing. 
And again, if there were security/auth problems, they'd be considered 
non-recoverable.
    3. Move thread launch under the `start()` method, with some test subclasses 
disabling thread launch.
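
    Moving thread launch under `start()`, with a hook that test subclasses can 
override, might look roughly like the following. The class names, the 
`launchThreads()` hook and the guard against a repeated `start()` are 
assumptions made for this sketch; neither provider is structured this way in 
the patch.

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical provider: the constructor only sets up state; threads start in start().
class PollingProvider {
  private val started = new AtomicBoolean(false)
  @volatile private var poller: Option[Thread] = None

  // Hook that test subclasses can override to skip thread startup.
  protected def launchThreads(): Option[Thread] = {
    val t = new Thread(new Runnable {
      override def run(): Unit = {
        try {
          while (!Thread.currentThread().isInterrupted) Thread.sleep(50)
        } catch {
          case _: InterruptedException => () // stop() was called
        }
      }
    }, "history-poller")
    t.setDaemon(true)
    t.start()
    Some(t)
  }

  // Guarded so that a repeated call does not launch a second thread (an assumption).
  def start(): Unit = {
    if (started.compareAndSet(false, true)) {
      poller = launchThreads()
    }
  }

  def stop(): Unit = {
    poller.foreach { t => t.interrupt(); t.join(1000) }
    poller = None
  }

  def threadRunning: Boolean = poller.exists(_.isAlive)
}

// Test-time subclass that disables thread launch.
class NoThreadProvider extends PollingProvider {
  override protected def launchThreads(): Option[Thread] = None
}
```

    Because no thread exists until `start()` runs, the background code can 
never observe a partially constructed instance, and tests can drive the 
provider synchronously through `NoThreadProvider`.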


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark feature/SPARK-11373

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9571.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9571
    
----
commit 7ab9e3a3d2fdbcc11663176703874c573cfeafb8
Author: Steve Loughran <[email protected]>
Date:   2015-11-09T16:32:33Z

    [SPARK-11373] First pass at adding metrics to the history server, with the 
FsHistoryProvider counting
    1. Number of update attempts and number of failed updates
    2. Time for updates and app UI loads
    
    The HistoryServer sets up the codahale metrics for the Web under metrics/ 
with metrics/metrics behind metrics, metrics/health any health probes and 
metrics/threads a thread dump.

commit cb7cddb0443d4d8fac7d63e604fe303f5f57d5cd
Author: Steve Loughran <[email protected]>
Date:   2015-11-09T18:18:51Z

    [SPARK-11373] tests and review; found and fixed a re-entrancy in 
HistoryServerSuite which was causing problems with counter registration

----

