GitHub user steveloughran reopened a pull request:

    https://github.com/apache/spark/pull/9571

    [SPARK-11373] [CORE] Add metrics to the History Server and FsHistoryProvider

    This adds metrics to the history server, with the `FsHistoryProvider` 
metering its load, performance and reliability.
    
    see [SPARK-11373](https://issues.apache.org/jira/browse/SPARK-11373)
    
    The `HistoryServer` sets up the codahale metrics for the Web under metrics/ 
with metrics/metrics behind metrics, metrics/health any health probes and 
metrics/threads a thread dump. There's currently no attempt to  hook up JMX, 
etc. The Web servlets are the ones tests can easily hit and don't need 
infrastructure, so are the good initial first step.
    
    It then passes the metrics and health registries down to the providers in a 
`ApplicationHistoryBinding` case class, via a new method
    
        def start(binding: ApplicationHistoryBinding): Unit
    
    The base class has implementation so that all existing providers will still 
link properly; the base implementation currently checks and fails
     the use of a binding case class is also to ensure that if new binding 
information were added in future, existing implementations would still link.
    
    The `FsHistoryProvider` implements the `start()` method, registering two 
counters and two timers.
    
    1. Number of update attempts and number of failed updates —and the same 
for app UI loads.
    2. Time for updates and app UI loads.
    
    Points of note
    
    * Why not use Spark's `MetricsSystem`? I did start off with that, but it 
needs a `SparkContext` to run off, which the server doesn't have. Ideally that 
would be way to go, as it would support all the spark conf -based metrics 
setup. Someone who understands the `MetricsSystem` would need to get involved 
here as would make for a more complex patch. In `FsHistoryProvider` the 
registry information is all kept in a `Source` subclass for ease of future 
migration to `MetricsSystem`.
    * Why the extra `HealthRegistry`? It's a nice way of allowing providers to 
indicate (possibly transient) health problems for monitoring tools/clients to 
hit. For the FS provider it could maybe flag when there hadn't been any 
successful update for a specified time period. (that could also be indicated by 
having a counter of "seconds since last update" and let monitoring tools 
monitor the counter value and act on it). Access control problems to the 
directory is something else which may be considered a liveness problem: it 
won't get better without human intervention
    * The `FsHistoryProvider.start()` method should really take the thread 
start code from from class constructor's `initialize()` method. This would 
ensure that incomplete classes don't get called by spawned threads, and makes 
it possible for test-time subclasses to skip thread startup. I've not attempted 
to do that in this patch.
    * No tests for this yet. Hitting the three metrics servlets in the 
HistoryServer is the obvious route; the JSON payload of the metrics can be 
parsed and scanned for relevant counters too. 
    * Part of the patch for `HistoryServerSuite` removes the call to 
`HistoryServer.initialize()` the `before` clause. That was a duplicate call, 
one which hit the re-entrancy tests on the provider & registry. As well as 
cutting it, `HistoryServer.initialize()` has been made idempotent. That should 
not be needed -but it will eliminate the problem arising again.
    
    Once the SPARK-1537 YARN timeline server history provider is committed, 
then I'll add metrics support there too. The YARN timeline provider would:
    
    1. Add timers of REST operations as well as playback load times, which can 
count network delays as well as JSON deserialization overhead. 
    2. Add a health check for connectivity too: the timeline server would be 
unhealthy if connections to the timeline server were either blocking or 
failing. And again, if there were security/auth problems, they'd be considered 
non-recoverable.
    3. Move thread launch under the `start()` method, with some test subclasses 
disabling thread launch.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/steveloughran/spark feature/SPARK-11373

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9571.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9571
    
----
commit 2ae734fd358c5d33e75f92837c2e414dc620454a
Author: Steve Loughran <[email protected]>
Date:   2015-11-09T16:32:33Z

    [SPARK-11373] First pass at adding metrics to the history server, with the 
FsHistoryProvider counting
    1. Number of update attempts and number of failed updates
    2. Time for updates and app UI loads
    
    The HistoryServer sets up the codahale metrics for the Web under metrics/ 
with metrics/metrics behind metrics, metrics/health any health probes and 
metrics/threads a thread dump.

commit 31777992e151b66bfe53836d9de1e00ccc743a1e
Author: Steve Loughran <[email protected]>
Date:   2015-11-09T18:18:51Z

    [SPARK-11373] tests and review; found and fixed a re-entrancy in 
HistoryServerSuite which was causing problems with counter registration

commit 2d3066fbb09638626a11e3f68f88c4f0ecca591a
Author: Steve Loughran <[email protected]>
Date:   2015-11-09T19:26:30Z

    [SPARK-11373] scala-style javadoc tuning in FsHistoryProvider

commit 8c2f0e35202d8382313847c2432c4f8dd050e41f
Author: Steve Loughran <[email protected]>
Date:   2015-11-26T21:18:38Z

    SPARK-11373: move to MetricsSystem, though retaining/expanding health 
checks, which aren't part of it

commit 64fc88c054226fd419152bfa76f4877de2672134
Author: Steve Loughran <[email protected]>
Date:   2015-11-26T21:33:28Z

    scalacheck

commit 69c334bca42d9d26b0b8d9576853ddd390965c85
Author: Steve Loughran <[email protected]>
Date:   2015-11-26T21:57:42Z

    SPARK-11373: cut out the notion of binding information; simplify provider 
launch

commit d9ba0d2a33e458d89f7430915208798abacfe75d
Author: Steve Loughran <[email protected]>
Date:   2015-12-09T18:06:53Z

    SPARK-11373 cut the health checks out

commit 8324cccc371ceb63918daa2fc64c7c665204deeb
Author: Steve Loughran <[email protected]>
Date:   2016-02-05T14:43:00Z

    [SPARK-11373] tail end of rebase operation

commit b1d3baa2674d210a64820574be8955769bf50ee3
Author: Steve Loughran <[email protected]>
Date:   2016-02-08T12:37:40Z

    [SPARK-11373] scalastyle and import ordering

commit c68823e787184675bae42cf83b0b794ca251ca7e
Author: Steve Loughran <[email protected]>
Date:   2016-02-12T13:18:40Z

    SPARK-11373 finish review of merge with trunk; add new Timestamp gauge and 
set to time of update/update success. Makes update time visible for 
tests/users/ops

commit d16a3e9f757caf88c4e0595807b7b6e452646cf4
Author: Steve Loughran <[email protected]>
Date:   2016-04-26T17:50:44Z

    [SPARK-11373] finish rebasing to master, correct tightened style checks

commit ecd0ceed776e0264cd3e846dbaaf6151c12640b6
Author: Steve Loughran <[email protected]>
Date:   2016-06-10T14:10:46Z

    [SPARK-11373] Address review comments and add a new counter of events 
played back during merge and U load

commit f086242e4c9ac75b828c34cae650415196119e10
Author: Steve Loughran <[email protected]>
Date:   2016-07-14T13:29:38Z

    [SPARK-11373] more metrics, all sources have prefixes, common features 
pulled up into HistoryMetricSource, including registration and prefix support. 
The existing test case "incomplete apps get refreshed" has been extended to 
check for metrics in the listings, alongside some specific probes for values in 
the fs history provider (loads, load time average > 0)

commit 0cf532ec34e328e71b3495c64e0e15ed13a254b2
Author: Steve Loughran <[email protected]>
Date:   2016-07-14T13:35:20Z

    [SPARK-11373] add a check that before there's been any events loaded, the 
average load time is "0" and not a division by zero error. This validates the 
logic in the relevant lambda-expression

commit c8b4912dccd7bbff1c3457592227b07de8c7e9da
Author: Steve Loughran <[email protected]>
Date:   2016-07-14T13:48:50Z

    [SPARK-11373] style check in the javadocs

commit aa1aada8c898e773dc8af88f580a906113e95da1
Author: Steve Loughran <[email protected]>
Date:   2016-07-14T13:57:18Z

    [SPARK-11373] IDE had mysteriously re-ordered imports in the same line. 
Interesting, and not in a good way

commit 745d532b67b64149e86d2fdf0464cab616023eac
Author: Steve Loughran <[email protected]>
Date:   2016-08-16T18:50:19Z

    [SPARK-11373] address latest comments
    -nits
    -recommended minor changes
    +pull out lambda gauge and HistoryMetricSource into their own file, 
HistoryMetricSource.scala. Added tests to go with this, and made the toString 
call robust against failing gauges
    +fixed a scaladoc warning in HistoryServer

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to