Github user steveloughran commented on the issue:
https://github.com/apache/spark/pull/9571
This patch adds separate per-event averages of the load times and the merge
times; these show a ~2x difference between the two in the test case.
These `.time` gauges are little lambda expressions evaluated whenever the
gauge value is extracted; they divide the total load/merge durations by the
event count. The Timer metrics don't provide enough data here, because they
support various decaying reservoirs/windows for their time values, not
whole-life-of-app durations.
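
To illustrate the idea, here is a minimal sketch of a lazily evaluated average gauge. It is written in Python rather than the Scala of the patch, and `Counter`, `LambdaGauge` and `average` are hypothetical stand-ins, not the metrics library's classes. Returning 0 before any events and using integer division are assumptions; the latter is at least consistent with the integer values in the log.

```python
from typing import Callable


class Counter:
    """Monotonic counter, standing in for a metrics-library counter."""

    def __init__(self) -> None:
        self.count = 0

    def inc(self, n: int = 1) -> None:
        self.count += n


class LambdaGauge:
    """Gauge whose value is computed only when it is read."""

    def __init__(self, expression: Callable[[], int]) -> None:
        self._expression = expression

    @property
    def value(self) -> int:
        return self._expression()


def average(total: Counter, events: Counter) -> int:
    """Whole-life average; 0 (an assumed default) until an event is counted."""
    return total.count // events.count if events.count > 0 else 0


load_duration = Counter()  # total load time over the life of the app
event_count = Counter()    # total events replayed over the life of the app

replay_time = LambdaGauge(lambda: average(load_duration, event_count))

before = replay_time.value   # no loads yet: guarded, so no ZeroDivisionError
load_duration.inc(38244482)  # totals taken from the test log
event_count.inc(103)
after = replay_time.value    # 38244482 // 103
```

Because the gauge only holds a reference to the two counters, nothing is recomputed on each load; the division happens when a reporter or a test reads the value.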
There's some base class metric support for registration (everything is
prefixed) and testability. There are now tests in `HistoryServerSuite."incomplete
apps get refreshed"` for metrics being in the dumped list, and for specific
values, especially averages (that they increase, and that they don't trigger
division-by-0 exceptions before there have been any loads).
This is what the metrics look like after the tests (from a log of the
toString value).
The `.last.attempted` values are `System.currentTimeMillis` timestamps of
the operations. I'd considered a "time-since" gauge, but after some offline
discussion with Allen Wittenauer, went for the absolute values; I'll leave it
to the management tooling to work out elapsed times from absolute values if
they want to use that for alerts or UIs.
```
16/07/14 14:17:03.837 ScalaTest-main-running-HistoryServerSuite INFO
HistoryServerSuite: Metrics:
Metrics for history:
Counters
Gauges
Metrics for history.fs:
Counters
history.provider.appui.event.count = 103
history.provider.appui.load.count = 6
history.provider.appui.load.duration = 38244482
history.provider.appui.load.failure.count = 0
history.provider.appui.load.not-found.count = 0
history.provider.history.merge.duration = 16138193
history.provider.history.merge.event.count = 83
history.provider.update.count = 5
history.provider.update.failure.count = 0
Gauges
history.provider.appui.event.replay.time = 371305
history.provider.history.merge.event.time = 194436
history.provider.update.last.attempted = 1468502223726
history.provider.update.last.succeeded = 1468502223000
Metrics for application.cache:
Counters
history.cache.eviction.count = 3
history.cache.load.count = 4
history.cache.lookup.count = 58
history.cache.lookup.failure.count = 0
history.cache.update.probe.count = 57
history.cache.update.triggered.count = 3
Gauges
```
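
As a sanity check on the dump above, the two `.time` gauges can be recomputed from the logged counters, and an elapsed time can be derived from the absolute timestamps. This is only a sketch: the numbers are copied from the log, while `elapsed_ms` is a hypothetical helper of the kind a monitoring tool might use, not part of the patch.

```python
import time

# Values copied from the log above.
load_duration = 38244482    # history.provider.appui.load.duration
event_count = 103           # history.provider.appui.event.count
merge_duration = 16138193   # history.provider.history.merge.duration
merge_event_count = 83      # history.provider.history.merge.event.count

replay_time = load_duration // event_count        # 371305, matches the gauge
merge_time = merge_duration // merge_event_count  # 194436, matches the gauge
ratio = replay_time / merge_time                  # roughly 1.9, i.e. the ~2x

last_attempted = 1468502223726  # history.provider.update.last.attempted
last_succeeded = 1468502223000  # history.provider.update.last.succeeded


def elapsed_ms(now_ms: int, then_ms: int) -> int:
    """Elapsed milliseconds between an absolute timestamp and 'now'."""
    return now_ms - then_ms


# A tool would pass its own clock; here, the current wall-clock time.
since_last_attempt = elapsed_ms(int(time.time() * 1000), last_attempted)
```

Keeping the gauge absolute means the consumer chooses the reference clock and the alert threshold, which is the trade-off described above.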
One thing to consider: gauges of the number of complete and incomplete
applications? I know the REST UI gives this, but only indirectly (you call, you
count the size of the lists). Doing it in the metrics provides something that
could be monitored or probed in tests without making REST calls.
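
A rough sketch of what such gauges might look like, again in Python for brevity. Everything here is hypothetical: the `AppInfo` type, the `count_gauges` helper and the two metric names are invented for illustration and are not part of the patch or of Spark.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AppInfo:
    """Hypothetical stand-in for an entry in the history listing."""
    app_id: str
    completed: bool


def count_gauges(listing: Callable[[], List[AppInfo]]) -> Dict[str, Callable[[], int]]:
    """Build gauges that count the listing each time they are read."""
    return {
        # Metric names are assumptions, modelled on the prefixes in the log.
        "history.provider.applications.complete.count":
            lambda: sum(1 for app in listing() if app.completed),
        "history.provider.applications.incomplete.count":
            lambda: sum(1 for app in listing() if not app.completed),
    }


apps = [AppInfo("app-1", True), AppInfo("app-2", False), AppInfo("app-3", False)]
gauges = count_gauges(lambda: apps)

complete = gauges["history.provider.applications.complete.count"]()
incomplete = gauges["history.provider.applications.incomplete.count"]()
```

A test could then assert on these two values directly instead of fetching and counting the REST listings.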