GitHub user steveloughran commented on the issue:

    https://github.com/apache/spark/pull/9571
  
    This patch adds separate average values of the load times vs merge times 
per event; in the test case, per-event replay on load takes ~2x as long as 
per-event merge.
    
    These `.time` gauges are little lambda expressions evaluated whenever the 
gauge value is extracted; they divide the total load/merge durations by the 
event count. The Timer metrics don't provide enough data here, because they 
support various decaying reservoirs/windows for their time values, not 
whole-life-of-app durations.
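    
    A minimal sketch of that kind of gauge, in plain Java rather than the 
patch's own code: `AverageGauge` and its method name are hypothetical, and a 
`LongSupplier` stands in for the metrics library's gauge type. Returning 0 
while the count is still 0 is an assumption about how the empty case is handled.
    
    ```java
    import java.util.concurrent.atomic.AtomicLong;
    import java.util.function.LongSupplier;

    /** Hypothetical helper, not from the patch: an average derived from two counters. */
    final class AverageGauge {
        private AverageGauge() {}

        /** Returns a supplier that is re-evaluated every time the gauge value is read. */
        static LongSupplier of(AtomicLong totalDuration, AtomicLong eventCount) {
            return () -> {
                long count = eventCount.get();
                // Assumed empty-case behaviour: report 0 instead of dividing by zero.
                return count == 0 ? 0L : totalDuration.get() / count;
            };
        }
    }
    ```
    
    With the counter values from the dump in this comment 
(`appui.load.duration = 38244482`, `appui.event.count = 103`), this integer 
division gives 371305, which matches the reported `appui.event.replay.time`.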
    
    There's some base class metric support for registration (everything is 
prefixed) and testability. There are now tests in `HistoryServerSuite."incomplete 
apps get refreshed"` for metrics being in the dumped list, and for specific 
values, especially averages (that they increase, that they don't trigger 
division-by-0 exceptions before there have been any loads).
    
    This is what the metrics look like after the tests (from a log of the 
toString value). 
    
    The `.last.attempted` values are `System.currentTimeMillis` timestamps of 
the operations. I'd considered a "time-since" gauge, but after some offline 
discussion with Allen Wittenauer, went for the absolute values; I'll leave it 
to the management tooling to work out elapsed times from absolute values if 
they want to use that for alerts or UIs.
    
    ```
    16/07/14 14:17:03.837 ScalaTest-main-running-HistoryServerSuite INFO 
HistoryServerSuite: Metrics:
    Metrics for history:
      Counters
      Gauges
    
    Metrics for history.fs:
      Counters
        history.provider.appui.event.count = 103
        history.provider.appui.load.count = 6
        history.provider.appui.load.duration = 38244482
        history.provider.appui.load.failure.count = 0
        history.provider.appui.load.not-found.count = 0
        history.provider.history.merge.duration = 16138193
        history.provider.history.merge.event.count = 83
        history.provider.update.count = 5
        history.provider.update.failure.count = 0
      Gauges
        history.provider.appui.event.replay.time = 371305
        history.provider.history.merge.event.time = 194436
        history.provider.update.last.attempted = 1468502223726
        history.provider.update.last.succeeded = 1468502223000
    
    Metrics for application.cache:
      Counters
        history.cache.eviction.count = 3
        history.cache.load.count = 4
        history.cache.lookup.count = 58
        history.cache.lookup.failure.count = 0
        history.cache.update.probe.count = 57
        history.cache.update.triggered.count = 3
      Gauges
    ```
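    
    On the tooling side, turning the absolute `.last.attempted` / 
`.last.succeeded` timestamps into elapsed times is a subtraction against the 
reader's own clock. A rough sketch, assuming millisecond timestamps as above; 
`GaugeAge`, its methods, and the staleness threshold are all hypothetical, and 
clamping negative values to 0 is just one way to tolerate clock skew:
    
    ```java
    /** Hypothetical monitoring-side helper; not part of the patch. */
    final class GaugeAge {
        private GaugeAge() {}

        /** Milliseconds between the gauge timestamp and "now", never negative. */
        static long elapsedMillis(long gaugeTimestampMillis, long nowMillis) {
            // Clamp to 0 in case the reader's clock is behind the server's.
            return Math.max(0L, nowMillis - gaugeTimestampMillis);
        }

        /** True when the last recorded operation is older than the allowed age. */
        static boolean isStale(long gaugeTimestampMillis, long nowMillis, long maxAgeMillis) {
            return elapsedMillis(gaugeTimestampMillis, nowMillis) > maxAgeMillis;
        }
    }
    ```
    
    A poller would pass `System.currentTimeMillis()` as `nowMillis` and pick 
whatever `maxAgeMillis` suits its alerting policy.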
    
    One thing to consider: gauges of the number of complete and incomplete 
applications? I know the REST UI gives this, but only indirectly (you call, you 
count the size of the lists). Doing it in the metrics provides something that 
could be monitored or probed in tests without making REST calls.

