Github user govind-menon commented on a diff in the pull request:
https://github.com/apache/storm/pull/2845#discussion_r220219222
--- Diff: docs/ClusterMetrics.md ---
@@ -0,0 +1,256 @@
+---
+title: Cluster Metrics
+layout: documentation
+documentation: true
+---
+
+#Cluster Metrics
+
+There are lots of metrics to help you monitor a running cluster. Many of
these metrics are still a work in progress and so is the metrics system itself
so any of them may change, even between minor version releases. We will try to
keep them as stable as possible, but they should all be considered somewhat
unstable. Some of the metrics may also be for experimental features, or
features that are not complete yet, so please read the description of the
metric before using it for monitoring or alerting.
+
+Also be aware that depending on the metrics system you use, the names are
likely to be translated into a different format that is compatible with the
system. Typically this means that the ':' separating character will be
replaced with a '.' character.
+
+Most metrics should have the units that they are reported in as a part of
the description. For Timers often this is configured by the reporter that is
uploading them to your system. Pay attention because even if the metric name
has a time unit in it, it may be false.
+
+Also most metrics, except for gauges and counters, are a collection of
numbers, and not a single value. Often these result in multiple metrics being
uploaded to a reporting system, such as percentiles for a histogram, or rates
for a meter. It is dependent on the configured metrics reporter how this
happens, or how the name here corresponds to the metric in your reporting
system.
+
+## Cluster Metrics (From Nimbus)
+
+These are metrics that come from the active nimbus instance and report the
state of the cluster as a whole, as seen by nimbus.
+
+| Metric Name | Type | Description |
+|-------------|------|-------------|
+| cluster:num-nimbus-leaders | gauge | Number of nimbuses marked as a
leader. This should really only ever be 1 in a health cluster, or 0 for a short
period of time while a failover happens. |
--- End diff --
Nit: "health cluster" should be "healthy cluster".
---