[GitHub] storm pull request #2845: STORM-3234: Replace old metrics docs with better d...

govind-menon Tue, 25 Sep 2018 07:40:27 -0700

Github user govind-menon commented on a diff in the pull request:

    https://github.com/apache/storm/pull/2845#discussion_r220219980
  
    --- Diff: docs/ClusterMetrics.md ---
    @@ -0,0 +1,256 @@
    +---
    +title: Cluster Metrics
    +layout: documentation
    +documentation: true
    +---
    +
    +# Cluster Metrics
    +
    +There are many metrics to help you monitor a running cluster.  Many of 
these metrics, and the metrics system itself, are still a work in progress, so 
any of them may change, even between minor version releases.  We will try to 
keep them as stable as possible, but they should all be considered somewhat 
unstable.  Some of the metrics may also cover experimental or incomplete 
features, so please read the description of a metric before using it for 
monitoring or alerting.
    +
    +Also be aware that depending on the metrics system you use, the names are 
likely to be translated into a different format that is compatible with the 
system.  Typically this means that the ':' separating character will be 
replaced with a '.' character.
    +
    +Most metrics should state the units they are reported in as a part of 
the description.  For timers, the unit is often configured by the reporter that 
uploads them to your system.  Pay attention: even if the metric name contains a 
time unit, the unit actually reported may differ.
    +
    +Also most metrics, except for gauges and counters, are a collection of 
numbers, and not a single value.  Often these result in multiple metrics being 
uploaded to a reporting system, such as percentiles for a histogram, or rates 
for a meter.  It is dependent on the configured metrics reporter how this 
happens, or how the name here corresponds to the metric in your reporting 
system.
    +
    +## Cluster Metrics (From Nimbus)
    +
    +These are metrics that come from the active nimbus instance and report the 
state of the cluster as a whole, as seen by nimbus.
    +
    +| Metric Name | Type | Description |
    +|-------------|------|-------------|
    +| cluster:num-nimbus-leaders | gauge | Number of nimbuses marked as a 
leader. This should really only ever be 1 in a healthy cluster, or 0 for a short 
period of time while a failover happens. |
    +| cluster:num-nimbuses | gauge | Number of nimbuses, leader or standby. |
    +| cluster:num-supervisors | gauge | Number of supervisors. |
    +| cluster:num-topologies | gauge | Number of topologies. |
    +| cluster:num-total-used-workers | gauge | Number of used workers/slots. |
    +| cluster:num-total-workers | gauge | Number of workers/slots. |
    +| cluster:total-fragmented-cpu-non-negative | gauge | Total fragmented CPU 
(% of core).  This is CPU that the system thinks it cannot use because other 
resources on the node are used up. |
    +| cluster:total-fragmented-memory-non-negative | gauge | Total fragmented 
memory (MB).  This is memory that the system thinks it cannot use because other 
resources on the node are used up.  |
    +| topologies:assigned-cpu | histogram | CPU scheduled per topology (% of a 
core) |
    +| topologies:assigned-mem-off-heap | histogram | Off heap memory scheduled 
per topology (MB) |
    +| topologies:assigned-mem-on-heap | histogram | On heap memory scheduled 
per topology (MB) |
    +| topologies:num-executors | histogram | Number of executors per topology. 
|
    +| topologies:num-tasks | histogram | Number of tasks per topology. |
    +| topologies:num-workers | histogram | Number of workers per topology. |
    +| topologies:replication-count | histogram | Replication count per 
topology. |
    +| topologies:requested-cpu | histogram | CPU requested per topology  (% of 
a core). |
    +| topologies:requested-mem-off-heap | histogram | Off heap memory 
requested per topology (MB). |
    +| topologies:requested-mem-on-heap | histogram | On heap memory requested 
per topology (MB). |
    +| topologies:uptime-secs | histogram | Uptime per topology (seconds). |
    +| nimbus:available-cpu-non-negative | gauge | Available cpu on the cluster 
(% of a core). |
    +| nimbus:total-cpu | gauge | Total CPU on the cluster (% of a core). |
    +| nimbus:total-memory | gauge | Total memory on the cluster (MB). |
    +| supervisors:fragmented-cpu | histogram | fragmented cpu per supervisor 
(% of a core) |
    +| supervisors:fragmented-mem | histogram | fragmented memory per 
supervisor (MB) |
    +| supervisors:num-used-workers | histogram | workers used per supervisor |
    +| supervisors:num-workers | histogram | number of workers per supervisor |
    +| supervisors:uptime-secs | histogram | uptime of supervisors |
    +| supervisors:used-cpu | histogram | cpu used per supervisor (% of a core) 
|
    +| supervisors:used-mem | histogram | memory used per supervisor (MB) |
    +
    +## Nimbus Metrics
    +
    +These are metrics that are specific to a nimbus instance.  In many 
instances only the active nimbus will be reporting these metrics, but they 
could come from standby nimbus instances as well.
    +
    +| Metric Name | Type | Description |
    +|-------------|------|-------------|
    +| nimbus:files-upload-duration-ms | timer | Time it takes to upload a file 
from start to finish (Not Blobs, but this may change) |
    +| nimbus:longest-scheduling-time-ms | gauge | Longest time ever taken so 
far to schedule. This includes the current scheduling run, which is intended to 
detect if scheduling is stuck for some reason. |
    +| nimbus:num-activate-calls | meter | calls to the activate thrift method. 
|
    +| nimbus:num-added-executors-per-scheduling | histogram | number of 
executors added after a scheduling run. |
    +| nimbus:num-added-slots-per-scheduling | histogram |  number of slots 
added after a scheduling run. |
    +| nimbus:num-beginFileUpload-calls | meter | calls to the beginFileUpload 
thrift method. |
    +| nimbus:num-blacklisted-supervisor | gauge | Number of supervisors 
currently marked as blacklisted because they appear to be somewhat unstable. |
    +| nimbus:num-deactivate-calls | meter | calls to deactivate thrift method. 
|
    +| nimbus:num-debug-calls | meter | calls to debug thrift method.|
    +| nimbus:num-downloadChunk-calls | meter | calls to downloadChunk thrift 
method. |
    +| nimbus:num-finishFileUpload-calls | meter | calls to finishFileUpload 
thrift method.|
    +| nimbus:num-gained-leadership | meter | number of times this nimbus 
gained leadership. |
    +| nimbus:num-getClusterInfo-calls | meter | calls to getClusterInfo thrift 
method. |
    +| nimbus:num-getComponentPageInfo-calls | meter | calls to 
getComponentPageInfo thrift method. |
    +| nimbus:num-getComponentPendingProfileActions-calls | meter | calls to 
getComponentPendingProfileActions thrift method. |
    +| nimbus:num-getLeader-calls | meter | calls to getLeader thrift method. |
    +| nimbus:num-getLogConfig-calls | meter | calls to getLogConfig thrift 
method. |
    +| nimbus:num-getNimbusConf-calls | meter | calls to getNimbusConf thrift 
method. |
    +| nimbus:num-getOwnerResourceSummaries-calls | meter | calls to 
getOwnerResourceSummaries thrift method. |
    +| nimbus:num-getSupervisorPageInfo-calls | meter | calls to 
getSupervisorPageInfo thrift method. |
    +| nimbus:num-getTopology-calls | meter | calls to getTopology thrift 
method. |
    +| nimbus:num-getTopologyConf-calls | meter | calls to getTopologyConf 
thrift method. |
    +| nimbus:num-getTopologyInfo-calls | meter | calls to getTopologyInfo 
thrift method. |
    +| nimbus:num-getTopologyInfoWithOpts-calls | meter | calls to 
getTopologyInfoWithOpts thrift method; includes calls to getTopologyInfo. |
    +| nimbus:num-getTopologyPageInfo-calls | meter | calls to 
getTopologyPageInfo thrift method. |
    +| nimbus:num-getUserTopology-calls | meter | calls to getUserTopology 
thrift method. |
    +| nimbus:num-isTopologyNameAllowed-calls | meter | calls to 
isTopologyNameAllowed thrift method. |
    +| nimbus:num-killTopology-calls | meter | calls to killTopology thrift 
method. |
    +| nimbus:num-killTopologyWithOpts-calls | meter | calls to 
killTopologyWithOpts thrift method; includes calls to killTopology. |
    +| nimbus:num-launched | meter | number of times a nimbus was launched |
    +| nimbus:num-lost-leadership | meter | number of times this nimbus lost 
leadership |
    +| nimbus:num-negative-resource-events | meter | any time a resource goes 
negative (either CPU or Memory)  Not consistent as it is used for internal 
calculations that may go negative and does not represent over scheduling of 
resources. |
    --- End diff --
    
    Do you mean "not inconsistent"?
    
    Nit: missing a semicolon or period after "(either CPU or Memory)".
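
The intro in the diff above says that the ':' separator is typically replaced with '.' by the reporting system, and that histograms, meters, and timers are usually uploaded as several series. A minimal sketch of what a reader might expect to see downstream follows. The function names and the suffixes are hypothetical and chosen only for illustration; the actual names depend on the configured metrics reporter.

```python
def translate_metric_name(name: str, separator: str = ".") -> str:
    """Replace the ':' used in the documented names with a reporter-friendly separator.

    Assumption: the reporter does a plain character substitution, as the
    docs describe for the typical case.
    """
    return name.replace(":", separator)


def expand_series(name: str, suffixes) -> list:
    """List the series a multi-valued metric might turn into.

    The suffixes are hypothetical; a real reporter picks its own.
    """
    return [f"{name}.{suffix}" for suffix in suffixes]


if __name__ == "__main__":
    base = translate_metric_name("topologies:num-workers")
    for series in expand_series(base, ("p50", "p99")):
        print(series)
```

A table like this one could carry a short example along these lines so readers know to search their reporting system for the translated name rather than the documented one.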
