vvraskin commented on a change in pull request #3884: Document metrics 
generated within OpenWhisk
URL: 
https://github.com/apache/incubator-openwhisk/pull/3884#discussion_r203763836
 
 

 ##########
 File path: docs/metrics.md
 ##########
 @@ -75,6 +75,183 @@ The docker image exposes StatsD via the (standard) port 
8125 and a Grafana dashb
 
 The address of your docker host has to be configured in the 
`metrics_kamon_statsd_host` configuration property.
 
+### Metric Names
+
+All metric names are prefixed with a prefix that you specify and are subject 
to modification by Graphite, Datadog, or StatsD. For example, if the prefix is 
`openwhisk`, then a metric name looks like 
`openwhisk.counter.controller_activation_start`. This document assumes that 
the metric name prefix is `openwhisk`.
+
+Currently OpenWhisk emits the following types of metrics:
+
+#### Counter
+
+Counters [record the 
count](http://kamon.io/documentation/0.6.x/kamon-core/metrics/instruments/#counters)
 of a metric and their names are prefixed with `openwhisk.counter`. For example 
`openwhisk.counter.controller_activation_start`. A counter only counts and 
resets to zero upon each flush.
+
+#### Histograms
+
+Histograms record the 
[distribution](http://kamon.io/documentation/0.6.x/kamon-core/metrics/instruments/#histograms)
 of a given metric and their names are prefixed with `openwhisk.histogram`. For 
example `openwhisk.histogram.controller_activation_finish`. A histogram metric 
may result in multiple values at the metric aggregator level. For example, in 
[Datadog](https://docs.datadoghq.com/developers/metrics/histograms/) the 
following values are recorded for each histogram metric:
+
+* `my_metric.avg` - Average of aggregated values during the flush interval.
+* `my_metric.count` - Count of aggregated values during the flush interval.
+* `my_metric.median` - Median of aggregated values during the flush interval.
+* `my_metric.95percentile` - 95th percentile value of aggregated values during 
the flush interval.
+* `my_metric.max` - Max of aggregated values during the flush interval.
+* `my_metric.min` - Min of aggregated values during the flush interval.
+
+### Metric Details
+
+Below are some of the important metrics emitted by an OpenWhisk setup.
+
+#### Controller metrics
+
+The metrics below are emitted from within a Controller instance.
+
+##### Controller Startup
+
+* `openwhisk.counter.controller_startup<controller_id>_count` (counter)
+  * Example _openwhisk.counter.controller_startup0_count_
+  * Records the count of controller instance startups.
+
+##### Activation Submission
+
+The following metrics record stats around activation handling within the 
Controller.
+
+* Normal actions
+  * `openwhisk.counter.controller_activation_start` (counter) - Records the 
count of non-blocking activations started.
+  * `openwhisk.histogram.controller_activation_finish` (histogram) - Records 
the overall time taken for a non-blocking activation to be submitted to the 
load balancer.
+* Blocking actions
+  * `openwhisk.counter.controller_blockingActivation_start` (counter) - 
Records the count of blocking activations started.
+  * `openwhisk.histogram.controller_blockingActivation_finish` (histogram) - 
Records the time taken for a blocking activation to finish or timeout.
+
+##### Load Balancer
+
+Aggregate metrics for in-flight activations.
+
+* `openwhisk.histogram.loadbalancer<controllerId>_activationsInflight_count` 
(histogram) - Records the number of activations being worked on for a given 
controller. As a histogram, it gives the distribution of the in-flight 
activation count within a flush interval.
+* `openwhisk.histogram.loadbalancer<controllerId>_memoryInflight_count` 
(histogram) - Records the amount of memory in use for in-flight activations. 
This is not the actual runtime memory but the memory specified by the action 
limits.
+
+The metrics below are captured within the load balancer.
+
+* `openwhisk.counter.loadbalancer_activations_count` (counter) - Records the 
count of activations sent to Kafka.
+* `openwhisk.counter.controller_kafka_start` (counter) - Records the count of 
activations sent to Kafka.
+* `openwhisk.counter.controller_kafka_error` (counter) - Records the count of 
activations which encountered a failure while being submitted to Kafka.
+* `openwhisk.histogram.controller_kafka_finish` (histogram) - Records the time 
taken when an activation was successfully submitted to Kafka.
+* `openwhisk.histogram.controller_kafka_error` (histogram) - Records the time 
taken when an activation submission to Kafka resulted in a failure.
+* `openwhisk.counter.controller_loadbalancer_start` (counter) - Records the 
count of activations submitted to the load balancer.
+* `openwhisk.histogram.controller_loadbalancer_finish` (histogram) - Records 
the time taken to submit to the load balancer.
+
+Metrics below are for invoker state as recorded within load balancer 
monitoring.
+
+* `openwhisk.counter.loadbalancer_invokerOffline_count` - Records the count of 
invokers considered offline based on health pings.
+* `openwhisk.counter.loadbalancer_invokerUnhealthy_count` - Records the count 
of invokers considered unhealthy based on health pings.
+
+#### Invoker metrics
+
+##### Container Init
+
+* `openwhisk.counter.invoker_activationInit_start` (counter) - Count of 
container initializations done.
+* `openwhisk.histogram.invoker_activationInit_finish` (histogram) - Time taken 
for successful container initializations.
+* `openwhisk.histogram.invoker_activationInit_error` (histogram) - Time taken 
when container initialization failed. The count value of this histogram gives 
insight into the number of failed initializations.
+
+##### Container Run
+
+* `openwhisk.counter.invoker_activationRun_start` (counter) - Count of action 
executions performed.
+* `openwhisk.histogram.invoker_activationRun_finish` (histogram) - Time taken 
for successful action executions.
+* `openwhisk.histogram.invoker_activationRun_error` (histogram) - Time taken 
for failed action executions. The count value of this histogram gives insight 
into the number of failed executions.
+
+##### Container Start
+
+* `openwhisk.counter.invoker_containerStart.cold_count` (counter) - Count of 
cold starts.
+* `openwhisk.counter.invoker_containerStart.recreated_count` (counter) - Count 
of times a container is recreated.
+* `openwhisk.counter.invoker_containerStart.warm_count` (counter) - Count of 
times a warm container is used.
+
+##### Log Collection
+
+* `openwhisk.counter.invoker_collectLogs_start` (counter) - Count of times 
logs were collected.
+* `openwhisk.counter.invoker_collectLogs_error` (counter) - Count of failed 
log collections.
+* `openwhisk.histogram.invoker_collectLogs_error` (histogram) - Time taken for 
failed log collections.
+* `openwhisk.histogram.invoker_collectLogs_finish` (histogram) - Time taken 
for successful log collections.
+
+##### Activation Handling
+
+* `openwhisk.counter.invoker_activation_start` (counter) - Count of 
activations handled.
+
+##### Docker Metrics
+
+The following metrics capture stats around various docker command executions.
+
+* pause
+  * `openwhisk.counter.invoker_docker.pause_start`
+  * `openwhisk.counter.invoker_docker.pause_error`
+  * `openwhisk.histogram.invoker_docker.pause_finish`
+  * `openwhisk.histogram.invoker_docker.pause_error`
+* ps
+  * `openwhisk.counter.invoker_docker.ps_start`
+  * `openwhisk.counter.invoker_docker.ps_error`
+  * `openwhisk.histogram.invoker_docker.ps_finish`
+  * `openwhisk.histogram.invoker_docker.ps_error`
+* pull
+  * `openwhisk.counter.invoker_docker.pull_start`
+  * `openwhisk.counter.invoker_docker.pull_error`
+  * `openwhisk.histogram.invoker_docker.pull_finish`
+  * `openwhisk.histogram.invoker_docker.pull_error`
+* rm
+  * `openwhisk.counter.invoker_docker.rm_start`
+  * `openwhisk.counter.invoker_docker.rm_error`
+  * `openwhisk.histogram.invoker_docker.rm_finish`
+  * `openwhisk.histogram.invoker_docker.rm_error`
+* run
+  * `openwhisk.counter.invoker_docker.run_start`
+  * `openwhisk.counter.invoker_docker.run_error`
+  * `openwhisk.histogram.invoker_docker.run_finish`
+  * `openwhisk.histogram.invoker_docker.run_error`
+* unpause
+  * `openwhisk.counter.invoker_docker.unpause_start`
+  * `openwhisk.counter.invoker_docker.unpause_error`
+  * `openwhisk.histogram.invoker_docker.unpause_finish`
+  * `openwhisk.histogram.invoker_docker.unpause_error`
+
+#### Kafka Metrics
+
+The metrics below are emitted per Kafka topic.
+
+* `openwhisk.histogram.kafka_<topic name>.delay_start` - Time delay between 
when a message was pushed to Kafka and when it is read by a consumer.
 
 Review comment:
   Should we also mention that the delay metric is emitted for each poll by the 
Invoker, while the queue metric is emitted every 10 seconds?
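
To make the distinction concrete, here is a minimal Python sketch of the two emission patterns. It is not OpenWhisk code: `TopicMetrics`, `on_poll`, `on_tick`, the `emit` callback, and the queue metric name are all hypothetical, and it assumes one delay value per message returned by a poll.

```python
from typing import Callable, Iterable, Optional

# Hypothetical sink: receives (metric_name, value) pairs.
Emit = Callable[[str, float], None]


class TopicMetrics:
    """Illustrative only: per-poll delay values plus a periodic queue value."""

    def __init__(self, topic: str, emit: Emit, queue_interval_s: float = 10.0):
        self.topic = topic
        self.emit = emit
        self.queue_interval_s = queue_interval_s
        self._last_queue_emit: Optional[float] = None

    def on_poll(self, now_ms: int, produced_at_ms: Iterable[int]) -> None:
        # Assumption: every message returned by a poll yields one delay value,
        # so the emission rate follows the polling and message rate.
        name = f"openwhisk.histogram.kafka_{self.topic}.delay_start"
        for ts in produced_at_ms:
            self.emit(name, now_ms - ts)

    def on_tick(self, now_s: float, queue_size: int) -> None:
        # The queue value is emitted at most once per interval, regardless of
        # how often the consumer polls. The metric name here is a placeholder.
        due = (
            self._last_queue_emit is None
            or now_s - self._last_queue_emit >= self.queue_interval_s
        )
        if due:
            self.emit(f"openwhisk.histogram.kafka_{self.topic}.queue", queue_size)
            self._last_queue_emit = now_s
```

With this shape, a busy topic produces many delay samples per flush interval, while the queue value arrives at a fixed cadence. That difference is what the docs could call out.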

