vvraskin commented on a change in pull request #3884: Document metrics generated within OpenWhisk
URL: https://github.com/apache/incubator-openwhisk/pull/3884#discussion_r203763836
##########
File path: docs/metrics.md
##########
@@ -75,6 +75,183 @@ The docker image exposes StatsD via the (standard) port 8125 and a Grafana dashb
 The address of your docker host has to be configured in the `metrics_kamon_statsd_host` configuration property.
+### Metric Names
+
+All metric names are prefixed with a prefix that you specify and are subject to modification by Graphite, Datadog, or StatsD. For example, if the prefix is `openwhisk`, a metric name looks like `openwhisk.counter.controller_activation_start`. This document assumes that the metric name prefix is `openwhisk`.
+
+Currently OpenWhisk emits the following types of metrics:
+
+#### Counter
+
+Counters [record the count](http://kamon.io/documentation/0.6.x/kamon-core/metrics/instruments/#counters) of a metric, and their names are prefixed with `openwhisk.counter`. For example, `openwhisk.counter.controller_activation_start`. Counters only count and reset to zero upon each flush.
+
+#### Histograms
+
+Histograms record the [distribution](http://kamon.io/documentation/0.6.x/kamon-core/metrics/instruments/#histograms) of a given metric, and their names are prefixed with `openwhisk.histogram`. For example, `openwhisk.histogram.controller_activation_finish`. A histogram metric may result in multiple values at the metric aggregator level. For example, in [Datadog](https://docs.datadoghq.com/developers/metrics/histograms/) the following values are recorded for each histogram metric:
+
+* `my_metric.avg` - Average of aggregated values during the flush interval.
+* `my_metric.count` - Count of aggregated values during the flush interval.
+* `my_metric.median` - Median of aggregated values during the flush interval.
+* `my_metric.95percentile` - 95th percentile value of aggregated values during the flush interval.
+* `my_metric.max` - Max of aggregated values during the flush interval.
+* `my_metric.min` - Min of aggregated values during the flush interval.
+
+### Metric Details
+
+Below are some of the important metrics emitted by an OpenWhisk setup.
+
+#### Controller metrics
+
+Metrics below are emitted from within a Controller instance.
+
+##### Controller Startup
+
+* `openwhisk.counter.controller_startup<controller_id>_count` (counter)
+  * Example _openwhisk.counter.controller_startup0_count_
+  * Records the count of controller instance startups.
+
+##### Activation Submission
+
+The following metrics record stats around activation handling within the Controller.
+
+* Normal actions
+  * `openwhisk.counter.controller_activation_start` (counter) - Records the count of non-blocking activations started.
+  * `openwhisk.histogram.controller_activation_finish` (histogram) - Records the overall time taken for a non-blocking activation to be submitted to the load balancer.
+* Blocking actions
+  * `openwhisk.counter.controller_blockingActivation_start` (counter) - Records the count of blocking activations started.
+  * `openwhisk.histogram.controller_blockingActivation_finish` (histogram) - Records the time taken for a blocking activation to finish or time out.
+
+##### Load Balancer
+
+Aggregate metrics for inflight activations.
+
+* `openwhisk.histogram.loadbalancer<controllerId>_activationsInflight_count` (histogram) - Records the number of activations being worked upon for a given controller. As a histogram, it gives the distribution of the inflight activation count within a flush interval.
+* `openwhisk.histogram.loadbalancer<controllerId>_memoryInflight_count` (histogram) - Records the amount of RAM in use for inflight activations. This is not the actual runtime memory but the memory specified in the action limits.
+
+Metrics below are captured within the load balancer.
+
+* `openwhisk.counter.loadbalancer_activations_count` (counter) - Records the count of activations sent to Kafka.
+* `openwhisk.counter.controller_kafka_start` (counter) - Records the count of activations sent to Kafka.
+* `openwhisk.counter.controller_kafka_error` (counter) - Records the count of activations which encountered a failure while being submitted to Kafka.
+* `openwhisk.histogram.controller_kafka_finish` (histogram) - Records the time taken when an activation was successfully submitted to Kafka.
+* `openwhisk.histogram.controller_kafka_error` (histogram) - Records the time taken when an activation submission to Kafka resulted in failure.
+* `openwhisk.counter.controller_loadbalancer_start` (counter) - Records the count of activations submitted to the load balancer.
+* `openwhisk.histogram.controller_loadbalancer_finish` (histogram) - Records the time taken to submit to the load balancer.
+
+Metrics below are for invoker state as recorded within load balancer monitoring.
+
+* `openwhisk.counter.loadbalancer_invokerOffline_count` - Records the count of invokers considered offline based on health pings.
+* `openwhisk.counter.loadbalancer_invokerUnhealthy_count` - Records the count of invokers considered unhealthy based on health pings.
+
+#### Invoker metrics
+
+##### Container Init
+
+* `openwhisk.counter.invoker_activationInit_start` (counter) - Count of container initializations done.
+* `openwhisk.histogram.invoker_activationInit_finish` (histogram) - Time taken for successful container initializations.
+* `openwhisk.histogram.invoker_activationInit_error` (histogram) - Time taken when container initialization failed. The count values of this histogram give insight into the number of failed initializations.
+
+##### Container Run
+
+* `openwhisk.counter.invoker_activationRun_start` (counter) - Count of action executions performed.
+* `openwhisk.histogram.invoker_activationRun_finish` (histogram) - Time taken for successful action executions.
+* `openwhisk.histogram.invoker_activationRun_error` (histogram) - Time taken for failed action executions. The count values of this histogram give insight into the number of failed executions.
+
+##### Container Start
+
+* `openwhisk.counter.invoker_containerStart.cold_count` (counter) - Count of cold starts.
+* `openwhisk.counter.invoker_containerStart.recreated_count` (counter) - Count of times a container is recreated.
+* `openwhisk.counter.invoker_containerStart.warm_count` (counter) - Count of times a warm container is used.
+
+##### Log Collection
+
+* `openwhisk.counter.invoker_collectLogs_start` (counter) - Count of times logs were collected.
+* `openwhisk.counter.invoker_collectLogs_error` (counter) - Count of failed log collections.
+* `openwhisk.histogram.invoker_collectLogs_error` (histogram) - Time taken for failed log collection.
+* `openwhisk.histogram.invoker_collectLogs_finish` (histogram) - Time taken for successful log collection.
+
+##### Activation Handling
+
+* `openwhisk.counter.invoker_activation_start` (counter) - Count of activations handled.
+
+##### Docker Metrics
+
+The following metrics capture stats around various docker command executions.
+
+* pause
+  * `openwhisk.counter.invoker_docker.pause_start`
+  * `openwhisk.counter.invoker_docker.pause_error`
+  * `openwhisk.histogram.invoker_docker.pause_finish`
+  * `openwhisk.histogram.invoker_docker.pause_error`
+* ps
+  * `openwhisk.counter.invoker_docker.ps_start`
+  * `openwhisk.counter.invoker_docker.ps_error`
+  * `openwhisk.histogram.invoker_docker.ps_finish`
+  * `openwhisk.histogram.invoker_docker.ps_error`
+* pull
+  * `openwhisk.counter.invoker_docker.pull_start`
+  * `openwhisk.counter.invoker_docker.pull_error`
+  * `openwhisk.histogram.invoker_docker.pull_finish`
+  * `openwhisk.histogram.invoker_docker.pull_error`
+* rm
+  * `openwhisk.counter.invoker_docker.rm_start`
+  * `openwhisk.counter.invoker_docker.rm_error`
+  * `openwhisk.histogram.invoker_docker.rm_finish`
+  * `openwhisk.histogram.invoker_docker.rm_error`
+* run
+  * `openwhisk.counter.invoker_docker.run_start`
+  * `openwhisk.counter.invoker_docker.run_error`
+  * `openwhisk.histogram.invoker_docker.run_finish`
+  * `openwhisk.histogram.invoker_docker.run_error`
+* unpause
+  * `openwhisk.counter.invoker_docker.unpause_start`
+  * `openwhisk.counter.invoker_docker.unpause_error`
+  * `openwhisk.histogram.invoker_docker.unpause_finish`
+  * `openwhisk.histogram.invoker_docker.unpause_error`
+
+#### Kafka Metrics
+
+Metrics below are emitted per Kafka topic.
+
+* `openwhisk.histogram.kafka_<topic name>.delay_start` - Time delay between when a message was pushed to Kafka and when it is read within a consumer.

Review comment:
should we also mention that Delay is being emitted for each pool by Invoker, while Queue metric is emitted every 10 seconds?
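
As a reading aid for the naming scheme in the hunk above, here is a minimal Python sketch of how the documented names fit together. It is not OpenWhisk or Kamon code: the helper names (`metric_name`, `histogram_series`, `docker_metrics`) are hypothetical, the `openwhisk` prefix is the one the docs assume, and the histogram suffixes are the Datadog ones listed in the docs. Other backends may rename or reshape these series.

```python
# Hypothetical helpers for illustration only; not part of OpenWhisk or Kamon.

# Per-flush values that Datadog derives from one histogram, as listed in the docs.
HISTOGRAM_SUFFIXES = ("avg", "count", "median", "95percentile", "max", "min")

# Docker commands covered in the "Docker Metrics" section of the docs.
DOCKER_COMMANDS = ("pause", "ps", "pull", "rm", "run", "unpause")


def metric_name(kind, name, prefix="openwhisk"):
    """Compose `<prefix>.<kind>.<name>` following the documented scheme."""
    if kind not in ("counter", "histogram"):
        raise ValueError(f"unknown metric kind: {kind}")
    return f"{prefix}.{kind}.{name}"


def histogram_series(name, prefix="openwhisk"):
    """List the series an aggregator such as Datadog may derive from one histogram."""
    base = metric_name("histogram", name, prefix)
    return [f"{base}.{suffix}" for suffix in HISTOGRAM_SUFFIXES]


def docker_metrics(command, prefix="openwhisk"):
    """List the four metrics documented for a single docker command."""
    stem = f"invoker_docker.{command}"
    return [
        metric_name("counter", f"{stem}_start", prefix),
        metric_name("counter", f"{stem}_error", prefix),
        metric_name("histogram", f"{stem}_finish", prefix),
        metric_name("histogram", f"{stem}_error", prefix),
    ]


if __name__ == "__main__":
    print(metric_name("counter", "controller_activation_start"))
    for series in histogram_series("controller_activation_finish"):
        print(series)
    for command in DOCKER_COMMANDS:
        for name in docker_metrics(command):
            print(name)
```

Generating the names this way makes it easier to check that every docker command in the list carries the same four metrics, and to see which derived series a dashboard query would need for a given histogram.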