This is an automated email from the ASF dual-hosted git repository.
yuchaoran pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-yunikorn-site.git
The following commit(s) were added to refs/heads/master by this push:
new be96059 [YUNIKORN-925]Add document to explain existing yunikorn metrics (#94)
be96059 is described below
commit be96059df4c40a926e59696f997a70ba9fea6d40
Author: Tingyao Huang <[email protected]>
AuthorDate: Mon Nov 22 14:59:22 2021 +0800
[YUNIKORN-925]Add document to explain existing yunikorn metrics (#94)
---
docs/performance/metrics.md | 39 ++++++++++++++++++++++++++++++++++++++-
1 file changed, 38 insertions(+), 1 deletion(-)
diff --git a/docs/performance/metrics.md b/docs/performance/metrics.md
index d7ebfa4..9dedbec 100644
--- a/docs/performance/metrics.md
+++ b/docs/performance/metrics.md
@@ -25,13 +25,50 @@ under the License.
-->
YuniKorn leverages [Prometheus](https://prometheus.io/) to record metrics. The metrics system keeps tracking of
-scheduler's critical execution paths, to reveal potential performance bottlenecks. Currently, there are two categories
+scheduler's critical execution paths, to reveal potential performance bottlenecks. Currently, there are three categories
for these metrics:
- scheduler: generic metrics of the scheduler, such as allocation latency, num of apps etc.
- queue: each queue has its own metrics sub-system, tracking queue status.
+- event: record various changes of events in YuniKorn.
all metrics are declared in `yunikorn` namespace.
+### Scheduler Metrics
+
+| Metrics Name | Metrics Type | Description |
+| --------------------- | ------------ | ------------ |
+| containerAllocation | Counter | Total number of attempts to allocate containers. State of the attempt includes `allocated`, `rejected`, `error`, `released`. Increase only. |
+| applicationSubmission | Counter | Total number of application submissions. State of the attempt includes `accepted` and `rejected`. Increase only. |
+| applicationStatus | Gauge | Total number of application status. State of the application includes `running` and `completed`. |
+| totalNodeActive | Gauge | Total number of active nodes. |
+| totalNodeFailed | Gauge | Total number of failed nodes. |
+| nodeResourceUsage | Gauge | Total resource usage of node, by resource name. |
+| schedulingLatency | Histogram | Latency of the main scheduling routine, in seconds. |
+| nodeSortingLatency | Histogram | Latency of all nodes sorting, in seconds. |
+| appSortingLatency | Histogram | Latency of all applications sorting, in seconds. |
+| queueSortingLatency | Histogram | Latency of all queues sorting, in seconds. |
+| tryNodeLatency | Histogram | Latency of node condition checks for container allocations, such as placement constraints, in seconds. |
+
+### Queue Metrics
+
+| Metrics Name | Metrics Type | Description |
+| ------------------------- | ------------- | ----------- |
+| appMetrics | Counter | Application Metrics, record the total number of applications. State of the application includes `accepted`, `rejected` and `Completed`. |
+| usedResourceMetrics | Gauge | Queue used resource. |
+| pendingResourceMetrics | Gauge | Queue pending resource. |
+| availableResourceMetrics | Gauge | Used resource metrics related to queues etc. |
+
+### Event Metrics
+
+| Metrics Name | Metrics Type | Description |
+| ------------------------ | ------------ | ----------- |
+| totalEventsCreated | Gauge | Total events created. |
+| totalEventsChanneled | Gauge | Total events channeled. |
+| totalEventsNotChanneled | Gauge | Total events not channeled. |
+| totalEventsProcessed | Gauge | Total events processed. |
+| totalEventsStored | Gauge | Total events stored. |
+| totalEventsNotStored | Gauge | Total events not stored. |
+| totalEventsCollected | Gauge | Total events collected. |
## Access Metrics