(yunikorn-site) branch master updated: [YUNIKORN-1626] Listing Yunikorn metrics revealed in the prometheus (#330)

samhxwu Thu, 02 Nov 2023 19:22:36 -0700

This is an automated email from the ASF dual-hosted git repository.

samhxwu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/yunikorn-site.git



The following commit(s) were added to refs/heads/master by this push:
     new 0da28548b4 [YUNIKORN-1626] Listing Yunikorn metrics revealed in the prometheus (#330)
0da28548b4 is described below

commit 0da28548b43c336633840d884ba7ef08a789d13e
Author: YuTeng Chen <[email protected]>
AuthorDate: Fri Nov 3 10:21:54 2023 +0800

    [YUNIKORN-1626] Listing Yunikorn metrics revealed in the prometheus (#330)
    
    Signed-off-by: Hsuan Zong Wu <[email protected]>
---
 docs/metrics/queue.md      |  63 +++++++++
 docs/metrics/runtime.md    |  39 ++++++
 docs/metrics/scheduler.mdx | 314 +++++++++++++++++++++++++++++++++++++++++++++
 sidebars.js                |   9 ++
 4 files changed, 425 insertions(+)

diff --git a/docs/metrics/queue.md b/docs/metrics/queue.md
new file mode 100644
index 0000000000..0aa1f7cd36
--- /dev/null
+++ b/docs/metrics/queue.md
@@ -0,0 +1,63 @@
+---
+id: queue
+title: Queue
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Queue
+### Application
+Each queue has a `<queue_name> queue_app` metric to trace the applications in the queue.
+The `<queue_name> queue_app` metric records the number of applications in different states.
+These application states include `running`, `accepted`, `rejected`, `failed` and `completed`.
+The `<queue_name> queue_app` metric also records container states, including `released` and `allocated`.
+**Metric Type**: `gauge`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `<queue name>`
+
+**TYPE**: `yunikorn_<queue name>_queue_app`
+
+```json
+yunikorn_root_default_queue_app{state="accepted"} 3
+yunikorn_root_default_queue_app{state="running"} 3
+```
+
+### Resource
+The `<queue_name> queue_resource` metric traces the resources in the queue.
+These resource states include `guaranteed`, `max`, `allocated`, `pending` and `preempting`.
+
+**Metric Type**: `gauge`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `<queue name>`
+
+**TYPE**: `yunikorn_<queue name>_queue_resource`
+
+```json
+yunikorn_root_queue_resource{resource="ephemeral-storage",state="max"} 9.41009558e+10
+yunikorn_root_queue_resource{resource="hugepages-1Gi",state="max"} 0
+yunikorn_root_queue_resource{resource="hugepages-2Mi",state="max"} 0
+yunikorn_root_queue_resource{resource="memory",state="max"} 1.6223076352e+10
+yunikorn_root_queue_resource{resource="pods",state="max"} 110
+yunikorn_root_queue_resource{resource="vcore",state="max"} 8000
+```
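The sample lines in `docs/metrics/queue.md` above use the Prometheus text exposition format: a metric name, an optional set of labels, and a numeric value. As a rough illustration, the Python sketch below parses such lines and groups the application counts by state. The helper `queue_metric_name` is hypothetical and rests on an assumption inferred only from the samples (`root.default` appearing as `yunikorn_root_default_queue_app`), namely that dots in the queue path become underscores; the commit itself does not state that rule.

```python
import re

# One sample line: metric name, optional {label="value",...} block, numeric value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one exposition-format sample line into (name, labels, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


def queue_metric_name(queue_path, suffix):
    """Hypothetical helper: ASSUMES dots in the queue path map to underscores."""
    return f"yunikorn_{queue_path.replace('.', '_')}_{suffix}"


# Sample lines copied from the queue documentation in this commit.
samples = [
    'yunikorn_root_default_queue_app{state="accepted"} 3',
    'yunikorn_root_default_queue_app{state="running"} 3',
    'yunikorn_root_queue_resource{resource="memory",state="max"} 1.6223076352e+10',
]
parsed = [parse_sample(s) for s in samples]

# Collect the application counts of the root.default queue, keyed by state.
wanted = queue_metric_name("root.default", "queue_app")
apps_by_state = {
    labels["state"]: value for name, labels, value in parsed if name == wanted
}
print(apps_by_state)
```

A regular expression is enough for the simple lines shown here; a production scraper would more likely rely on an existing Prometheus client library, which also handles escaping and comment lines.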
diff --git a/docs/metrics/runtime.md b/docs/metrics/runtime.md
new file mode 100644
index 0000000000..0f787df25b
--- /dev/null
+++ b/docs/metrics/runtime.md
@@ -0,0 +1,39 @@
+---
+id: runtime
+title: Runtime
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## MemStats
+| Yunikorn Metric         | Runtime MemStats                                   | Metric Type |
+|-------------------------|----------------------------------------------------|-------------|
+| go_mem_stats            | `Alloc`, `TotalAlloc`, `Sys`, `HeapIdle` and so on | `gauge`     |
+| go_pause_ns             | `PauseNs`                                          | `gauge`     |
+| go_pause_end            | `PauseEnd`                                         | `gauge`     |
+| go_alloc_bysize_maxsize | `BySize.Size`                                      | `histogram` |
+| go_alloc_bysize_free    | `BySize.Frees`                                     | `histogram` |
+| go_alloc_bysize_malloc  | `BySize.Mallocs`                                   | `histogram` |
+
+## Generic
+The `go_generic` metric includes descriptions of the supported metrics
+in the `runtime/metrics` package.
+
+**Metric Type**: `gauge`
diff --git a/docs/metrics/scheduler.mdx b/docs/metrics/scheduler.mdx
new file mode 100644
index 0000000000..cc186e203b
--- /dev/null
+++ b/docs/metrics/scheduler.mdx
@@ -0,0 +1,314 @@
+---
+id: scheduler
+title: Scheduler
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+```mdx-code-block
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+```
+
+## Container
+### Total allocation attempt
+Total number of attempts to allocate containers.
+State of the attempt includes `allocated`, `rejected`, `error`, `released`.
+
+**Metric Type**: `counter`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+
+```json
+yunikorn_scheduler_container_allocation_attempt_total{state="allocated"} 0
+yunikorn_scheduler_container_allocation_attempt_total{state="error"} 0
+yunikorn_scheduler_container_allocation_attempt_total{state="released"} 0
+```
+
+## Application
+### Total
+Total number of applications.
+State of the application includes `running`, `failed` and `completed`.
+
+**Metric Type**: `gauge`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+```
+yunikorn_scheduler_application_total{state="running"} 0
+```
+
+### Total Submission
+Total number of application submissions.
+State of the attempt includes `accepted` and `rejected`.
+
+**Metric Type**: `counter`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+```
+yunikorn_scheduler_application_submission_total{result="accepted"} 6
+```
+
+## Latency
+### Scheduling latency
+Latency of the main scheduling routine, in milliseconds.
+This metric includes latencies, such as `Node sorting`, `Trynode` and `Trypreemption`.
+
+**Metric Type**: `histogram`
+
+**Interval**: `millisecond`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+
+```json
+yunikorn_scheduler_scheduling_latency_milliseconds_bucket{le="0.0001"} 0
+yunikorn_scheduler_scheduling_latency_milliseconds_bucket{le="0.001"} 0
+yunikorn_scheduler_scheduling_latency_milliseconds_bucket{le="0.01"} 0
+yunikorn_scheduler_scheduling_latency_milliseconds_bucket{le="0.1"} 0
+yunikorn_scheduler_scheduling_latency_milliseconds_bucket{le="1"} 0
+yunikorn_scheduler_scheduling_latency_milliseconds_bucket{le="10"} 0
+yunikorn_scheduler_scheduling_latency_milliseconds_bucket{le="+Inf"} 0
+yunikorn_scheduler_scheduling_latency_milliseconds_sum 0
+yunikorn_scheduler_scheduling_latency_milliseconds_count 0
+```
+
+### Node sorting
+Latencies including `node sorting`, `application sorting` and `queue sorting`, in milliseconds.
+
+**Metric Type**: `histogram`
+
+**Interval**: `millisecond`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+
+```mdx-code-block
+<Tabs
+  defaultValue="node_sorting"
+  values={[
+    { label: 'Node sorting', value: 'node_sorting'},
+    { label: 'App sorting', value: 'app_sorting'},
+    { label: 'Queue sorting', value: 'queue_sorting'},
+  ]}>
+<TabItem value="app_sorting">
+
+  ```json
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="app",le="0.0001"} 5
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="app",le="0.001"} 6
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="app",le="0.01"} 6
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="app",le="0.1"} 6
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="app",le="1"} 6
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="app",le="10"} 6
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="app",le="+Inf"} 6
+  yunikorn_scheduler_node_sorting_latency_milliseconds_sum{level="app"} 0.00026345400000000004
+  yunikorn_scheduler_node_sorting_latency_milliseconds_count{level="app"} 6
+  ```
+
+</TabItem>
+<TabItem value="node_sorting">
+
+  ```json
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="node",le="0.0001"} 3
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="node",le="0.001"} 3
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="node",le="0.01"} 3
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="node",le="0.1"} 3
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="node",le="1"} 3
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="node",le="10"} 3
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="node",le="+Inf"} 3
+  yunikorn_scheduler_node_sorting_latency_milliseconds_sum{level="node"} 2.5013999999999998e-05
+  yunikorn_scheduler_node_sorting_latency_milliseconds_count{level="node"} 3
+  ```
+
+</TabItem>
+<TabItem value="queue_sorting">
+
+  ```json
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="queue",le="0.0001"} 9
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="queue",le="0.001"} 9
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="queue",le="0.01"} 9
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="queue",le="0.1"} 9
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="queue",le="1"} 9
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="queue",le="10"} 9
+  yunikorn_scheduler_node_sorting_latency_milliseconds_bucket{level="queue",le="+Inf"} 9
+  yunikorn_scheduler_node_sorting_latency_milliseconds_sum{level="queue"} 4.0093e-05
+  yunikorn_scheduler_node_sorting_latency_milliseconds_count{level="queue"} 9
+  ```
+
+</TabItem>
+</Tabs>
+
+### Trynode
+Latency of node condition checks for container allocations, such as placement constraints, in milliseconds.
+
+**Metric Type**: `histogram`
+
+**Interval**: `millisecond`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+
+```json
+yunikorn_scheduler_trynode_latency_milliseconds_bucket{le="0.0001"} 0
+yunikorn_scheduler_trynode_latency_milliseconds_bucket{le="0.001"} 0
+yunikorn_scheduler_trynode_latency_milliseconds_bucket{le="0.01"} 0
+yunikorn_scheduler_trynode_latency_milliseconds_bucket{le="0.1"} 0
+yunikorn_scheduler_trynode_latency_milliseconds_bucket{le="1"} 0
+yunikorn_scheduler_trynode_latency_milliseconds_bucket{le="10"} 0
+yunikorn_scheduler_trynode_latency_milliseconds_bucket{le="+Inf"} 0
+yunikorn_scheduler_trynode_latency_milliseconds_sum 0
+yunikorn_scheduler_trynode_latency_milliseconds_count 0
+```
+
+### Trypreemption
+Latency of preemption condition checks for container allocations, in milliseconds.
+
+**Metric Type**: `histogram`
+
+**Interval**: `millisecond`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+
+```json
+yunikorn_scheduler_trypreemption_latency_milliseconds_bucket{le="0.0001"} 0
+yunikorn_scheduler_trypreemption_latency_milliseconds_bucket{le="0.001"} 0
+yunikorn_scheduler_trypreemption_latency_milliseconds_bucket{le="0.01"} 0
+yunikorn_scheduler_trypreemption_latency_milliseconds_bucket{le="0.1"} 0
+yunikorn_scheduler_trypreemption_latency_milliseconds_bucket{le="1"} 0
+yunikorn_scheduler_trypreemption_latency_milliseconds_bucket{le="10"} 0
+yunikorn_scheduler_trypreemption_latency_milliseconds_bucket{le="+Inf"} 0
+yunikorn_scheduler_trypreemption_latency_milliseconds_sum 0
+yunikorn_scheduler_trypreemption_latency_milliseconds_count 0
+```
+## Node
+### Node
+Total number of nodes.
+State of the node includes `active` and `failed`.
+
+**Metric Type**: `gauge`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+
+```json
+yunikorn_scheduler_node{state="active"} 1
+yunikorn_scheduler_node{state="failed"} 0
+```
+
+### Total node usage
+`yunikorn_scheduler_<resource type>_node_usage_total`
+Total resource usage of node, by resource name.
+
+**Metric Type**: `gauge`
+
+**Namespace**: `yunikorn`
+
+**Subsystem**: `scheduler`
+
+```mdx-code-block
+<Tabs
+  defaultValue="ephemeral_storage"
+  values={[
+    { label: 'Ephemeral_storage', value: 'ephemeral_storage'},
+    { label: 'Memory', value: 'memory'},
+    { label: 'Pods', value: 'pods'},
+    { label: 'vcore', value: 'vcore'},
+  ]}>
+<TabItem value="ephemeral_storage">
+
+  ```json
+  yunikorn_scheduler_ephemeral_storage_node_usage_total
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(10%, 20%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(20%,30%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(30%,40%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(40%,50%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(50%,60%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(60%,70%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(70%,80%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(80%,90%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="(90%,100%]"} 0
+  yunikorn_scheduler_ephemeral_storage_node_usage_total{range="[0,10%]"} 1
+  ```
+
+</TabItem>
+<TabItem value="memory">
+
+  ```json
+  yunikorn_scheduler_memory_node_usage_total
+  yunikorn_scheduler_memory_node_usage_total{range="(10%, 20%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="(20%,30%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="(30%,40%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="(40%,50%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="(50%,60%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="(60%,70%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="(70%,80%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="(80%,90%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="(90%,100%]"} 0
+  yunikorn_scheduler_memory_node_usage_total{range="[0,10%]"} 1
+  ```
+
+</TabItem>
+<TabItem value="pods">
+
+  ```json
+  yunikorn_scheduler_pods_node_usage_total
+  yunikorn_scheduler_pods_node_usage_total{range="(10%, 20%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="(20%,30%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="(30%,40%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="(40%,50%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="(50%,60%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="(60%,70%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="(70%,80%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="(80%,90%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="(90%,100%]"} 0
+  yunikorn_scheduler_pods_node_usage_total{range="[0,10%]"} 1
+  ```
+
+</TabItem>
+<TabItem value="vcore">
+
+  ```json
+  yunikorn_scheduler_vcore_node_usage_total
+  yunikorn_scheduler_vcore_node_usage_total{range="(10%, 20%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="(20%,30%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="(30%,40%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="(40%,50%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="(50%,60%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="(60%,70%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="(70%,80%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="(80%,90%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="(90%,100%]"} 0
+  yunikorn_scheduler_vcore_node_usage_total{range="[0,10%]"} 1
+  ```
+  
+</TabItem>
+</Tabs>
+```
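The histogram samples in `docs/metrics/scheduler.mdx` above follow the usual Prometheus convention: each `le` bucket is cumulative (it counts every observation at or below that bound), and the `_sum` and `_count` series give the total and the number of observations. As a minimal sketch of how those numbers relate, the Python below takes the `level="app"` sample values from the diff, converts the cumulative buckets into per-bucket counts, and derives the mean latency. The function names are illustrative only and are not part of YuniKorn.

```python
import math

# Cumulative bucket counts copied from the level="app" sample:
# (upper bound of the bucket, number of observations <= that bound).
buckets = [
    (0.0001, 5),
    (0.001, 6),
    (0.01, 6),
    (0.1, 6),
    (1.0, 6),
    (10.0, 6),
    (math.inf, 6),
]
latency_sum = 0.00026345400000000004  # value of the _sum series
latency_count = 6                     # value of the _count series


def per_bucket_counts(cumulative):
    """Convert cumulative `le` bucket counts into counts per individual bucket."""
    counts, previous = [], 0
    for upper_bound, total in cumulative:
        counts.append((upper_bound, total - previous))
        previous = total
    return counts


def mean_latency(total, count):
    """Average observation, or None when nothing has been observed yet."""
    return total / count if count else None


individual = per_bucket_counts(buckets)
average = mean_latency(latency_sum, latency_count)
print(individual)
print(average)
```

Guarding against a zero count matters here because several of the sample histograms (for example `trynode` and `trypreemption`) report `_count 0`, where a plain division would fail.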
diff --git a/sidebars.js b/sidebars.js
index fd60b75f1e..fc859fa4ea 100644
--- a/sidebars.js
+++ b/sidebars.js
@@ -56,6 +56,15 @@ module.exports = {
                     'api/system'
                 ]
             },
+            {
+                type: 'category',
+                label: 'Metrics for Prometheus',
+                items: [
+                    'metrics/scheduler',
+                    'metrics/runtime',
+                    'metrics/queue',
+                ]
+            },
             'user_guide/troubleshooting'
         ],
         'Developer Guide': [


