This is an automated email from the ASF dual-hosted git repository.

gongchao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hertzbeat.git


The following commit(s) were added to refs/heads/master by this push:
     new cb11f11ac [feature] add NVIDIA monitor (#2643)
cb11f11ac is described below

commit cb11f11ace5ceae9fce3ed67a4193c24d3771833
Author: Jast <[email protected]>
AuthorDate: Tue Sep 3 23:38:53 2024 +0800

    [feature] add NVIDIA monitor (#2643)
    
    Co-authored-by: aias00 <[email protected]>
    Co-authored-by: shown <[email protected]>
    Co-authored-by: Calvin <[email protected]>
    Co-authored-by: tomsun28 <[email protected]>
---
 home/docs/help/nvidia.md                           |  37 ++++
 .../current/help/nvidia.md                         |  37 ++++
 manager/src/main/resources/define/app-nvidia.yml   | 198 +++++++++++++++++++++
 3 files changed, 272 insertions(+)

diff --git a/home/docs/help/nvidia.md b/home/docs/help/nvidia.md
new file mode 100644
index 000000000..6cc3f0d2f
--- /dev/null
+++ b/home/docs/help/nvidia.md
@@ -0,0 +1,37 @@
+---
+id: nvidia  
+title: NVIDIA Monitoring  
+sidebar_label: NVIDIA Monitoring  
+keywords: [Open Source Monitoring System, NVIDIA Monitoring]
+---
+
+> Collect and monitor general performance metrics of NVIDIA operating systems.
+> NVIDIA monitoring requires the nvidia-smi command, which is installed 
together with the NVIDIA GPU driver. So when monitoring NVIDIA, we need to 
install the NVIDIA GPU driver.
+
+### Configuration Parameters
+
+| Parameter Name   | Description                                               
  |
+|------------------|-------------------------------------------------------------|
+| Monitoring Host  | The IP address (IPv4/IPv6) or domain name of the 
monitored endpoint. Note ⚠️ do not include protocol headers (e.g., https://, 
http://). |
+| Task Name        | The name identifying this monitoring task, which needs to 
be unique. |
+| Port             | The port exposed for Linux SSH, default is 22.            
   |
+| Username         | SSH connection username, optional.                        
   |
+| Password         | SSH connection password, optional.                        
   |
+| Collection Interval | Interval for periodically collecting monitoring data, 
in seconds. The minimum interval is 30 seconds. |
+| Probe Before Monitoring | Whether to probe the monitoring endpoint to check 
its availability before adding it. Monitoring is added or modified only if the 
probe succeeds. |
+| Description/Remarks | Additional notes and descriptions for this monitoring 
task. Users can add relevant information here. |
+
+### Collected Metrics
+
+#### Metric Set: basic
+
+| Metric Name            | Unit   | Description      |
+|------------------------|--------|------------------|
+| index                  | None   | GPU index        |
+| name                   | None   | GPU name         |
+| utilization.gpu[%]     | None   | GPU utilization  |
+| utilization.memory[%]  | None   | Memory utilization |
+| memory.total[MiB]      | MiB    | Total memory     |
+| memory.used[MiB]       | MiB    | Used memory      |
+| memory.free[MiB]       | MiB    | Free memory      |
+| temperature.gpu        | None   | GPU temperature  |
diff --git 
a/home/i18n/zh-cn/docusaurus-plugin-content-docs/current/help/nvidia.md 
b/home/i18n/zh-cn/docusaurus-plugin-content-docs/current/help/nvidia.md
new file mode 100644
index 000000000..8e3f190ad
--- /dev/null
+++ b/home/i18n/zh-cn/docusaurus-plugin-content-docs/current/help/nvidia.md
@@ -0,0 +1,37 @@
+---
+id: nvidia  
+title: 监控:NVIDIA 监控      
+sidebar_label: NVIDIA 监控      
+keywords: [开源监控系统, NVIDIA监控]
+---
+
+> 对 NVIDIA 操作系统的通用性能指标进行采集监控。
+> NVIDIA 监控需要用到 nvidia-smi 命令,nvidia-smi 是与 NVIDIA GPU 驱动程序一起安装的。所以在监控 NVIDIA 
时,我们需要安装 NVIDIA GPU 驱动程序。
+
+### 配置参数
+
+|  参数名称  | 参数帮助描述                                                  |
+|--------|---------------------------------------------------------|
+| 监控Host | 被监控的对端 IPV4,IPV6 或 域名。注意⚠️不带协议头(eg: https://, http://)。 |
+| 任务名称   | 标识此监控的名称,名称需要保证唯一性。                                     |
+| 端口     | Linux SSH 对外提供的端口,默认为22。                                |
+| 用户名    | SSH 连接用户名,可选                                            |
+| 密码     | SSH 连接密码,可选                                             |
+| 采集间隔   | 监控周期性采集数据间隔时间,单位秒,可设置的最小间隔为30秒                          |
+| 是否探测   | 新增监控前是否先探测检查监控可用性,探测成功才会继续新增修改操作                        |
+| 描述备注   | 更多标识和描述此监控的备注信息,用户可以在这里备注信息                             |
+
+### 采集指标
+
+#### 指标集合:basic
+
+| 指标名称               | 指标单位 | 指标帮助描述 |
+|--------------------|------|--------|
+| index              | 无    | 显卡索引   |
+| name     | 无    | 显卡名称 |
+| utilization.gpu[%]    | 无    | GPU利用率 |
+| utilization.memory[%] | 无    | 显存利用率 |
+| memory.total[MiB]       | 无    | 总显存 |
+| memory.used[MiB]        | 无    | 已用显存 |
+| memory.free[MiB]        | 无    | 空闲显存 |
+| temperature.gpu    | 无    | 显卡温度 |
diff --git a/manager/src/main/resources/define/app-nvidia.yml 
b/manager/src/main/resources/define/app-nvidia.yml
new file mode 100644
index 000000000..583f2189b
--- /dev/null
+++ b/manager/src/main/resources/define/app-nvidia.yml
@@ -0,0 +1,198 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# The monitoring type category:service-application service monitoring 
db-database monitoring custom-custom monitoring os-operating system monitoring
+category: server
+# The monitoring type eg: linux windows tomcat mysql aws...
+app: nvidia
+# The monitoring i18n name
+name:
+  zh-CN: NVIDIA
+  en-US: NVIDIA
+# The description and help of this monitoring type
+help:
+  zh-CN: Hertzbeat 使用 <a class='help_module_content' 
href='https://hertzbeat.apache.org/docs/advanced/extend-ssh'> SSH 协议</a> 对 
NVIDIA GPU显卡的通用性能指标进行采集监控。<br>您可以点击“<i>新建 
NVIDIA</i>”并配置HOST端口账户等相关参数进行添加,支持SSH账户密码或密钥认证。或者选择“<i>更多操作</i>”,导入已有配置。
+  en-US: Hertzbeat uses the <a class='help_module_content' 
href='https://hertzbeat.apache.org/docs/advanced/extend-ssh'>SSH protocol</a> 
to collect and monitor general performance metrics of NVIDIA GPUs. <br> You can 
click " <i>Create NVIDIA</i> " to add and configure parameters such as HOST, 
port, account, etc., supporting SSH account password or key authentication. 
Alternatively, you can select " <i>More Actions</i> " to import an existing 
configuration.
+  zh-TW: Hertzbeat 使用 <a class='help_module_content' 
href='https://hertzbeat.apache.org/docs/advanced/extend-ssh'>SSH 協議</a> 對 
NVIDIA GPU 顯卡的通用性能指標進行採集監控。<br>您可以點擊“<i>新建 NVIDIA</i>”並配置 
HOST、端口、帳戶等相關參數進行添加,支持 SSH 帳戶密碼或密鑰認證。或者選擇“<i>更多操作</i>”,導入已有配置。
+helpLink:
+  zh-CN: https://hertzbeat.apache.org/zh-cn/docs/help/nvidia/
+  en-US: https://hertzbeat.apache.org/docs/help/nvidia/
+# Input params define for monitoring(render web ui by the definition)
+params:
+  # field-param field key
+  - field: host
+    # name-param field display i18n name
+    name:
+      zh-CN: 目标Host
+      en-US: Target Host
+    # type-param field type(most mapping the html input type)
+    type: host
+    # required-true or false
+    required: true
+  # field-param field key
+  - field: port
+    # name-param field display i18n name
+    name:
+      zh-CN: 端口
+      en-US: Port
+    # type-param field type(most mapping the html input type)
+    type: number
+    # when type is number, range is required
+    range: '[0,65535]'
+    # required-true or false
+    required: true
+    # default value
+    defaultValue: 22
+  # field-param field key
+  - field: timeout
+    # name-param field display i18n name
+    name:
+      zh-CN: 超时时间(ms)
+      en-US: Timeout(ms)
+    # type-param field type(most mapping the html input type)
+    type: number
+    # when type is number, range is required
+    range: '[400,200000]'
+    # required-true or false
+    required: false
+    # default value
+    # 默认值
+    defaultValue: 6000
+  # field-param field key
+  - field: reuseConnection
+    # name-param field display i18n name
+    name:
+      zh-CN: 复用连接
+      en-US: Reuse Connection
+    # type-param field type(most mapping the html input type)
+    type: boolean
+    # required-true or false
+    required: true
+    defaultValue: false
+  # field-param field key
+  - field: username
+    # name-param field display i18n name
+    name:
+      zh-CN: 用户名
+      en-US: Username
+    # type-param field type(most mapping the html input type)
+    type: text
+    # when type is text, use limit to limit string length
+    limit: 50
+    # required-true or false
+    required: true
+  # field-param field key
+  - field: password
+    # name-param field display i18n name
+    name:
+      zh-CN: 密码
+      en-US: Password
+    # type-param field type(most mapping the html input tag)
+    type: password
+    # required-true or false
+    required: false
+  # field-param field key
+  - field: privateKey
+    # name-param field display i18n name
+    name:
+      zh-CN: 私钥
+      en-US: PrivateKey
+    # type-param field type(most mapping the html input type)
+    type: textarea
+    placeholder: -----BEGIN RSA PRIVATE KEY-----
+    # required-true or false
+    required: false
+    # hide param-true or false
+    hide: true
+# collect metrics config list
+metrics:
+  # metrics - basic, inner monitoring metrics (responseTime - response time)
+  - name: basic
+    i18n:
+      zh-CN: 显卡基本信息
+      en-US: Basic Information
+    # metrics scheduling priority(0->127)->(high->low), metrics with the same 
priority will be scheduled in parallel
+    # priority 0's metrics is availability metrics, it will be scheduled 
first, only availability metrics collect success will the scheduling continue
+    priority: 0
+    # collect metrics content
+    fields:
+      # field-metric name, type-metric type(0-number,1-string), unit-metric 
unit('%','ms','MB'), label-whether it is a metrics label field
+      - field: index
+        type: 1
+        label: true
+        i18n:
+          zh-CN: 显卡索引
+          en-US: Host Name
+      - field: name
+        type: 1
+        i18n:
+          zh-CN: 显卡名称
+          en-US: System Version
+      - field: utilization.gpu [%]
+        type: 0
+        unit: '%'
+        i18n:
+          zh-CN: GPU利用率
+          en-US: GPU Utilization
+      - field: utilization.memory [%]
+        type: 0
+        unit: '%'
+        i18n:
+          zh-CN: 显存利用率
+          en-US: Memory Utilization
+      - field: memory.total [MiB]
+        type: 1
+        unit: 'MiB'
+        i18n:
+          zh-CN: 总显存
+          en-US: Total Memory
+      - field: memory.used [MiB]
+        type: 0
+        unit: 'MiB'
+        i18n:
+          zh-CN: 已用显存
+          en-US: Used Memory
+      - field: memory.free [MiB]
+        type: 0
+        unit: 'MiB'
+        i18n:
+          zh-CN: 空闲显存
+          en-US: Free Memory
+      - field: temperature.gpu
+        type: 1
+        unit: '°C'
+        i18n:
+          zh-CN: 显卡温度
+          en-US: GPU Temperature
+    # the protocol used for monitoring, eg: sql, ssh, http, telnet, wmi, snmp, 
sdk
+    protocol: ssh
+    # the config content when protocol is ssh
+    ssh:
+      # ssh host: ipv4 ipv6 domain
+      host: ^_^host^_^
+      # ssh port
+      port: ^_^port^_^
+      # ssh username
+      username: ^_^username^_^
+      # ssh password
+      password: ^_^password^_^
+      # ssh private key
+      privateKey: ^_^privateKey^_^
+      timeout: ^_^timeout^_^
+      reuseConnection: ^_^reuseConnection^_^
+      # ssh run collect script
+      script: nvidia-smi 
--query-gpu=index,name,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free,temperature.gpu
 --format=csv,nounits | sed 's/ *, */,/g' | sed 's/ / /g' | sed 's/,/ /g'
+      # ssh response data parse type: oneRow, multiRow
+      parseType: multiRow


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to