Re: [I] linux 进程死亡不推送报警 (No alert pushed when a Linux process dies) [hertzbeat]

via GitHub Sat, 09 Nov 2024 04:16:37 -0800


LiuTianyou commented on issue #2806:
URL: https://github.com/apache/hertzbeat/issues/2806#issuecomment-2466192848


   Hi, if you are using the default Linux process monitoring template, you can 
replace it with the one below. It fixes the missing alarm when the monitored 
process exits abnormally:
   ```yml
   # Licensed to the Apache Software Foundation (ASF) under one or more
   # contributor license agreements.  See the NOTICE file distributed with
   # this work for additional information regarding copyright ownership.
   # The ASF licenses this file to You under the Apache License, Version 2.0
   # (the "License"); you may not use this file except in compliance with
   # the License.  You may obtain a copy of the License at
   #
   #     http://www.apache.org/licenses/LICENSE-2.0
   #
   # Unless required by applicable law or agreed to in writing, software
   # distributed under the License is distributed on an "AS IS" BASIS,
   # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   # See the License for the specific language governing permissions and
   # limitations under the License.
   
   # The monitoring type category: service-application service monitoring db-database monitoring custom-custom monitoring os-operating system monitoring
   category: program
   # The monitoring type eg: linux windows tomcat mysql aws...
   app: process
   # The monitoring i18n name
   name:
     zh-CN: Linux进程
     en-US: Linux Process
   # The description and help of this monitoring type
   help:
     zh-CN: Hertzbeat 使用 <a class='help_module_content' href='https://hertzbeat.apache.org/docs/advanced/extend-ssh'> SSH 协议</a> 对 Linux系统进程进行监控，支持根据进程名称(或部分名称)匹配进行监控，支持进程的 CPU使用率、内存使用率、物理内存、IO 等监控。<br>您可以点击“<i>新建 Linux进程</i>”并配置HOST端口账户等相关参数进行添加，支持SSH账户密码或密钥认证。或者选择“<i>更多操作</i>”，导入已有配置。
     en-US: Hertzbeat uses <a class='help_module_content' href='https://hertzbeat.apache.org/docs/advanced/extend-ssh'>SSH protocol</a> to monitor Linux system processes. It supports monitoring based on process names (or partial names), and provides monitoring for CPU usage, memory usage, physical memory, IO, and more. <br> You can click on "<i>New Linux Process</i>" and configure HOST port, account, and other related parameters for addition. It supports SSH account password or key authentication. Alternatively, you can choose "<i>More Actions</i>" to import existing configurations.
     zh-TW: Hertzbeat 使用 <a class='help_module_content' href='https://hertzbeat.apache.org/docs/advanced/extend-ssh'>SSH 協議</a> 對 Linux 系統進程進行監控，支持根據進程名稱（或部分名稱）匹配進行監控，支持進程的 CPU 使用率、內存使用率、物理內存、IO 等監控。<br>您可以點擊“新建 Linux 進程”，並配置 HOST 端口、帳戶等相關參數進行添加，支持 SSH 帳戶密碼或金鑰認證。或者選擇“更多操作”，導入已有配置。
   helpLink:
     zh-CN: https://hertzbeat.apache.org/zh-cn/docs/help/process/
     en-US: https://hertzbeat.apache.org/docs/help/process/
   # Input params define for monitoring(render web ui by the definition)
   params:
     # field-param field key
     - field: host
       # name-param field display i18n name
       name:
         zh-CN: 目标Host
         en-US: Target Host
       # type-param field type(most mapping the html input type)
       type: host
       # required-true or false
       required: true
     # field-param field key
     - field: port
       # name-param field display i18n name
       name:
         zh-CN: 端口
         en-US: Port
       # type-param field type(most mapping the html input type)
       type: number
       # when type is number, range is required
       range: '[0,65535]'
       # required-true or false
       required: true
       # default value
       defaultValue: 22
     # field-param field key
     - field: timeout
       # name-param field display i18n name
       name:
         zh-CN: 超时时间(ms)
         en-US: Timeout(ms)
       # type-param field type(most mapping the html input type)
       type: number
       # when type is number, range is required
       range: '[400,200000]'
       # required-true or false
       required: false
       # default value
       defaultValue: 6000
     # field-param field key
     - field: reuseConnection
       # name-param field display i18n name
       name:
         zh-CN: 复用连接
         en-US: Reuse Connection
       # type-param field type(most mapping the html input type)
       type: boolean
       # required-true or false
       required: true
       defaultValue: true
     # field-param field key
     - field: username
       # name-param field display i18n name
       name:
         zh-CN: 用户名
         en-US: Username
       # type-param field type(most mapping the html input type)
       type: text
       # when type is text, use limit to limit string length
       limit: 50
       # required-true or false
       required: true
     # field-param field key
     - field: password
       # name-param field display i18n name
       name:
         zh-CN: 密码
         en-US: Password
       # type-param field type(most mapping the html input tag)
       type: password
       # required-true or false
       required: false
     # field-param field key
     - field: privateKey
       # name-param field display i18n name
       name:
         zh-CN: 私钥
         en-US: PrivateKey
       # type-param field type(most mapping the html input type)
       type: textarea
       placeholder: -----BEGIN RSA PRIVATE KEY-----
       # required-true or false
       required: false
       # hide param-true or false
       hide: true
     - field: process_name
       # name-param field display i18n name
       name:
         zh-CN: 进程名称
         en-US: process_name
       # type-param field type(most mapping the html input type)
       type: text
       # when type is text, use limit to limit string length
       limit: 100
       # required-true or false
       required: true
   # collect metrics config list
   metrics:
     # metrics - basic, inner monitoring metrics (responseTime - response time)
     - name: basic
       i18n:
         zh-CN: 进程基本信息
         en-US: Basic Info
       # metrics scheduling priority(0->127)->(high->low), metrics with the same priority will be scheduled in parallel
       # priority 0's metrics is availability metrics, it will be scheduled first, only availability metrics collect success will the scheduling continue
       priority: 0
       # collect metrics content
   
       fields:
         # field-metric name, type-metric type(0-number,1-string), unit-metric unit('%','ms','MB'), label-whether it is a metrics label field
         - field: pid
           type: 1
           label: true
           i18n:
             zh-CN: 进程ID
             en-US: PID
         - field: user
           type: 1
           label: true
           i18n:
             zh-CN: 用户
             en-US: User
         - field: cpu
           type: 0
           i18n:
             zh-CN: CPU使用率
             en-US: CPU
         - field: mem
           type: 0
           i18n:
             zh-CN: 内存使用率
             en-US: MEM
         - field: rss
           type: 0
           unit: MB
           i18n:
             zh-CN: 物理内存
             en-US: rss
         - field: cmd
           type: 1
           i18n:
             zh-CN: 运行命令
             en-US: cmd
       units:
         - rss=KB->MB
       # the protocol used for monitoring, eg: sql, ssh, http, telnet, wmi, snmp, sdk
       protocol: ssh
       # the config content when protocol is ssh
       ssh:
         # ssh host: ipv4 ipv6 domain
         host: ^_^host^_^
         # ssh port
         port: ^_^port^_^
         # ssh username
         username: ^_^username^_^
         # ssh password
         password: ^_^password^_^
         # ssh private key
         privateKey: ^_^privateKey^_^
         timeout: ^_^timeout^_^
         reuseConnection: ^_^reuseConnection^_^
         # ssh run collect script
         # ssh response data parse type: oneRow, multiRow
         script: output=$(ps -ef|grep '^_^process_name^_^'|grep -v grep); [ -n "$output" ] && ps -eo pid,user,%cpu,%mem,rss,cmd | grep -v grep | grep '^_^process_name^_^' | awk 'BEGIN {print "pid user cpu mem rss cmd"} {cmd=substr($0, index($0, $6)); gsub(/ /, " ", cmd); print $1, $2, $3, $4, $5, cmd}'
         parseType: multiRow
     - name: mem
       i18n:
         zh-CN: 内存使用信息
         en-US: MEM
       # metrics scheduling priority(0->127)->(high->low), metrics with the same priority will be scheduled in parallel
       # priority 0's metrics is availability metrics, it will be scheduled first, only availability metrics collect success will the scheduling continue
   
       priority: 1
       # collect metrics content
   
       fields:
         # field-metric name, type-metric type(0-number,1-string), unit-metric unit('%','ms','MB'), label-whether it is a metrics label field
   
         - field: pid
           type: 1
           label: true
           i18n:
             zh-CN: 进程ID
             en-US: PID
         - field: metric
           type: 1
           i18n:
             zh-CN: 详细监控指标
             en-US: detail
       # the protocol used for monitoring, eg: sql, ssh, http, telnet, wmi, snmp, sdk
       protocol: ssh
       # the config content when protocol is ssh
       ssh:
         # ssh host: ipv4 ipv6 domain
         host: ^_^host^_^
         # ssh port
         port: ^_^port^_^
         # ssh username
         username: ^_^username^_^
         # ssh password
         password: ^_^password^_^
         # ssh private key
         privateKey: ^_^privateKey^_^
         timeout: ^_^timeout^_^
         reuseConnection: ^_^reuseConnection^_^
         # ssh run collect script
         # ssh response data parse type: oneRow, multiRow
         script: echo "pid metric" ; ps -eo pid,cmd | grep -v grep | grep '^_^process_name^_^' | awk '{cmd=substr($0, index($0, $3)); gsub(/ /, " ", cmd); print $1, cmd}' | while read pid _; do cat "/proc/$pid/status" | sed 's/ / /g'| grep Vm | sed -e "s/VmPeak:/虚拟内存峰值:/g" -e "s/VmSize:/当前虚拟内存使用:/g" -e "s/VmLck:/锁定内存:/g" -e "s/VmPin:/固定内存:/g" -e "s/VmHWM:/物理内存峰值:/g" -e "s/VmRSS:/当前物理内存使用:/g" -e "s/VmData:/数据段大小:/g" -e "s/VmStk:/堆栈大小:/g" -e "s/VmExe:/代码大小:/g" -e "s/VmLib:/共享库大小:/g" -e "s/VmPTE:/页表项大小:/g" -e "s/VmSwap:/交换空间使用:/g" | sed "s/^/$pid /" ; done
         parseType: multiRow
   
     - name: other
       i18n:
         zh-CN: 其他监控信息
         en-US: Other
       # metrics scheduling priority(0->127)->(high->low), metrics with the same priority will be scheduled in parallel
       # priority 0's metrics is availability metrics, it will be scheduled first, only availability metrics collect success will the scheduling continue
       priority: 1
       # collect metrics content
       fields:
         # field-metric name, type-metric type(0-number,1-string), unit-metric unit('%','ms','MB'), label-whether it is a metrics label field
         - field: pid
           type: 1
           label: true
           i18n:
             zh-CN: 进程ID
             en-US: PID
         - field: path
           type: 1
           i18n:
             zh-CN: 执行路径
             en-US: path
         - field: date
           type: 1
           i18n:
             zh-CN: 启动时间
             en-US: date
         - field: fd_count
           type: 1
           i18n:
             zh-CN: 打开文件描述符数量
             en-US: fd_count
       # the protocol used for monitoring, eg: sql, ssh, http, telnet, wmi, snmp, sdk
       protocol: ssh
       # the config content when protocol is ssh
       ssh:
         # ssh host: ipv4 ipv6 domain
         host: ^_^host^_^
         # ssh port
         port: ^_^port^_^
         # ssh username
         username: ^_^username^_^
         # ssh password
         password: ^_^password^_^
         # ssh private key
         privateKey: ^_^privateKey^_^
         timeout: ^_^timeout^_^
         reuseConnection: ^_^reuseConnection^_^
         # ssh run collect script
         # ssh response data parse type: oneRow, multiRow
         script: echo "pid path date fd_count" ; ps -eo pid,cmd | grep -v grep | grep '^_^process_name^_^' | awk '{cmd=substr($0, index($0, $3)); gsub(/ /, " ", cmd); print $1, cmd}' | while read pid _; do cwd=$(readlink -f "/proc/$pid/cwd"); start_time=$(ps -p $pid -o lstart | tail -n 1 | sed 's/ / /g'); fd_count=$(ls -l /proc/$pid/fd/ 2>/dev/null | wc -l); echo "$pid $cwd $start_time $fd_count"; done
         parseType: multiRow
   
     - name: io
       i18n:
         zh-CN: IO
         en-US: IO
       # metrics scheduling priority(0->127)->(high->low), metrics with the same priority will be scheduled in parallel
       # priority 0's metrics is availability metrics, it will be scheduled first, only availability metrics collect success will the scheduling continue
       priority: 1
       # collect metrics content
       fields:
         # field-metric name, type-metric type(0-number,1-string), unit-metric unit('%','ms','MB'), label-whether it is a metrics label field
         - field: pid
           type: 1
           label: true
           i18n:
             zh-CN: 进程ID
             en-US: PID
         - field: metric
           type: 1
           i18n:
             zh-CN: 监控指标名称
             en-US: metric
         - field: value
           type: 1
           i18n:
             zh-CN: 监控指标值
             en-US: value
       # the protocol used for monitoring, eg: sql, ssh, http, telnet, wmi, snmp, sdk
       protocol: ssh
       # the config content when protocol is ssh
       ssh:
         # ssh host: ipv4 ipv6 domain
         host: ^_^host^_^
         # ssh port
         port: ^_^port^_^
         # ssh username
         username: ^_^username^_^
         # ssh password
         password: ^_^password^_^
         # ssh private key
         privateKey: ^_^privateKey^_^
         timeout: ^_^timeout^_^
         reuseConnection: ^_^reuseConnection^_^
         # ssh run collect script
         # ssh response data parse type: oneRow, multiRow
         script: echo "pid metric value" ; ps -eo pid,cmd | grep -v grep | grep '^_^process_name^_^' | awk '{cmd=substr($0, index($0, $3)); gsub(/ /, " ", cmd); print $1, cmd}' | while read pid _; do cat "/proc/$pid/io" | sed -e "s/rchar:/rchar(进程从磁盘或其他文件读取的总字节数):/g" -e "s/wchar:/wchar(进程写入到磁盘或其他文件的总字节数):/g" -e "s/syscr:/syscr(进程发起的读取操作的次数):/g" -e "s/syscw:/syscw(进程发起的写入操作的次数):/g" -e "s/read_bytes:/read_bytes(进程从磁盘实际读取的字节数):/g" -e "s/write_bytes:/write_bytes(进程写入到磁盘的实际字节数):/g" -e "s/cancelled_write_bytes:/cancelled_write_bytes(进程写入但被取消的字节数):/g" | sed "s/^/$pid /" ; done
         parseType: multiRow
   ```
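
   For anyone who wants to try the priority-0 `basic` check by hand before 
importing the template, here is a minimal sketch of the same guard as a 
standalone shell function. `check_process` is a hypothetical name used only for 
this illustration (the template inlines the logic in `script:`), and the 
`gsub` step is left out for brevity. The sketch only shows that the script 
prints nothing and exits non-zero when no process matches; that HertzBeat then 
treats the availability metric as failed is my reading of the priority-0 
comment in the template, not something I have verified in the collector code.

```shell
#!/bin/sh
# Sketch of the template's "basic" collect script as a function.
# check_process is a hypothetical helper name, not part of HertzBeat.
check_process() {
  pattern=$1
  # Guard: look for matching processes, excluding the grep itself.
  output=$(ps -ef | grep -- "$pattern" | grep -v grep)
  # No match: print nothing and return a non-zero status.
  [ -n "$output" ] || return 1
  # Match: print the header row, then one row per matching process.
  ps -eo pid,user,%cpu,%mem,rss,cmd | grep -v grep | grep -- "$pattern" |
    awk 'BEGIN {print "pid user cpu mem rss cmd"}
         {cmd = substr($0, index($0, $6)); print $1, $2, $3, $4, $5, cmd}'
}
```

   Calling `check_process sshd` on a host where sshd is running should print the 
header plus the matching rows. For a name that matches nothing, it prints 
nothing and `$?` is non-zero.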


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
