The ipmi_watchdog is a hardware watchdog that the OS must poke periodically to keep happy. If the OS hangs or crashes and therefore stops poking it, the IPMI watchdog resets the system. It will not catch an individual daemon or process, like corosync, hanging or crashing while the rest of the system stays up.
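
To make that concrete: the "poke" is just a periodic write to the watchdog character device. Below is a minimal, untested sketch of the protocol (assuming the generic Linux /dev/watchdog interface; on Proxmox VE the device is actually held by the watchdog-mux daemon, not by a hand-rolled loop like this):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd < 0) {
        perror("open /dev/watchdog");
        return 1;
    }
    for (;;) {
        /* Each write restarts the hardware countdown (10 s in your
         * ipmitool output below). If the whole OS hangs, the writes
         * stop and the BMC hard-resets the box. If only corosync
         * dies, this loop keeps running and the timer never expires. */
        if (write(fd, "\0", 1) < 0)
            perror("write /dev/watchdog");
        sleep(1);
    }
}

As long as something keeps writing, the timer never fires, which is exactly why your node stayed up.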
On Wed, Aug 15, 2018 at 4:41 AM Dmitry Petuhov <[email protected]> wrote:
> A week ago, corosync suddenly crashed on one of my PVE nodes.
>
> -------------->8=========
> corosync[4701]: error [TOTEM ] FAILED TO RECEIVE
> corosync[4701]: [TOTEM ] FAILED TO RECEIVE
> corosync[4701]: notice [TOTEM ] A new membership (10.19.92.53:1992) was formed. Members left: 1 2 4
> corosync[4701]: notice [TOTEM ] Failed to receive the leave message. failed: 1 2 4
> corosync[4701]: [TOTEM ] A new membership (10.19.92.53:1992) was formed. Members left: 1 2 4
> corosync[4701]: [TOTEM ] Failed to receive the leave message. failed: 1 2 4
> corosync[4701]: notice [QUORUM] This node is within the non-primary component and will NOT provide any services.
> corosync[4701]: notice [QUORUM] Members[1]: 3
> corosync[4701]: notice [MAIN ] Completed service synchronization, ready to provide service.
> corosync[4701]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> corosync[4701]: [QUORUM] Members[1]: 3
> corosync[4701]: [MAIN ] Completed service synchronization, ready to provide service.
> kernel: [29187555.500409] dlm: closing connection to node 2
> corosync[4701]: notice [TOTEM ] A new membership (10.19.92.51:2000) was formed. Members joined: 1 2 4
> corosync[4701]: [TOTEM ] A new membership (10.19.92.51:2000) was formed. Members joined: 1 2 4
> corosync[4701]: notice [QUORUM] This node is within the primary component and will provide service.
> corosync[4701]: notice [QUORUM] Members[4]: 1 2 3 4
> corosync[4701]: notice [MAIN ] Completed service synchronization, ready to provide service.
> corosync[4701]: [QUORUM] This node is within the primary component and will provide service.
> corosync[4701]: notice [CFG ] Killed by node 1: dlm_controld
> corosync[4701]: error [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
> corosync[4701]: [QUORUM] Members[4]: 1 2 3 4
> corosync[4701]: [MAIN ] Completed service synchronization, ready to provide service.
> dlm_controld[688]: 29187298 daemon node 4 stateful merge
> dlm_controld[688]: 29187298 receive_start 4:6 add node with started_count 2
> dlm_controld[688]: 29187298 daemon node 1 stateful merge
> dlm_controld[688]: 29187298 receive_start 1:5 add node with started_count 4
> dlm_controld[688]: 29187298 daemon node 2 stateful merge
> dlm_controld[688]: 29187298 receive_start 2:17 add node with started_count 13
> corosync[4701]: [CFG ] Killed by node 1: dlm_controld
> corosync[4701]: [MAIN ] Corosync Cluster Engine exiting with status -1 at cfg.c:530.
> dlm_controld[688]: 29187298 cpg_dispatch error 2
> dlm_controld[688]: 29187298 process_cluster_cfg cfg_dispatch 2
> dlm_controld[688]: 29187298 cluster is down, exiting
> dlm_controld[688]: 29187298 process_cluster quorum_dispatch 2
> dlm_controld[688]: 29187298 daemon cpg_dispatch error 2
> systemd[1]: corosync.service: Main process exited, code=exited, status=255/n/a
> systemd[1]: corosync.service: Unit entered failed state.
> systemd[1]: corosync.service: Failed with result 'exit-code'.
> kernel: [29187556.903177] dlm: closing connection to node 4
> kernel: [29187556.906730] dlm: closing connection to node 3
> dlm_controld[688]: 29187298 abandoned lockspace hp-big-gfs
> kernel: [29187556.924279] dlm: dlm user daemon left 1 lockspaces
> -------------->8=========
>
> But the node did not reboot.
>
> I use WATCHDOG_MODULE=ipmi_watchdog.
> The watchdog is still running:
>
> -------------->8=========
> # ipmitool mc watchdog get
> Watchdog Timer Use: SMS/OS (0x44)
> Watchdog Timer Is: Started/Running
> Watchdog Timer Actions: Hard Reset (0x01)
> Pre-timeout interval: 0 seconds
> Timer Expiration Flags: 0x10
> Initial Countdown: 10 sec
> Present Countdown: 9 sec
> -------------->8=========
>
> The only service that is down is corosync.
>
> -------------->8=========
> # pveversion --verbose
> proxmox-ve: 5.0-21 (running kernel: 4.10.17-2-pve)
> pve-manager: 5.0-31 (running version: 5.0-31/27769b1f)
> pve-kernel-4.10.17-2-pve: 4.10.17-20
> pve-kernel-4.10.17-3-pve: 4.10.17-21
> libpve-http-server-perl: 2.0-6
> lvm2: 2.02.168-pve3
> corosync: 2.4.2-pve3
> libqb0: 1.0.1-1
> pve-cluster: 5.0-12
> qemu-server: 5.0-15
> pve-firmware: 2.0-2
> libpve-common-perl: 5.0-16
> libpve-guest-common-perl: 2.0-11
> libpve-access-control: 5.0-6
> libpve-storage-perl: 5.0-14
> pve-libspice-server1: 0.12.8-3
> vncterm: 1.5-2
> pve-docs: 5.0-9
> pve-qemu-kvm: 2.9.0-5
> pve-container: 2.0-15
> pve-firewall: 3.0-2
> pve-ha-manager: 2.0-2
> ksm-control-daemon: 1.2-2
> glusterfs-client: 3.8.8-1
> lxc-pve: 2.0.8-3
> lxcfs: 2.0.7-pve4
> criu: 2.11.1-1~bpo90
> novnc-pve: 0.6-4
> smartmontools: 6.5+svn4324-1
> zfsutils-linux: 0.6.5.11-pve17~bpo90
> gfs2-utils: 3.1.9-2
> openvswitch-switch: 2.7.0-2
> ceph: 12.2.0-pve1
> -------------->8=========
>
> I also have GFS2 in this cluster, which did not stop working after the corosync crash (which scares me most).
>
> Shouldn't the node reboot when corosync fails, and why does it keep running? Or does the node only reboot if it has HA VMs, and just stay as it is when there are only regular autostarted VMs and no HA machines present?
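
If you want a corosync crash on its own to take the node down, the watchdog cannot see it; something in userspace has to react to the failed unit. One heavy-handed, untested option is a systemd drop-in using FailureAction= (assuming your systemd version supports it in the [Unit] section; whether arming this on a production node is wise is another question):

# Hypothetical drop-in: /etc/systemd/system/corosync.service.d/failure.conf
[Unit]
# Hard-reboot the node whenever corosync.service enters failed state.
FailureAction=reboot-force

As for HA: as far as I know, the HA stack only self-fences nodes that carry active HA resources, so with only regular autostarted VMs the behaviour you saw is expected.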
