Sorry for posting twice; I wasn't sure I used the correct mailing list.
>
> Hey Folks,
>
> Ran into an interesting problem today. The topology is a 4-node cluster on
> v5.2 that has been up and running without issues for about 10 months.
> Several VMs run on nodes 1, 2, and 3, each maybe 30% loaded. I keep node 4
> empty for HA failover and for testing newly built VMs. All VMs are set up
> for HA and to auto-start on node reboot.
>
> Node 1 rebooted with no indication why; nothing in the logs identifies the
> cause. When node 1 came back up, its clock was 1 hour ahead of the actual
> time, none of the VMs would start, and HA did not start the VMs on the
> empty node 4.
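>
> For anyone who wants to run the same check, the system clock, hardware
> clock, and time sync state can be compared with standard commands. Just a
> sketch, assuming the stock systemd-timesyncd setup:
>
> timedatectl                          # system clock, RTC, and NTP sync flag
> hwclock --show                       # read the hardware clock directly
> systemctl status systemd-timesyncd   # the time sync service itself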
>
> Syslog:
> Jul 1 14:42:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul 1 14:42:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul 1 14:42:20 vmpve01 pvedaemon[18577]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:43:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul 1 14:43:00 vmpve01 pvedaemon[18577]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:43:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul 1 14:43:40 vmpve01 pvedaemon[32611]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:43:50 vmpve01 pveproxy[9369]: worker exit
> Jul 1 14:43:50 vmpve01 pveproxy[2395]: worker 9369 finished
> Jul 1 14:43:50 vmpve01 pveproxy[2395]: starting 1 worker(s)
> Jul 1 14:43:50 vmpve01 pveproxy[2395]: worker 21373 started
> Jul 1 14:44:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul 1 14:44:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul 1 14:44:03 vmpve01 systemd[1]: Time has been changed
> Jul 1 14:44:03 vmpve01 systemd[1]: apt-daily-upgrade.timer: Adding 51min
> 27.103008s random time.
> Jul 1 14:44:03 vmpve01 systemd[1]: pve-daily-update.timer: Adding 49min
> 46.847880s random time.
> Jul 1 14:44:03 vmpve01 systemd[1]: apt-daily.timer: Adding 8h 36min
> 11.274821s random time.
> Jul 1 14:44:13 vmpve01 pvedaemon[14464]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:44:54 vmpve01 pvedaemon[32611]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:45:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul 1 14:45:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> ****NODE REBOOTED****
> Jul 1 15:45:38 vmpve01 systemd[1]: Started Load/Save Random Seed.
> Jul 1 15:45:38 vmpve01 systemd[1]: Started Flush Journal to Persistent
> Storage.
> Jul 1 15:45:38 vmpve01 systemd-modules-load[408]: Inserted module 'iscsi_tcp'
> Jul 1 15:45:38 vmpve01 systemd[1]: Mounted RPC Pipe File System.
> Jul 1 15:45:38 vmpve01 systemd[1]: Started udev Coldplug all Devices.
> …………..
> *****All normal logging as the cluster starts up, then I start getting
> these recurring errors*****
> Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local-vmstor01: -1
> Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local: -1
> Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local-lvm: -1
> Jul 1 15:03:23 vmpve01 rrdcached[2114]: queue_thread_main: rrd_update_r
> (/var/lib/rrdcached/db/pve2-node/vmpve01) failed with status -1.
> (/var/lib/rrdcached/db/pve2-node/vmpve01: illegal attempt to update using
> time 1562010251 when last update time is 1562011399 (minimum one second
> step))
> Jul 1 15:03:23 vmpve01 systemd[1]: Starting Cleanup of Temporary
> Directories...
> Jul 1 15:03:23 vmpve01 systemd[1]: Started Cleanup of Temporary Directories.
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-node/vmpve01: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/105: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/104: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/106: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/102: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/103: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/107: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/101: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local-lvm: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local-vmstor01: -1
> Jul 1 15:03:37 vmpve01 pvedaemon[2366]: <root@pam> starting task
> UPID:vmpve01:00001600:00016749:5D1A6719:hastart:107:root@pam:
> Jul 1 15:03:37 vmpve01 pmxcfs[2185]: [main] notice: ignore duplicate
> Jul 1 15:03:38 vmpve01 pvedaemon[2366]: <root@pam> end task
> UPID:vmpve01:00001600:00016749:5D1A6719:hastart:107:root@pam: OK
> Jul 1 15:03:38 vmpve01 pmxcfs[2185]: [main] notice: ignore duplicate
> Jul 1 15:03:39 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-node/vmpve01: -1
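>
> The rrdcached line above explains these errors: while the clock was an
> hour ahead, RRD updates were recorded with future timestamps, so once the
> clock stepped back, every new update was rejected as older than the last
> one. A workaround I have seen suggested, in case the errors do not clear
> on their own (just a sketch; PVE regenerates these files), is to move the
> stale database aside:
>
> systemctl stop rrdcached
> mv /var/lib/rrdcached/db /var/lib/rrdcached/db.stale   # files regenerate
> systemctl start rrdcached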
>
> After the system time caught up with the actual current time, the RRDC
> errors went away. Unless I missed it, there was no log indication of any
> attempt to restart the VMs. I manually tried to start the VMs, but none on
> node 1 would start. After the system time synchronized with the hardware
> clock, I rebooted the node again and could then start all the VMs on node
> 1. After the manual reboot I waited about 10 minutes; time was in sync,
> but the cluster still did not auto-start node 1's VMs, nor did HA start
> them on node 4.
>
> The cluster is normal at this time:
>
> root@vmpve01:/var/log# pvecm status
> Quorum information
> ------------------
> Date: Mon Jul 1 21:12:29 2019
> Quorum provider: corosync_votequorum
> Nodes: 4
> Node ID: 0x00000001
> Ring ID: 1/80
> Quorate: Yes
>
> Votequorum information
> ----------------------
> Expected votes: 4
> Highest expected: 4
> Total votes: 4
> Quorum: 3
> Flags: Quorate
>
> Membership information
> ----------------------
> Nodeid Votes Name
> 0x00000001 1 10.99.96.191 (local)
> 0x00000002 1 10.99.96.192
> 0x00000003 1 10.99.96.193
> 0x00000004 1 10.99.96.194
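>
> Since quorum looks healthy, the HA services themselves seem like the next
> place to look (a sketch; both units run on every PVE node):
>
> systemctl status pve-ha-crm pve-ha-lrm   # cluster and local resource managers
> cat /etc/pve/ha/resources.cfg            # confirm the VMs are still HA resources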
>
> I was just about to roll into Proxmox updates this week, too. Any
> suggestions on diagnosing the root cause, not of the node reboot, but of
> why the VMs would not HA-start when the node came up an hour in the
> future?
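>
> So far my plan is to start with the LRM and CRM logs around the reboot.
> A sketch; "-b -1" only works with a persistent journal, otherwise grep
> the rotated syslogs:
>
> journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm     # HA units, previous boot
> zgrep -iE 'ha-(lrm|crm)|fence|watchdog' /var/log/syslog*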
>
> Thanks.
>
> JR
>
> JR Richardson
> Engineering for the Masses
> Chasing the Azeotrope
>
_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user