Sorry for posting twice; I wasn't sure I used the correct mailing list.
>
> Hey Folks,
>
> Ran into an interesting problem today. The topology is a 4-node cluster on
> v5.2 that has been up and running without issues for about 10 months.
> Several VMs run on nodes 1, 2, and 3, each maybe 30% loaded. I keep node 4
> empty for HA failover and for testing newly built VMs. All VMs are set up
> for HA and to auto-start on node reboot.
>
> Node 1 rebooted with no indication why; nothing in the logs identifies the
> cause. When node 1 came back up, its clock was 1 hour ahead of the actual
> time, none of the VMs would start, and HA did not start the VMs on the
> empty node 4.
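>
> For anyone who wants to run the same check, the system clock, hardware
> clock, and time sync state can be compared with standard commands. Just a
> sketch, assuming the stock systemd-timesyncd setup:
>
> timedatectl                          # system clock, RTC, and NTP sync flag
> hwclock --show                       # read the hardware clock directly
> systemctl status systemd-timesyncd   # the time sync service itself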
>
> Syslog:
> Jul 1 14:42:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul 1 14:42:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul 1 14:42:20 vmpve01 pvedaemon[18577]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:43:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul 1 14:43:00 vmpve01 pvedaemon[18577]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:43:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul 1 14:43:40 vmpve01 pvedaemon[32611]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:43:50 vmpve01 pveproxy[9369]: worker exit
> Jul 1 14:43:50 vmpve01 pveproxy[2395]: worker 9369 finished
> Jul 1 14:43:50 vmpve01 pveproxy[2395]: starting 1 worker(s)
> Jul 1 14:43:50 vmpve01 pveproxy[2395]: worker 21373 started
> Jul 1 14:44:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul 1 14:44:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> Jul 1 14:44:03 vmpve01 systemd[1]: Time has been changed
> Jul 1 14:44:03 vmpve01 systemd[1]: apt-daily-upgrade.timer: Adding 51min
> 27.103008s random time.
> Jul 1 14:44:03 vmpve01 systemd[1]: pve-daily-update.timer: Adding 49min
> 46.847880s random time.
> Jul 1 14:44:03 vmpve01 systemd[1]: apt-daily.timer: Adding 8h 36min
> 11.274821s random time.
> Jul 1 14:44:13 vmpve01 pvedaemon[14464]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:44:54 vmpve01 pvedaemon[32611]: <root@pam> successful auth for user
> 'NOC@pve'
> Jul 1 14:45:00 vmpve01 systemd[1]: Starting Proxmox VE replication runner...
> Jul 1 14:45:01 vmpve01 systemd[1]: Started Proxmox VE replication runner.
> ****NODE REBOOTED****
> Jul 1 15:45:38 vmpve01 systemd[1]: Started Load/Save Random Seed.
> Jul 1 15:45:38 vmpve01 systemd[1]: Started Flush Journal to Persistent
> Storage.
> Jul 1 15:45:38 vmpve01 systemd-modules-load[408]: Inserted module 'iscsi_tcp'
> Jul 1 15:45:38 vmpve01 systemd[1]: Mounted RPC Pipe File System.
> Jul 1 15:45:38 vmpve01 systemd[1]: Started udev Coldplug all Devices.
> …………..
> *****All normal logging as the cluster starts up, then I start getting
> these recurring errors*****
> Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local-vmstor01: -1
> Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local: -1
> Jul 1 15:03:19 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local-lvm: -1
> Jul 1 15:03:23 vmpve01 rrdcached[2114]: queue_thread_main: rrd_update_r
> (/var/lib/rrdcached/db/pve2-node/vmpve01) failed with status -1.
> (/var/lib/rrdcached/db/pve2-node/vmpve01: illegal attempt to update using
> time 1562010251 when last update time is 1562011399 (minimum one second
> step))
> Jul 1 15:03:23 vmpve01 systemd[1]: Starting Cleanup of Temporary
> Directories...
> Jul 1 15:03:23 vmpve01 systemd[1]: Started Cleanup of Temporary Directories.
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-node/vmpve01: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/105: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/104: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/106: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/102: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/103: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/107: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-vm/101: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local-lvm: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local: -1
> Jul 1 15:03:29 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-storage/vmpve01/local-vmstor01: -1
> Jul 1 15:03:37 vmpve01 pvedaemon[2366]: <root@pam> starting task
> UPID:vmpve01:00001600:00016749:5D1A6719:hastart:107:root@pam:
> Jul 1 15:03:37 vmpve01 pmxcfs[2185]: [main] notice: ignore duplicate
> Jul 1 15:03:38 vmpve01 pvedaemon[2366]: <root@pam> end task
> UPID:vmpve01:00001600:00016749:5D1A6719:hastart:107:root@pam: OK
> Jul 1 15:03:38 vmpve01 pmxcfs[2185]: [main] notice: ignore duplicate
> Jul 1 15:03:39 vmpve01 pmxcfs[2185]: [status] notice: RRDC update error
> /var/lib/rrdcached/db/pve2-node/vmpve01: -1
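>
> The rrdcached line above explains these errors: while the clock was an
> hour ahead, RRD updates were recorded with future timestamps, so once the
> clock stepped back, every new update was rejected as older than the last
> one. A workaround I have seen suggested, in case the errors do not clear
> on their own (just a sketch; PVE regenerates these files), is to move the
> stale database aside:
>
> systemctl stop rrdcached
> mv /var/lib/rrdcached/db /var/lib/rrdcached/db.stale   # files regenerate
> systemctl start rrdcached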
>
> After the system time caught up with the actual current time, the RRDC
> errors went away. Unless I missed it, there was no log indication of any
> attempt to restart the VMs. I manually tried to start the VMs, but none on
> node 1 would start. After the system time synchronized with the hardware
> clock, I rebooted the node again and could then start all the VMs on node
> 1. After the manual reboot I waited about 10 minutes; time was in sync,
> but the cluster still did not auto-start node 1's VMs, nor did HA start
> them on node 4.
>
> The cluster is normal at this time:
>
> root@vmpve01:/var/log# pvecm status
> Quorum information
> ------------------
> Date: Mon Jul 1 21:12:29 2019
> Quorum provider: corosync_votequorum
> Nodes: 4
> Node ID: 0x00000001
> Ring ID: 1/80
> Quorate: Yes
>
> Votequorum information
> ----------------------
> Expected votes: 4
> Highest expected: 4
> Total votes: 4
> Quorum: 3
> Flags: Quorate
>
> Membership information
> ----------------------
> Nodeid Votes Name
> 0x00000001 1 10.99.96.191 (local)
> 0x00000002 1 10.99.96.192
> 0x00000003 1 10.99.96.193
> 0x00000004 1 10.99.96.194
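>
> Since quorum looks healthy, the HA services themselves seem like the next
> place to look (a sketch; both units run on every PVE node):
>
> systemctl status pve-ha-crm pve-ha-lrm   # cluster and local resource managers
> cat /etc/pve/ha/resources.cfg            # confirm the VMs are still HA resources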
>
> I was just about to roll into Proxmox updates this week, too. Any
> suggestions on diagnosing the root cause, not of the node reboot, but of
> why the VMs would not HA-start when the node came up an hour in the
> future?
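>
> So far my plan is to start with the LRM and CRM logs around the reboot.
> A sketch; "-b -1" only works with a persistent journal, otherwise grep
> the rotated syslogs:
>
> journalctl -b -1 -u pve-ha-lrm -u pve-ha-crm     # HA units, previous boot
> zgrep -iE 'ha-(lrm|crm)|fence|watchdog' /var/log/syslog*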
>
> Thanks.
>
> JR
>
> JR Richardson
> Engineering for the Masses
> Chasing the Azeotrope
>
_______________________________________________
pve-user mailing list
pve-user@pve.proxmox.com
https://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user