I really hope to find an explanation for all this mess, because I'm not very confident right now.
So far, if I understand all this correctly, I'm not very fond of how the watchdog behaves with crm/lrm. To make a comparison with PVE 3 (Red Hat cluster): fencing happened on the corosync/cluster communication stack, but not on the resource manager stack. On PVE 3, I found rgmanager stuck several times. I just had to find the culprit process (usually pve status), kill it, et voilà. But it never caused an outage.

> > 2 - There seems to be a bug in lrm.
> >
> > Tonight i have seen timeouts in qmstarts in /var/log/pve/tasks/active.
> > Just after the timeouts, lrm was kind of stuck doing nothing.
>
> If it's doing nothing it would be interesting to see in which state it is.
> Because if it's already online and active the watchdog must trigger if
> it is stuck for ~60 seconds or more.

I'll try to grab some info if it happens again.

> Hmm, this means the watchdog was already running out.

Do you have a hint why there are no messages in the logs when the watchdog actually seems to trigger fencing? When a node suddenly reboots, I can't be sure whether it's the watchdog, a hardware bug, a kernel bug or whatever.

> Yeah I looked a bit through logs of two of your nodes, it looks like the
> system hit quite some bottle necks..
> CRM/LRM run often in 'loop took to long' errors the filesystem also is
> sometimes not writable.
> You have in some logs some huge retransmit list from corosync.

Yes, there were many retransmits on "9 Nov 14:56". This matches the time when we tried to switch the network path, because at that time the nodes did not seem to talk to each other correctly (lrm waiting for quorum). Anyway, I need to triple-check (again) IGMP snooping on all network switches, and also check the HP blades' Virtual Connect and firmware.

> Where does your cluster communication happens, not on the storage
> network?

Storage is on Fibre Channel. Cluster communication happens on a dedicated network VLAN (shared with VMware). I also use another VLAN for live migrations.
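
To line the retransmits up with events such as the network path switch, one rough approach is to count them per minute in the logs. The following is only a minimal sketch under assumptions: it assumes the corosync lines contain the text "Retransmit List:" and begin with a classic syslog timestamp such as "Nov  9 14:56:01". The function name and the sample lines are made up for illustration and are not taken from the actual nodes.

```python
import re
from collections import Counter

# Assumed syslog prefix: "Nov  9 14:56:01 host ..." (month, day, HH:MM:SS).
TIMESTAMP = re.compile(r"^(\w{3})\s+(\d{1,2})\s+(\d{2}):(\d{2}):\d{2}\s")


def count_retransmits_per_minute(lines, marker="Retransmit List:"):
    """Count log lines containing `marker`, grouped by 'Mon D HH:MM'."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # line does not use the assumed timestamp format
        month, day, hour, minute = match.groups()
        counts[f"{month} {int(day)} {hour}:{minute}"] += 1
    return counts


# Hypothetical sample input, only to show the expected shape of the lines.
sample = [
    "Nov  9 14:56:01 node1 corosync[1234]: [TOTEM ] Retransmit List: 1a 1b 1c",
    "Nov  9 14:56:30 node1 corosync[1234]: [TOTEM ] Retransmit List: 1d",
    "Nov  9 14:57:02 node1 pve-ha-lrm[2345]: some unrelated message",
    "Nov  9 14:57:10 node1 corosync[1234]: [TOTEM ] Retransmit List: 20",
]

for minute, count in sorted(count_retransmits_per_minute(sample).items()):
    print(minute, count)
```

On a real node the same function could be fed the lines of the system log instead of the sample list. Passing a different `marker` would allow counting other recurring messages (for example the CRM/LRM loop errors) in the same way, provided the exact message text is checked against the real logs first.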
_______________________________________________
pve-user mailing list
[email protected]
http://pve.proxmox.com/cgi-bin/mailman/listinfo/pve-user
