Hi all, I hope you can help me with this strange problem. I've got a nine-node cluster configured with no-quorum-policy set to stop. Two days ago I came across this error on one of the nodes:
Oct 14 00:00:38 kvm06 kernel: Uhhuh. NMI received for unknown reason a1 on CPU 0.
Oct 14 00:00:38 kvm06 kernel: You have some hardware problem, likely on the PCI bus.
Oct 14 00:00:38 kvm06 kernel: Dazed and confused, but trying to continue
Oct 14 00:00:43 kvm06 corosync[2027]: [TOTEM ] A processor failed, forming new configuration.

This error seemed to compromise the activity of the entire cluster. From that moment on I received a lot of other notifications about network connectivity problems all around the cluster. Everything ended with this:

Oct 14 00:05:06 kvm01 cib: [18970]: notice: ais_dispatch_message: Membership 6924: quorum lost

and with all the cluster's resources being stopped.

I cannot exclude network connectivity problems, but since I've got stonith configured for every node (with IPMI, working over a separate, dedicated network channel), I was expecting every unreachable node to be fenced, and this did not happen.

After the quorum error every cluster node went offline, and the only way to make things work again was to stop corosync on the first node (the one with the "suspected" hardware problem). Of course I've checked the hardware of this machine and everything seemed to be fine.

What I don't understand is why I lost quorum, since the problem seemed to affect just one node (and I've got 9 nodes in total).

I know that without full logs it is impossible to understand the problem, but maybe you can help with some suggestions.

Thanks a lot,

-- 
RaSca
Mia Mamma Usa Linux: Niente è impossibile da capire, se lo spieghi bene!
http://www.miamammausalinux.org
_______________________________________________
Linux-HA mailing list
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
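
P.S. To make my expectation about quorum explicit, here is a minimal sketch of the majority arithmetic as I understand it. It assumes one vote per node and a plain majority rule (strictly more than half of the total votes); the function names are mine for illustration and are not part of corosync or Pacemaker.

```python
def quorum_threshold(total_nodes: int) -> int:
    """Smallest number of votes that is a strict majority (assumes one vote per node)."""
    return total_nodes // 2 + 1


def has_quorum(visible_nodes: int, total_nodes: int) -> bool:
    """True if a partition that sees `visible_nodes` members holds a majority."""
    return visible_nodes >= quorum_threshold(total_nodes)


TOTAL = 9  # nodes in my cluster

print(quorum_threshold(TOTAL))  # 5
print(has_quorum(8, TOTAL))     # True: only kvm06 missing
print(has_quorum(4, TOTAL))     # False: fewer than 5 members visible
```

Under these assumptions, losing kvm06 alone should have left 8 of 9 votes, so I would expect "quorum lost" only on a node that could see fewer than 5 members. That is why I suspect the membership split was wider than a single node.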
