> Hi all,
> I hope that you can help me with this strange problem. I've got a nine
> node cluster which is configured with no-quorum-policy to stop.
> Two days ago I came across this error on one of the nodes:
>
> Oct 14 00:00:38 kvm06 kernel: Uhhuh. NMI received for unknown reason a1
> on CPU 0.
> Oct 14 00:00:38 kvm06 kernel: You have some hardware problem, likely on
> the PCI bus.
> Oct 14 00:00:38 kvm06 kernel: Dazed and confused, but trying to continue
> Oct 14 00:00:43 kvm06 corosync[2027]: [TOTEM ] A processor failed,
> forming new configuration.
>
> this error seemed to compromise the entire cluster activity. From this
> moment on I received a lot of other notifications concerning network
> connectivity all around the cluster. Everything ended with this:

Hi,

I have seen these kinds of errors with erratic network cards or drivers.
Onboard cards and the network equipment of blades seem to be the parts
where manufacturers cut their costs.

If you have errors in the network, you eventually lose packets.
corosync/Pacemaker does not tolerate this well and sometimes reacts to
heavy packet loss, for example by declaring a processor failed and
forming a new configuration, as in your log.

Greetings,

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München
Tel: (0163) 172 50 98
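
One way to look for this kind of trouble is to check the per-interface
error and drop counters that the Linux kernel exposes in /proc/net/dev.
The following is only a minimal sketch, assuming a Linux host; the
function names parse_net_dev and suspicious_interfaces are made up for
illustration, and non-zero counters are a hint worth following up, not
proof of a faulty card or driver.

```python
from pathlib import Path


def parse_net_dev(text):
    """Return {interface: counters} parsed from /proc/net/dev-style text."""
    stats = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        if len(fields) < 16:
            continue  # unexpected layout, skip rather than guess
        # Columns 0-7 are receive counters, 8-15 are transmit counters.
        stats[name.strip()] = {
            "rx_errs": fields[2],
            "rx_drop": fields[3],
            "tx_errs": fields[10],
            "tx_drop": fields[11],
        }
    return stats


def suspicious_interfaces(stats):
    """Return only the interfaces with a non-zero error or drop counter."""
    return {name: c for name, c in stats.items() if any(c.values())}


if __name__ == "__main__":
    proc = Path("/proc/net/dev")
    if proc.exists():  # only present on Linux
        found = suspicious_interfaces(parse_net_dev(proc.read_text()))
        for name, counters in found.items():
            print(name, counters)
```

Running this on each cluster node and comparing the counters over time
shows whether one interface is accumulating errors or drops faster than
the others.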
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
