> Hi all,
> I hope that you can help me with this strange problem. I've got a
> nine-node cluster which is configured with no-quorum-policy set to stop.
> Two days ago I came across this error on one of the nodes:
> 
> Oct 14 00:00:38 kvm06 kernel: Uhhuh. NMI received for unknown reason a1
> on CPU 0.
> Oct 14 00:00:38 kvm06 kernel: You have some hardware problem, likely on
> the PCI bus.
> Oct 14 00:00:38 kvm06 kernel: Dazed and confused, but trying to continue
> Oct 14 00:00:43 kvm06 corosync[2027]:   [TOTEM ] A processor failed,
> forming new configuration.
> 
> This error seemed to compromise the entire cluster activity. From this
> moment on I received a lot of other notifications concerning network
> connectivity all around the cluster. Everything ended with this:


Hi,

I have seen these kinds of errors with erratic network cards or drivers.
Onboard cards and the network equipment of blades seem to be the part where
manufacturers cut costs.

If you have errors in the network, you eventually lose packets.
corosync/pacemaker doesn't like this and sometimes reacts to heavy packet
loss, for example by declaring a node failed and forming a new configuration.
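
To check whether a card is actually reporting errors, you could read the
per-interface counters that the Linux kernel exposes under
/sys/class/net/<iface>/statistics/. The following is only a minimal sketch in
Python, not a diagnosis tool: the interface name eth0 and the helper names
read_nic_counters and nonzero_counters are my assumptions, so adjust them to
your nodes.

```python
from pathlib import Path

# Counter files the Linux kernel provides in /sys/class/net/<iface>/statistics/
COUNTERS = ("rx_errors", "tx_errors", "rx_dropped", "tx_dropped")


def read_nic_counters(iface, base="/sys/class/net"):
    """Return the error and drop counters of one interface as a dict."""
    stats_dir = Path(base) / iface / "statistics"
    return {name: int((stats_dir / name).read_text().strip())
            for name in COUNTERS}


def nonzero_counters(counters):
    """Keep only the counters that are above zero."""
    return {name: value for name, value in counters.items() if value > 0}


if __name__ == "__main__":
    import sys
    # "eth0" is an assumed default; pass the real interface name as argument.
    iface = sys.argv[1] if len(sys.argv) > 1 else "eth0"
    print(iface, nonzero_counters(read_nic_counters(iface)))
```

Counters that keep growing on the interface used for cluster communication
would support the suspicion of a faulty card, driver or cable. A clean result
does not rule out a hardware problem.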

Greetings,

-- 
Dr. Michael Schwartzkopff
Guardinistr. 63
81375 München

Tel: (0163) 172 50 98


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
