Re: [Linux-HA] How to analyze node failure

Digimer Tue, 11 Jun 2013 08:18:51 -0700

On 06/11/2013 11:04 AM, Stefan Schloesser wrote:

Hi,


I have a setup with 2 nodes, DRBD, MySQL and Apache. Rather too often for my 
liking (about once per month) one node is killed (fenced) by the other. Each 
time I am unable to find out what actually caused this behaviour.
I can see in the logs that one node is suddenly fenced (STONITH), but no error 
appears that explains why this happens.
Each time I can simply start the node and corosync and everything works fine 
again, i.e. no fault is apparent.

I already thought about auto-starting corosync, but that does not seem like a 
good idea. I tried tuning the communication params (totem) to no avail.

So my question is this: what's the best way to find the cause?

Stefan Schlösser

This sounds like a problem with the network. Do you see something like "token didn't arrive in time" (I'm guessing on the wording) on the surviving node?
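
A minimal sketch of that check, in case it helps: it pulls every line that mentions the token, membership changes or fencing out of a log file on the surviving node. The pattern list, the function names and the log path are assumptions for illustration, not exact corosync wording, so adjust them to whatever your version actually logs.

```python
import re
from typing import Iterable, List

# Assumed keywords only: the exact corosync/pacemaker wording differs
# between versions, so tune this list after reading your own logs.
PATTERNS = [r"token", r"retransmit", r"membership", r"fenc", r"stonith"]


def find_suspect_lines(lines: Iterable[str],
                       patterns: List[str] = PATTERNS) -> List[str]:
    """Return the lines that match any pattern, case-insensitively."""
    combined = re.compile("|".join(patterns), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if combined.search(line)]


def scan_file(path: str) -> List[str]:
    """Scan one log file; undecodable bytes are replaced, not fatal."""
    with open(path, errors="replace") as handle:
        return find_suspect_lines(handle)
```

Calling scan_file() on the surviving node's log (the path depends on your distribution and logging setup) and reading the hits from the minutes before the fence should show whether the token was lost before the node was shot.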

You might want to check that you have persistent multicast groups set in your switch(es). You might also want to set up bonding on the corosync interface (Active/Passive is best) and/or the redundant ring protocol.

It might be that something on the failed node tried to log but the node was fenced before the buffer was written out to the logs?


--
Digimer
Papers and Projects: https://alteeve.ca/w/

What if the cure for cancer is trapped in the mind of a person without access to education?

_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
