On 06/11/2013 11:04 AM, Stefan Schloesser wrote:
Hi,
I have a two-node setup running DRBD, MySQL and Apache. Rather too often for my
liking (about once per month), one node is killed (fenced) by the other. Each
time, I am unable to find out what actually caused this behaviour.
I can see in the logs that one node is suddenly fenced (STONITH), but no error
appears that explains why this happens.
Each time I can simply start the node and corosync and everything works fine
again, i.e. no fault is apparent.
I already thought about auto-starting corosync, but that does not seem like a
good idea. I tried tuning the communication parameters (totem), to no avail.
So my question is this: what's the best way to find the cause?
Stefan Schlösser
This sounds like a problem with the network. Do you see something like
"token didn't arrive in time" (I'm guessing on the wording) on the
surviving node?
You might want to check that you have persistent multicast groups set in
your switch(es). You might also want to set up bonding on the corosync
interface (Active/Passive is best) and/or use the redundant ring protocol.
It might also be that something on the failed node tried to log an error,
but the node was fenced before the buffer was written out to the logs.
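
If it helps, here is a rough sketch of how you could pull the relevant
lines out of the surviving node's log around the time of a fence. The
function name and the search patterns are my own assumptions, not
corosync's exact wording (I'm still guessing at the messages), so adjust
them to whatever your version actually logs.

```python
import re
from typing import Iterable, List, Sequence

# Assumed message fragments, not an authoritative list of corosync or
# pacemaker log strings. Edit these to match your own logs.
DEFAULT_PATTERNS: Sequence[str] = (
    r"token",
    r"processor failed",
    r"forming new configuration",
    r"stonith",
    r"fenc",
)


def find_membership_events(
    lines: Iterable[str], patterns: Sequence[str] = DEFAULT_PATTERNS
) -> List[str]:
    """Return the log lines that match any pattern, in their original order.

    Matching is case-insensitive; trailing newlines are stripped.
    """
    combined = re.compile(
        "|".join(f"(?:{p})" for p in patterns), re.IGNORECASE
    )
    return [line.rstrip("\n") for line in lines if combined.search(line)]
```

You would call it with an open log file, for example
find_membership_events(open("/var/log/messages", errors="replace")), where
the path is only an example and depends on your syslog setup. Comparing
the timestamps of the hits on both nodes should show whether the token
was lost before the fence was issued.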
--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?