On 26 Jun 2012, at 22:18, Andreas Kurz wrote:

> use STONITH to prevent resources running on both nodes ... you
> configured redundant cluster communication paths?

The nodes in question are Linode VMs, so not much opportunity for that.

> With heartbeat you can use the "cl_status" command with its various
> options to check Heartbeats view of the cluster .... and heartbeats log
> messages from the split-brain event should also give you some hints.

cl_status just confirms that each node thinks the other is dead.

OK, I see two things happening in the logs. At one point proxy2 reported a slow
heartbeat (20 sec; deadtime was set to 15) but seemed to reconnect.

Later on, both nodes reported each other as dead within the same second:

Jun 25 10:14:16 proxy1 heartbeat: [2678]: WARN: node proxy2.example.com: is dead
Jun 25 10:14:16 proxy1 heartbeat: [2678]: info: Link proxy2.example.com:eth0 dead.
Jun 25 10:14:16 proxy1 crmd: [3205]: notice: crmd_ha_status_callback: Status update: Node proxy2.example.com now has status [dead]
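
For anyone wanting to check their own logs for the same pattern, here is a minimal sketch in Python, assuming syslog-style lines in the format quoted above. It collects the "is dead" warnings and flags any second in which more than one host declared a peer dead. The function names and the proxy2 sample line (including its PID) are hypothetical; only the proxy1 lines come from the actual log.

```python
import re
from collections import defaultdict

# Matches heartbeat "node X: is dead" warnings in the syslog format quoted above.
DEAD_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+heartbeat:.*"
    r"WARN: node (?P<peer>\S+): is dead"
)


def dead_events(lines):
    """Return {timestamp: [(reporting_host, peer_declared_dead), ...]}."""
    events = defaultdict(list)
    for line in lines:
        m = DEAD_RE.match(line)
        if m:
            events[m.group("ts")].append((m.group("host"), m.group("peer")))
    return dict(events)


def mutual_deaths(events):
    """Timestamps at which two or more different hosts declared a peer dead."""
    return [ts for ts, evs in events.items() if len({h for h, _ in evs}) > 1]


sample = [
    # Taken from the log excerpt above.
    "Jun 25 10:14:16 proxy1 heartbeat: [2678]: WARN: node proxy2.example.com: is dead",
    # Hypothetical counterpart from the other node (not in the quoted excerpt).
    "Jun 25 10:14:16 proxy2 heartbeat: [2511]: WARN: node proxy1.example.com: is dead",
    # Link-level message; deliberately not counted as a node death.
    "Jun 25 10:14:16 proxy1 heartbeat: [2678]: info: Link proxy2.example.com:eth0 dead.",
]

if __name__ == "__main__":
    for ts in mutual_deaths(dead_events(sample)):
        print("possible split-brain at", ts)
```

To run it against real data, the logs from both nodes would need to be concatenated and passed to dead_events() in place of the sample list.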

As I understand it, STONITH is intended to prevent a node rejoining in case it
causes more trouble. In this case the individual nodes were fine; it appeared
to be the network that was at fault. Why wouldn't these nodes automatically
reconnect, given that there is no STONITH to prevent them? How should I tell
them to reconnect manually?

I can also see that it failed to send alerts from the email resources at the
same time because DNS lookups were failing: it all points to a wider network issue.

I wonder if Linode has micro-outages on their network since we've also been 
seeing some problems with mmm reporting 'network unreachable' on some other 
instances at the same time.

Marcus
-- 
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK info@hand CRM solutions
[email protected] | http://www.synchromedia.co.uk/



_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
