I have two nodes running heartbeat 3.0.5 and pacemaker 1.1.6 (both from the 
linux-ha lucid ppa). They are running 11 groups each comprising an 
ocf:heartbeat:IPaddr2, an ocf:heartbeat:SendArp and an ocf:heartbeat:MailTo.

There is also a MailTo resource configured for the overall cluster.

Despite all of these, every notification I receive looks identical:

    Heartbeat status change: Migrating resource away at Mon Aug  5 13:01:49 UTC 2013 from proxy2
    Command line was:
    /usr/lib/ocf/resource.d//heartbeat/MailTo stop

One major omission here is that it doesn't tell me which resource it migrated.

Is there some way of configuring the cluster itself to send notifications so 
that I can remove the individual mailto resources?

Coincidentally (?), I've just started to get this problem:

    Aug  5 11:13:50 proxy1 heartbeat: [2958]: ERROR: glib: ucast_write: Unable to send HBcomm packet eth0 192.168.1.10:694 len=78903 [-1]: Message too long
    Aug  5 11:13:50 proxy1 heartbeat: [2958]: ERROR: write_child: write failure on ucast eth0.: Message too long

This (or at least I assume it's this) is causing resources to disappear, to 
start and stop at random and to flip-flop between nodes, and nodes to be 
marked offline, among other fun things that keep us awake at night.
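
For what it's worth, "Message too long" is the usual text for the EMSGSIZE 
error, and the reported len=78903 is larger than the biggest payload a single 
UDP datagram can carry over IPv4. A minimal sketch of that arithmetic, 
assuming heartbeat's ucast sends each message as one datagram (I have not 
checked the heartbeat source; the function name is just for illustration):

```python
import errno
import os

# IPv4 total length is a 16-bit field, so a packet is at most 65535 bytes.
IPV4_MAX_PACKET = 65535
IPV4_HEADER = 20   # minimum IPv4 header, no options
UDP_HEADER = 8

# Largest payload that fits in one UDP datagram over IPv4.
MAX_UDP_PAYLOAD = IPV4_MAX_PACKET - IPV4_HEADER - UDP_HEADER


def fits_in_one_datagram(length):
    """Return True if a message of this many bytes fits in one datagram."""
    return length <= MAX_UDP_PAYLOAD


reported_len = 78903  # the len= value from the log line above

print("max UDP payload:", MAX_UDP_PAYLOAD)
print("fits:", fits_in_one_datagram(reported_len))
print("bytes over the limit:", reported_len - MAX_UDP_PAYLOAD)
# The platform's text for EMSGSIZE (wording can vary between systems).
print("EMSGSIZE text:", os.strerror(errno.EMSGSIZE))
```

If that assumption holds, the message would have to shrink by roughly 13 KB 
before it could be sent at all.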

The only explanation I've found for this is here:
http://comments.gmane.org/gmane.linux.highavailability.pacemaker/10765
The solutions suggested there are to alter the compression settings (which we 
were not using before), to migrate to corosync and/or to make the CIB smaller, 
hence the idea of removing the individual MailTo resources.
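
To get a rough feel for how much compression or a smaller CIB might help, 
something like the sketch below could compare the raw and compressed size of a 
saved CIB dump against the single-datagram limit. This is only an estimate 
under my own assumptions: heartbeat's wire format adds its own overhead, I 
don't know exactly what it puts in each message, and the function names and 
the script name are made up for illustration.

```python
import bz2
import sys
import zlib

# Largest UDP payload over IPv4: 65535 - 20 (IP header) - 8 (UDP header).
MAX_UDP_PAYLOAD = 65507


def size_report(cib_xml):
    """Return the raw, zlib-compressed and bz2-compressed sizes in bytes."""
    return {
        "raw": len(cib_xml),
        "zlib": len(zlib.compress(cib_xml)),
        "bz2": len(bz2.compress(cib_xml)),
    }


def fits(sizes):
    """Map each size label to whether it is within one datagram's payload."""
    return {name: n <= MAX_UDP_PAYLOAD for name, n in sizes.items()}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Expects the path of a file holding a CIB dump, e.g. from `cibadmin -Q`.
    with open(sys.argv[1], "rb") as f:
        sizes = size_report(f.read())
    for name, ok in fits(sizes).items():
        print(name, sizes[name], "fits" if ok else "too large")
```

Usage would be along the lines of `cibadmin -Q > cib.xml` followed by 
`python3 cib_size.py cib.xml`, run once with and once without the individual 
MailTo resources in the configuration.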

I've run hb_report, but it doesn't tell me anything useful; its output amounts 
to little more than "it doesn't work".

I'd like to migrate to corosync if it's better, but I'm extremely wary of 
touching anything in the cluster.

Marcus