Hi,
  We are running a two-node cluster using pacemaker 1.1.5-18.1 with heartbeat 
3.0.4-41.1.  We are seeing what looks like a network issue and cannot make 
heartbeat recover: heartbeat logs "Message too long" errors and the two nodes 
can no longer sync.

Our ha.cf is as follows:
autojoin none
use_logd false
logfacility daemon
debug 0

# use the v2 cluster resource manager
crm yes

# the cluster communication happens via unicast on bond0 and hb1
# hb1 is direct connect
ucast hb1 169.254.1.3
ucast hb1 169.254.1.4
ucast bond0 172.28.102.21
ucast bond0 172.28.102.51
compression zlib
compression_threshold 30

# msgfmt
msgfmt netstring

# a node will be flagged as dead if there is no response for 30 seconds
deadtime 30
initdead 30
keepalive 250ms
uuidfrom nodename

# these are the node names participating in the cluster
# the names should match "uname -n" output on the system
node usrv-qpr2
node usrv-qpr5
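
A rough way to judge whether the compression settings above could help is to compare the size of a CIB dump before and after zlib compression against the largest payload a single IPv4 UDP datagram can carry (65535 - 20 - 8 = 65507 bytes). The sketch below is only an estimate under assumptions: the function name `compression_report` is made up for illustration, and it does not reproduce how heartbeat itself frames or compresses messages.

```python
import zlib

# Largest UDP payload over IPv4: 65535-byte IP packet minus a 20-byte IP header
# (no options) and an 8-byte UDP header.
MAX_UDP_PAYLOAD = 65535 - 20 - 8  # 65507


def compression_report(payload: bytes, level: int = 6) -> dict:
    """Compress payload with zlib and report whether each form fits in one datagram.

    Hypothetical helper for illustration; heartbeat's own wire format adds
    framing that is not modelled here.
    """
    compressed = zlib.compress(payload, level)
    return {
        "raw_len": len(payload),
        "compressed_len": len(compressed),
        "raw_fits": len(payload) <= MAX_UDP_PAYLOAD,
        "compressed_fits": len(compressed) <= MAX_UDP_PAYLOAD,
    }
```

One possible use is to save the output of cibadmin -Q to a file, read it as bytes, and pass it to `compression_report`. If even the compressed size exceeds the limit, compression alone would not be expected to help.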

We can ping all interfaces from both nodes.  One of the bonded NICs had some 
trouble, but we believe we have enough redundancy built in that it should be 
fine.
The issue we see is that if we reboot the non-DC node, it can no longer sync 
with the DC.  The log on the non-DC node shows that the remote node cannot be 
reached.  crm_mon on the non-DC node shows:

Last updated: Fri Aug 19 07:39:05 2011
Stack: Heartbeat
Current DC: NONE
2 Nodes configured, 2 expected votes
26 Resources configured.
============

Node usrv-qpr2 (87df4a75-fa67-c05e-1a07-641fa79784e0): UNCLEAN (offline)
Node usrv-qpr5 (7fb57f74-fae5-d493-e2c7-e4eda2430217): UNCLEAN (offline)

From the DC it looks like all is well.

I tried a cibadmin -Q from the non-DC node and it can no longer contact the 
remote node.

I tried a cibadmin -S from the non-DC node to force a sync, which times out 
with: Call cib_sync failed (-41): Remote node did not respond.

On the DC side I see this:
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23249]: ERROR: write_child: write failure 
on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: glib: ucast_write: Unable 
to send HBcomm packet bond0 172.28.102.51:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23251]: ERROR: write_child: write failure 
on ucast bond0.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: glib: ucast_write: Unable 
to send HBcomm packet hb1 169.254.1.3:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23253]: ERROR: write_child: write failure 
on ucast hb1.: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: glib: ucast_write: Unable 
to send HBcomm packet hb1 169.254.1.4:694 len=83696 [-1]: Message too long
Aug 19 07:38:20 usrv-qpr2 heartbeat: [23255]: ERROR: write_child: write failure 
on ucast hb1.: Message too long
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is 
filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is 
filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 
for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, 
fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, 
lowseq=244443,ackseq=244435,lastmsg=442
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Cannot rexmit pkt 244442 
for usrv-qpr5: seqno too low
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: fromnode =usrv-qpr5, 
fromnode's ackseq = 244435
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hist information:
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: info: hiseq =244943, 
lowseq=244443,ackseq=244435,lastmsg=442
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is 
filling up (500 messages in queue)
Aug 19 07:38:21 usrv-qpr2 heartbeat: [23222]: ERROR: Message hist queue is 
filling up (500 messages in queue)
Aug 19 07:38:22 usrv-qpr2 heartbeat: [23222]: info: all clients are now resumed

My questions:

1) It seems the compression is not working.  Is there something we need to do 
to enable it?  We have tried both bz2 and zlib, and have varied the 
compression threshold as well.

2) How do we get the non-DC system back online?  Rebooting does not work, 
since the DC cannot seem to send the diffs needed to sync it.

3) If the diff it is trying to send is truly too long, how do I recover from 
that?

4) Would more information be useful in diagnosing the problem?

Thanks in advance.
Diane Schaefer
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
