Re: [Linux-HA] Strange split-brain behavior

Dejan Muhamedagic Fri, 18 Apr 2008 04:17:58 -0700

Hi,

On Fri, Apr 18, 2008 at 11:21:04AM +0200, Raoul Bhatia [IPAX] wrote:
> hi,
> 
> i have a cluster consisting of two servers: wc01 and wc02.
> no stonith is enabled.
> 
> i started wc02. wc02 is plugged into the switch
> i started wc01 - no connection to wc02.
> 
> after the servers (and heartbeat) are started, i plug wc01 into the
> switch. the two nodes find each other but remain in split-brain mode.


The strange thing is that the report you sent is from wc02, but
the syslog messages in it look as if they come from wc01:

Apr 17 13:49:42 wc01 heartbeat: [2416]: info: Link wc02:eth0.1 up.
Apr 17 13:50:12 wc01 heartbeat: [2416]: WARN: node wc01: is dead
...
Apr 17 13:50:12 wc01 ccm: [2478]: info: Hostname: wc02

What gives? What's on the other node (the report contains only
one)? The ccm constantly delivers membership info containing
only one node. There's something really fishy about the setup.
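
To make the mismatch concrete, here is a rough Python sketch that
compares the host field syslog stamps on a line with the hostname
ccm reports on that same line. The function name and the regular
expression are assumptions of mine, fitted only to the three lines
quoted above; other log formats would need a different pattern.

```python
import re

# The syslog lines quoted above; in practice they would be read from the log.
LINES = [
    "Apr 17 13:49:42 wc01 heartbeat: [2416]: info: Link wc02:eth0.1 up.",
    "Apr 17 13:50:12 wc01 heartbeat: [2416]: WARN: node wc01: is dead",
    "Apr 17 13:50:12 wc01 ccm: [2478]: info: Hostname: wc02",
]

# Assumed layout: "<Mon> <day> <time> <syslog host> ccm: [pid]: info: Hostname: <name>"
CCM_HOSTNAME = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<syslog_host>\S+)\s+ccm: \[\d+\]: "
    r"info: Hostname: (?P<ccm_host>\S+)$"
)


def find_name_mismatches(lines):
    """Return (syslog_host, ccm_host) pairs where the two names differ."""
    mismatches = []
    for line in lines:
        match = CCM_HOSTNAME.match(line.strip())
        if match and match.group("syslog_host") != match.group("ccm_host"):
            mismatches.append(
                (match.group("syslog_host"), match.group("ccm_host"))
            )
    return mismatches


if __name__ == "__main__":
    for syslog_host, ccm_host in find_name_mismatches(LINES):
        print(f"syslog host is {syslog_host}, but ccm reports {ccm_host}")
```

Any pair it returns means the machine logs under one name while
the membership layer knows it under another.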

> ok, now i tried to configure stonith (via ssh as its a test).
> 
> 1) cibadmin -U does not work. it times out and strace shows:
> > sendto(5, "i\0\0\0\315\253\0\0>>>\ncib_op=register\ncib_"..., 113, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 113
> > recvfrom(5, "K\0\0\0\315\253\0\0>>>\ncib_op=register\ncib_"..., 4048, MSG_DONTWAIT, NULL, NULL) = 83
> > poll([{fd=5, events=0}], 1, 0)          = 0
> > recvfrom(5, 0x561deb, 3965, 64, 0, 0)   = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=5, events=0}], 1, 0)          = 0
> > recvfrom(5, 0x561deb, 3965, 64, 0, 0)   = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=5, events=0}], 1, 0)          = 0
> > recvfrom(5, 0x561deb, 3965, 64, 0, 0)   = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=5, events=0}], 1, 0)          = 0
> > recvfrom(5, 0x561deb, 3965, 64, 0, 0)   = -1 EAGAIN (Resource temporarily unavailable)
> > poll([{fd=5, events=0}], 1, 0)          = 0
> > brk(0x589000)                           = 0x589000
> > brk(0x5aa000)                           = 0x5aa000
> > brk(0x5cb000)                           = 0x5cb000

The cib refuses updates from wc01 because it thinks the membership
contains wc02, not wc01. It's the node name confusion again: the
cluster thinks this node is wc02, but your cibadmin request comes
from wc01.

> now i unplug wc01. wc02 notices this. now i tried cibadmin -U without
> success. i then tried to restart heartbeat - this does not work.

> the strange thing is:
> 
> > # crm_mon -1|grep DC
> > Current DC: wc02 (f36760d8-d84a-46b2-b452-4c8cac8b3396)
> 
> and
> > # tail -n 4 /var/log/ha-log
> > heartbeat[2416]: 2008/04/17_14:16:40 info: killing /usr/lib/heartbeat/crmd process group 2483 with signal 15
> > crmd[2483]: 2008/04/17_14:16:40 info: crm_shutdown: Requesting shutdown
> > crmd[2483]: 2008/04/17_14:16:40 info: do_shutdown_req: Sending shutdown request to DC: <null>
> > cib[2479]: 2008/04/17_14:20:15 info: cib_stats: Processed 963 operations (6573.00us average, 1% utilization) in the last 10min
> 
> what is the problem? is this a pacemaker or linux-ha problem, that the
> cluster is reacting this way?

There's not much in the log after this. Can't say what's going
on. You should verify that your setup is sane.
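
Given the node name confusion, one check worth running on each
node is whether the name the machine reports for itself appears
among the nodes configured for heartbeat. Below is a minimal
Python sketch of that check. It rests on assumptions not stated
in this thread: that the configuration lives in /etc/ha.d/ha.cf,
that nodes are listed there with "node" directives, and that the
names must match the output of `uname -n` exactly. The helper
names are hypothetical.

```python
import os

HA_CF = "/etc/ha.d/ha.cf"  # assumed default location of the heartbeat config


def configured_nodes(ha_cf_text):
    """Collect the names listed in 'node' directives, ignoring comments."""
    nodes = []
    for raw_line in ha_cf_text.splitlines():
        fields = raw_line.split("#", 1)[0].split()
        if fields and fields[0] == "node":
            nodes.extend(fields[1:])
    return nodes


def local_node_is_configured(ha_cf_text, local_name):
    """True if local_name appears verbatim among the configured nodes."""
    return local_name in configured_nodes(ha_cf_text)


if __name__ == "__main__":
    local_name = os.uname().nodename  # same value that `uname -n` prints
    if not os.path.exists(HA_CF):
        print(f"{HA_CF} not found; adjust HA_CF for this system")
    else:
        with open(HA_CF) as handle:
            text = handle.read()
        if local_node_is_configured(text, local_name):
            print(f"{local_name} is listed in {HA_CF}")
        else:
            print(f"{local_name} is NOT listed; configured nodes: "
                  f"{configured_nodes(text)}")
```

The comparison is deliberately exact, so a short name on one side
and a fully qualified name on the other shows up as a mismatch.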

Thanks,

Dejan

> 
> 
> cheers,
> raoul
> 
> > # dpkg -l|egrep -i "(heartbeat|stonith|pacemaker)"
> > ii  heartbeat    2.1.3-18  Subsystem for High-Availability Linux
> > ii  heartbeat-2  2.1.3-18  Subsystem for High-Availability Linux
> > ii  libstonith0  2.1.3-18  Interface for remotely powering down a node
> > ii  pacemaker    0.6.2-1   High-Availability cluster resource manager f
> > ii  stonith      2.1.3-18  Interface for remotely powering down a node
> -- 
> ____________________________________________________________________
> DI (FH) Raoul Bhatia M.Sc.          email.          [EMAIL PROTECTED]
> Technischer Leiter
> 
> IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
> Barawitzkagasse 10/2/2/11           email.            [EMAIL PROTECTED]
> 1190 Wien                           tel.               +43 1 3670030
> FN 277995t HG Wien                  fax.            +43 1 3670030 15
> ____________________________________________________________________
> 
> 
> 


> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
