Hi Dejan,

As suggested, I've filed a bug report and also attached the remaining logs
there:
http://developerbugs.linux-foundation.org/show_bug.cgi?id=1831

> Strange timestamps. Which node went down? And when? Also,
> rbxd02:eth0 was not reported as down and rbxw01:eth0 rbxd01:eth0
> not as up: probably at some point rbxw02:eth0 went down. It would
> be interesting to see logs from the other nodes. Don't know why
> hb_report didn't pack them.
>
> Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
> Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
> Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.

The node I stopped was rbxd02, shortly after heartbeat had been started
(after I experienced the problem, I cleaned up the /var/lib/crm/
directory, copied the cib.xml there again, started up the whole cluster and
tried the same procedure again).
Maybe I messed up the times given to hb_report, but looking at this,
I guess rbxw01 was the last node to join the cluster (19:10).
Then I shut down rbxd02 (19:10) and it rejoined at 19:13, which is when
things started to go wrong.

Additionally, I find it hard to believe that the rejoining rbxd02 can
cause real network problems between, say, rbxd01 and rbxw02...

Two questions:
1.) When I get the next maintenance window, what can I do to get more
debug/helpful information ('debug' in ha.cf, hb_report, additional logs,
...)?
2.) Now there's a mailing list discussion and a bug report. To ease the
discussion, I think one of them should be closed - which one?

Mit freundlichen Grüßen / Best regards

Andreas MATHER
ESLT - Enterprise Services for Linux Technologies

IBM Austria, Obere Donaustrasse 95, 1020 Vienna
Phone : +43-1-21145/4799
Fax: +43-1-21145/8888
e-mail: [EMAIL PROTECTED]

IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
Sitz: Wien
Firmenbuchgericht: Handelsgericht Wien, FN 80000y


                                                                       
From: Dejan Muhamedagic <[EMAIL PROTECTED]>
Sent by: [EMAIL PROTECTED]
To: High-Availability Linux Development List <[email protected]>
Date: 02/11/2008 07:25 PM
Subject: Re: [Linux-ha-dev] hb report - troubles on 4 node cluster
Please respond to: High-Availability Linux Development List <[EMAIL PROTECTED]>

Hi Andreas,

On Sun, Feb 10, 2008 at 09:38:45PM +0100, Andreas Mather1 wrote:
> Hi all,
>
> Please find attached a hb_report for a problem I experienced when
> implementing heartbeat.
>
> The environment:
> It's an asymmetric 4 node cluster, running heartbeat 2.1.3. All nodes share
> a couple of filesystems, all GPFS formatted. Services include WebSphere
> (modified RA), DB2 (modified RA), vsftpd (Xinetd), samba, nfs, MCS (self
> written RA), IHS and are put in 4 groups (filesvc, mcs, was, db). Dejan is
> also familiar with the setup.
> OS: SLES 9.3 (x86_64)
> heartbeat: built via ./ConfigureMe package
>
>
> The Problem:
> In general, everything works fine (crm_standby works for every node, etc.),
> but when I simulate a power loss of one node (via IBM RSA)*, a cluster
> split occurs when this node rejoins. Suddenly, on every node, crm_mon shows
> the node it is running on as 'online' while reporting the other nodes as
> 'OFFLINE'. After 1 - 2 min. the cluster is fully operational again (all
> nodes have found each other again), but it seems that every resource gets
> restarted.
>
> Please let me know, if I can provide further information.

From the log on rbxw02:

Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 up.
Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.
Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 dead.
Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 dead.
Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 dead.
Feb 10 19:15:06 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxw01 returning after partition.
Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 up.
Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 up.
Feb 10 19:15:07 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxd01 returning after partition.
Feb 10 19:15:07 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 up.

Strange timestamps. Which node went down? And when? Also,
rbxd02:eth0 was not reported as down, and rbxw01:eth0 and rbxd01:eth0
were not reported as up: probably at some point rbxw02:eth0 went down.
It would be interesting to see logs from the other nodes. Don't know why
hb_report didn't pack them.
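
To make the gaps in the link reports easier to spot, the log excerpt above can be reduced to the last reported state of each link. The following is only a minimal sketch in Python: the regular expression is written against the message format shown in this excerpt, and the function name last_link_states is made up for illustration.

```python
import re

# Matches the "Link <node>:<iface> up|dead." messages seen in the excerpt.
LINK_RE = re.compile(r"info: Link ([^\s:]+):(\S+) (up|dead)\.")

def last_link_states(lines):
    """Return {(node, iface): last reported state} for all link messages."""
    states = {}
    for line in lines:
        match = LINK_RE.search(line)
        if match:
            node, iface, state = match.groups()
            states[(node, iface)] = state
    return states

# Log lines copied from the rbxw02 excerpt in this message.
LOG = """\
Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 up.
Feb 10 19:10:16 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 up.
Feb 10 19:10:32 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 dead.
Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth0 dead.
Feb 10 19:13:16 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 dead.
Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth0 dead.
Feb 10 19:13:17 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 dead.
Feb 10 19:15:06 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxw01 returning after partition.
Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxw01:eth2 up.
Feb 10 19:15:06 rbxw02 heartbeat: [22769]: info: Link rbxd02:eth2 up.
Feb 10 19:15:07 rbxw02 heartbeat: [22769]: CRIT: Cluster node rbxd01 returning after partition.
Feb 10 19:15:07 rbxw02 heartbeat: [22769]: info: Link rbxd01:eth2 up.
""".splitlines()

states = last_link_states(LOG)
for (node, iface), state in sorted(states.items()):
    print(f"{node}:{iface} last reported {state}")
```

In this excerpt the eth2 links end up reported as up, the eth0 links of rbxd01 and rbxw01 end as dead, and rbxd02:eth0 does not appear at all.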

Two extra nodes went DC around 19:13 for about two minutes, which
means that there were three partitions: w02 with d02, w01 alone,
and d01 alone. Note that none of them had quorum.
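
The quorum arithmetic can be sketched as follows. This assumes a plain strict-majority rule (more than half of the configured nodes); the actual quorum decision in heartbeat may involve more than this, and the helper name has_quorum is hypothetical.

```python
def has_quorum(partition_size, cluster_size):
    """Strict majority: the partition must hold more than half of all nodes."""
    return 2 * partition_size > cluster_size

CLUSTER_SIZE = 4

# The three partitions described above, by node count.
partitions = {"w02+d02": 2, "w01": 1, "d01": 1}

quorum = {name: has_quorum(size, CLUSTER_SIZE) for name, size in partitions.items()}
for name, ok in quorum.items():
    print(f"{name}: {'quorum' if ok else 'no quorum'}")
```

Under this rule a partition of a 4 node cluster needs at least 3 nodes, so a 2/1/1 split leaves every partition without quorum.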

Looks like a network problem, but an awkward one. Don't know how
it got disrupted this much. Perhaps you could try with unicast:
replace each mcast directive with four ucast directives.
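
As a rough illustration of that replacement, the Python sketch below rewrites ha.cf lines so that every mcast directive becomes one ucast directive per node address. The mcast parameters, the interface-to-address mapping, and all IP addresses are invented placeholders (documentation ranges), not values from this cluster; the function name mcast_to_ucast is likewise made up.

```python
def mcast_to_ucast(ha_cf_lines, peer_ips):
    """Replace each 'mcast <dev> ...' line with 'ucast <dev> <ip>' lines.

    peer_ips maps an interface name to the list of node addresses
    reachable on that interface. Other lines are passed through unchanged.
    """
    out = []
    for line in ha_cf_lines:
        fields = line.split()
        if fields and fields[0] == "mcast":
            dev = fields[1]
            out.extend(f"ucast {dev} {ip}" for ip in peer_ips[dev])
        else:
            out.append(line)
    return out

# Hypothetical input: two mcast directives, one per heartbeat interface.
example_ha_cf = [
    "mcast eth0 225.0.0.1 694 1 0",
    "mcast eth2 225.0.0.2 694 1 0",
]

# Placeholder addresses for the four nodes on each interface.
example_ips = {
    "eth0": ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"],
    "eth2": ["198.51.100.1", "198.51.100.2", "198.51.100.3", "198.51.100.4"],
}

converted = mcast_to_ucast(example_ha_cf, example_ips)
print("\n".join(converted))
```

With two mcast directives and four nodes this yields eight ucast lines, four per interface.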

Cheers,

Dejan

> Thanks,
>
> Andreas
>
>
> * Sorry, I forgot to test what happens, when I just stop and start
> heartbeat on that node - would be useful too, I think... :(
>
>
>
>
> (See attached file: report_1.tar.gz)
>
> Mit freundlichen Grüßen / Best regards
>
> Andreas MATHER
> ESLT - Enterprise Services for Linux Technologies
>
> IBM Austria, Obere Donaustrasse 95, 1020 Vienna
> Phone : +43-1-21145/4799
> Fax: +43-1-21145/8888
> e-mail: [EMAIL PROTECTED]
>
> IBM Österreich Internationale Büromaschinen Gesellschaft m.b.H.
> Sitz: Wien
> Firmenbuchgericht: Handelsgericht Wien, FN 80000y


> _______________________________________________________
> Linux-HA-Dev: [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
> Home Page: http://linux-ha.org/
