On Monday 14 May 2007, Bernhard Limbach wrote:
> Hi,
>
> Update:
>
> The runlevel thing was not a solution (I assumed a kind of race condition
> and thought the deferred start of heartbeat would solve this) but it happened
> again.
>
> I now have the suspicion that it is caused by the missing hb_generation file
> on the freshly installed server. After re-reading the docs I was wondering
> why I didn't run into the replay attack protection anyway ??

I ran the same tests as you did, and with hbgenmethod at its default value
"file" I always ran into the replay attack protection error, which led to the
emergency shutdown (after some pipes ran full).
As you explained below, setting it to "time" fixed my problems as well. I
strongly suggest that something be done about this "replacement" problem.
Replacing a node is one of the most common situations in a cluster
environment, and it is really bad if replacing the broken node shuts down the
only working node ...

> I tried now with "hbgenmethod time" and my reinstall procedure now succeeded
> without coredump (at least one try, I'm getting tired of doing this
> installation thing all the time again...).

I tried it a couple of times (3 to 4 times) now with 2.0.8 and hbgenmethod set
to "time", and it worked fine for me.

> Still I have a little concern that a newly joining node, whether legal or not
> and whether it behaves nicely or not, can cause my running heartbeat to fail
> in this dramatic way...
>
> Regards,
> Bernhard
>
> -------- Original Message --------
> Date: Fri, 11 May 2007 13:07:28 +0200
> From: "Bernhard Limbach" <[EMAIL PROTECTED]>
> To: [email protected]
> Subject: [Linux-HA] Coredump on active node when other node joins in
>
> > Hi,
> >
> > I'm currently practicing the reinstallation of one cluster node
> > (maintenance procedure to replace a server), while the other node is
> > running and providing the services.
> >
> > When the freshly installed node comes up, heartbeat on the primary node
> > dumps core and does an emergency shutdown.
> >
> > Freshly installed means that, in addition to the config files, the only
> > file in /var/lib/heartbeat and below that I have restored is the file
> > hb_uuid. Everything else there should be automatically updated, as far as
> > I have understood the concepts...
> >
> > The error happened (reproducibly) when heartbeat was started in runlevel 2.
> >
> > When started in runlevel 5 it did not happen (that's now my current
> > workaround).
> >
> > The error also did not happen when one of the nodes was rebooted normally,
> > i.e. after it had already been online in the cluster.
> >
> > The setup is a simple 2-node cluster with:
> > - heartbeat-2.0.8 compiled from the tarball that is available on the
> >   download page.
> > - Fedora Core 5 with kernel 2.6.20-1.2316.fc5smp
> >
> > Attached you will find: ha.cf, cib.xml, the logs of both nodes and the
> > backtrace of the core-dump (if I managed to extract it correctly...).
> >
> > Please note also that after the emergency shutdown two heartbeat processes
> > still were running:
> >
> > DMM1:/root # ps -ef |grep heartbeat
> > root 17535 1 0 07:31 ? 00:00:00 /usr/lib/heartbeat/lrmd -r
> > 17 17537 1 0 07:31 ? 00:00:00 /usr/lib/heartbeat/attrd
> >
> > As starting of a freshly installed server in runlevel 5 is a workable
> > workaround for me I merely wanted to inform you about this error, maybe it
> > helps to track down another of those little bugs...
> >
> > Best regards,
> > Bernhard

-- 
Max Hofer
APUS Software G.m.b.H.
A-8074 Raaba, Bahnhofstraße 1/1
T| +43 316 401629 11
F| +43 316 401629 9
W| www.apus.co.at
E| [EMAIL PROTECTED]
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
