Hi,

On Tue, Nov 25, 2008 at 01:58:35PM +0100, Luis Motta Campos wrote:
> Dejan Muhamedagic wrote:
> >> Before running hb_report, I executed these preparation steps:
> >>
> >> 1. Stop the cfengine process (which monitors the existence of a
> >> heartbeat process and tries to start one up in case of absence);
> >> 2. Generate ssh keys for the root user on both machines, and make sure
> >> that the ~root/.ssh/authorized_keys had the appropriate keys and
> >> configuration to allow a root login from the other host;
> >> 3. Take the heartbeat to a full halt;
> >> 4. Poke logrotate with --force and rotate all the log files;
> >> 5. Take the heartbeat process up again;
> >
> > hb_report doesn't require restarting heartbeat or rotating logs.
> > In fact, sometimes rotating logs may hide important log messages
> > from it.
>
> Thanks for the advice, Dejan. I had an experience before with hb_report,
> and the way it collects log files sometimes is not ideal.

Most probably it's not.

> I had a large
> log file in the past and the bug was "hidden" far away, and I needed to
> truncate the log files

But you shouldn't need to do this. If the logs are large, it will
take a while, but a log segment corresponding to the given time
specification should be found. If it isn't, please file a
bugzilla.

> and reproduce the problem on a "clean"
> installation before it worked as expected. That's why I followed this
> procedure.
>
> As I am relying on logrotate to keep my logfile sizes sane now, I will
> keep your suggestion in mind for the next time.
>
> >> At this point, I noticed that there were no errors anymore.
> >>
> >> I am really confused. Can someone here please explain to me what did I
> >> do wrong to start with?
> >
> > No idea. Perhaps the software got confused as well ;-)
>
> I managed to reproduce the problem by sending SIGKILL to the heartbeat
> processes on one of the hosts. After two attempts and a couple of
> minutes of log-reading, I realized that for some weird reason the host
> was not being stonith'ed.
>
> It took me another check to realize that heartbeat's definition of
> "hostname" is ambiguous: it may mean $(hostname --fqdn), or $(hostname
> --short), or $(uname --nodename). Sometimes, it works with $(hostname
> -s), and sometimes you really need $(hostname -f) to make it work.

The one which should be used is uname -n. At any rate, names in
the hostlist in the stonith conf should match your node names.

> I still have the warnings and errors, and they won't go away even with
> "crm_resource -C". I believe that this is connected to the fact that
> there's no eligible candidate to run the STONITH resource for the node
> we just STONITH'ed (due to the recommended constraints).

Most probably.

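Back to the node-name point: a minimal sketch of that check, in Python.
It assumes the stonith hostlist is a whitespace-separated string, that
the node names were collected with uname -n on each node, and that the
comparison is an exact string match. The host names are made up for
illustration and the helper names are hypothetical.

```python
import os


def node_name():
    # os.uname().nodename is the value `uname -n` prints on this host.
    return os.uname().nodename


def mismatched_hosts(hostlist, node_names):
    """Return hostlist entries that are not an exact match for any node name."""
    known = set(node_names)
    return [host for host in hostlist if host not in known]


# Hypothetical example: the nodes report short names via uname -n,
# but one hostlist entry was written as an FQDN.
nodes = ["db-sql1", "db-sql2"]
hostlist = "db-sql1.example.com db-sql2".split()
print(mismatched_hosts(hostlist, nodes))  # ['db-sql1.example.com']
```

Any name this prints is a hostlist entry worth comparing by hand
against the uname -n output of the node it is meant to describe.
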
> That causes
> errors like
>
> crm_verify[24809]: 2008/11/25_13:54:57 ERROR: unpack_rsc_op: Remapping db-sql1-shooter_start_0 (rc=1) on hostname.domainname to an ERROR
> crm_verify[24809]: 2008/11/25_13:54:57 WARN: unpack_rsc_op: Processing failed op db-sql1-shooter_start_0 on hostname.domainame: Error
> crm_verify[24809]: 2008/11/25_13:54:57 WARN: unpack_rsc_op: Compatability handling for failed op db-sql1-shooter_start_0 on hostname.domainname

Now I understand that at this time there's only one node up.

> I am posting this here hoping that someone else searching through the
> list archives finds this useful.
>
> I am still testing to see if the advertised functionality is there, so if
> anyone here spots something I am doing wrong, please help me by pointing me
> to it.
>
> Once more, thanks to everybody that helped me with that.

You're welcome.

Dejan

> Regards
> --
> Luis Motta Campos is a software engineer,
> Perl Programmer, foodie and photographer.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
