Dejan Muhamedagic wrote:
>> Before running hb_report, I executed this preparation steps:
>>
>> 1. Stop the cfengine process (which monitors the existence of a
>> heartbeat process and tries to start one up in case of absence);
>> 2. Generate ssh keys for the root user on both machines, and make sure
>> that the ~root/.ssh/authorized_keys had the appropriated keys and
>> configuration to allow a root login from the other host;
>> 3. Take the heartbeat to a full halt;
>> 4. Poke logrotate with --force and rotate all the log files;
>> 5. Take the heartbeat process up again;
>
> hb_report doesn't require restarting heartbeat or rotating logs.
> In fact, sometimes rotating logs may hide important log messages
> from it.

Thanks for the advice, Dejan. I have used hb_report before, and the way
it collects log files is sometimes not ideal. I once had a large log
file in which the bug was "hidden" far away, and I needed to truncate
the log files and reproduce the problem on a "clean" installation before
it worked as expected. That's why I followed this procedure. As I am now
relying on logrotate to keep my logfile sizes sane, I will keep your
suggestion in mind for next time.

>> At this point, I noticed that there was no errors anymore.
>>
>> I am really confused. Can someone here please explain to me what did I
>> do wrong to start with?
>
> No idea. Perhaps the software got confused as well ;-)

I managed to reproduce the problem by sending SIGKILL to the heartbeat
processes on one of the hosts. After two attempts and a couple of
minutes of log-reading, I realized that for some weird reason the host
was not being stonith'ed. It took me another check to realize that
heartbeat's definition of "hostname" is ambiguous: it may mean
$(hostname --fqdn), or $(hostname --short), or $(uname --nodename).
Sometimes it works with $(hostname -s), and sometimes you really need
$(hostname -f) to make it work.

I still have the warnings and errors, and they won't go away even with
"crm_resource -C". I believe that this is connected to the fact that
there's no eligible candidate to run the STONITH resource for the node
we just STONITH'ed (due to the recommended constraints).

That causes errors like:

crm_verify[24809]: 2008/11/25_13:54:57 ERROR: unpack_rsc_op: Remapping db-sql1-shooter_start_0 (rc=1) on hostname.domainname to an ERROR
crm_verify[24809]: 2008/11/25_13:54:57 WARN: unpack_rsc_op: Processing failed op db-sql1-shooter_start_0 on hostname.domainame: Error
crm_verify[24809]: 2008/11/25_13:54:57 WARN: unpack_rsc_op: Compatability handling for failed op db-sql1-shooter_start_0 on hostname.domainname

I am posting this here hoping that someone else searching through the
list archives finds it useful. I am still testing to see if the
advertised functionality is there, so if anyone here spots something I
am doing wrong, please point it out to me.

Once more, thanks to everybody who helped me with this.

Regards
-- 
Luis Motta Campos is a software engineer, Perl Programmer, foodie and
photographer.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
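
For anyone chasing the hostname ambiguity described in this post, the
three names can be compared side by side on each node. What follows is
only a minimal sketch, not part of heartbeat or hb_report: the helper
names short_of and try are made up for illustration, and the short
options -f, -s and -n are assumed to behave like the long forms used
above. It shows where the names differ; it does not tell you which one
heartbeat expects in a given configuration.

```shell
#!/bin/sh
# Print the three candidate "hostnames" so mismatches are easy to spot.

# Hypothetical helper: strip everything from the first dot onward,
# e.g. "node.example.com" -> "node".
short_of() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: run a command, and print a marker if it fails
# (for example, hostname -f without working name resolution).
try() {
    "$@" 2>/dev/null || printf '(unavailable)\n'
}

fqdn=$(try hostname -f)
short=$(try hostname -s)
node=$(try uname -n)

printf 'hostname -f : %s\n' "$fqdn"
printf 'hostname -s : %s\n' "$short"
printf 'uname -n    : %s\n' "$node"

# Report which form, if any, the kernel node name corresponds to.
if [ "$fqdn" = "$short" ]; then
    echo "short and fully qualified names are identical"
elif [ "$node" = "$fqdn" ]; then
    echo "uname -n matches the fully qualified name"
elif [ "$node" = "$short" ]; then
    echo "uname -n matches the short name"
else
    echo "uname -n matches neither form"
fi
```

Running this on every node and comparing the output against the node
names used in the cluster configuration and the STONITH resource
parameters should show quickly whether a short name is being used where
a fully qualified one is expected, or the other way around.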
