Dejan Muhamedagic wrote:
>> Before running hb_report, I executed these preparation steps:
>>
>> 1. Stop the cfengine process (which monitors the existence of a
>> heartbeat process and tries to start one up in case of absence);
>> 2. Generate ssh keys for the root user on both machines, and make sure
>> that ~root/.ssh/authorized_keys had the appropriate keys and
>> configuration to allow a root login from the other host;
>> 3. Take the heartbeat to a full halt;
>> 4. Poke logrotate with --force and rotate all the log files;
>> 5. Take the heartbeat process up again;
> 
> hb_report doesn't require restarting heartbeat or rotating logs.
> In fact, sometimes rotating logs may hide important log messages
> from it.

Thanks for the advice, Dejan. I had an experience before with hb_report,
and the way it collects log files sometimes is not ideal. I had a large
log file in the past and the bug was "hidden" far away, and I needed to
truncate the log files and reproduce the problem on a "clean"
installation before it worked as expected. That's why I followed this
procedure.

As I am relying on logrotate to keep my logfile sizes sane now, I will
keep your suggestion in mind for the next time.

>> At this point, I noticed that there were no errors anymore.
>>
>> I am really confused. Can someone here please explain to me what I
>> did wrong to start with?
> 
> No idea. Perhaps the software got confused as well ;-)

I managed to reproduce the problem by sending SIGKILL to the heartbeat
processes on one of the hosts. After two attempts and a couple of
minutes of log-reading, I realized that for some weird reason the host
was not being stonith'ed.

It took me another check to realize that heartbeat's definition of
"hostname" is ambiguous: it may mean $(hostname --fqdn), or $(hostname
--short), or $(uname --nodename). Sometimes, it works with $(hostname
-s), and sometimes you really need $(hostname -f) to make it work.
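
For anyone hitting the same thing, here is a rough sketch of how I would
check which form a host actually reports. The helper name which_form is
made up for this example, and I am assuming that the name the cluster
expects is the one printed by "uname -n"; please verify that against
your own configuration before trusting it.

```shell
#!/bin/sh
# which_form NODE FQDN SHORT
# Hypothetical helper: reports which hostname form NODE is equal to.
# Prints "both", "fqdn", "short" or "other".
which_form() {
    node=$1
    fqdn=$2
    short=$3
    if [ "$node" = "$fqdn" ] && [ "$node" = "$short" ]; then
        echo "both"
    elif [ "$node" = "$fqdn" ]; then
        echo "fqdn"
    elif [ "$node" = "$short" ]; then
        echo "short"
    else
        echo "other"
    fi
}

# Collect the three candidate names. hostname may fail if name
# resolution is broken, so errors are ignored and the value stays empty.
fqdn_name=$(hostname --fqdn 2>/dev/null || true)
short_name=$(hostname --short 2>/dev/null || true)
node_name=$(uname -n)   # same as "uname --nodename" on GNU systems

echo "hostname --fqdn  : $fqdn_name"
echo "hostname --short : $short_name"
echo "uname -n         : $node_name"
echo "uname -n matches : $(which_form "$node_name" "$fqdn_name" "$short_name")"
```

Running this on both nodes and comparing the output with the node names
used in the cluster configuration and constraints should show quickly
whether the short or the fully qualified name is the one to use.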

I still have the warnings and errors, and they won't go away even with
"crm_resource -C". I believe that this is connected to the fact that
there's no eligible candidate to run the STONITH resource for the node
we just STONITH'ed (due to the recommended constraints). That causes
errors like

crm_verify[24809]: 2008/11/25_13:54:57 ERROR: unpack_rsc_op: Remapping
db-sql1-shooter_start_0 (rc=1) on hostname.domainname to an ERROR
crm_verify[24809]: 2008/11/25_13:54:57 WARN: unpack_rsc_op: Processing
failed op db-sql1-shooter_start_0 on hostname.domainname: Error
crm_verify[24809]: 2008/11/25_13:54:57 WARN: unpack_rsc_op:
Compatability handling for failed op db-sql1-shooter_start_0 on
hostname.domainname

I am posting this here hoping that someone else searching through the
list archives finds this useful.

I am still testing to see if the advertised functionality is there, so if
anyone here spots something I am doing wrong, please point it out to
me.

Once more, thanks to everybody who helped me with this.

Regards
-- 
Luis Motta Campos is a software engineer,
Perl Programmer, foodie and photographer.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems