Hi,

On Tue, Nov 25, 2008 at 01:58:35PM +0100, Luis Motta Campos wrote:
> Dejan Muhamedagic wrote:
> >> Before running hb_report, I executed these preparation steps:
> >>
> >> 1. Stop the cfengine process (which monitors the existence of a
> >> heartbeat process and tries to start one up in case of absence);
> >> 2. Generate ssh keys for the root user on both machines, and make sure
> >> that ~root/.ssh/authorized_keys had the appropriate keys and
> >> configuration to allow a root login from the other host;
> >> 3. Take the heartbeat to a full halt;
> >> 4. Poke logrotate with --force and rotate all the log files;
> >> 5. Bring the heartbeat process up again;
> > 
> > hb_report doesn't require restarting heartbeat or rotating logs.
> > In fact, sometimes rotating logs may hide important log messages
> > from it.
> 
> Thanks for the advice, Dejan. I had an experience before with hb_report,
> and the way it collects log files sometimes is not ideal.

Most probably it's not.

> I had a large
> log file in the past and the bug was "hidden" far away, and I needed to
> truncate the log files

But you shouldn't need to do this. If the logs are large, it will
take a while, but a log segment corresponding to the given time
specification should be found. If it isn't, please file a
bugzilla.

> and reproduce the problem on a "clean"
> installation before it worked as expected. That's why I followed this
> procedure.
> 
> As I am relying on logrotate to keep my logfile sizes sane now, I will
> keep your suggestion in mind for the next time.
> 
> >> At this point, I noticed that there were no errors anymore.
> >>
> >> I am really confused. Can someone here please explain to me what I
> >> did wrong to start with?
> > 
> > No idea. Perhaps the software got confused as well ;-)
> 
> I managed to reproduce the problem by sending SIGKILL to the heartbeat
> processes on one of the hosts. After two attempts and a couple of
> minutes of log-reading, I realized that for some weird reason the host
> was not being stonith'ed.
> 
> It took me another check to realize that heartbeat's definition of
> "hostname" is ambiguous: it may mean $(hostname --fqdn), or $(hostname
> --short), or $(uname --nodename). Sometimes, it works with $(hostname
> -s), and sometimes you really need $(hostname -f) to make it work.

The one which should be used is uname -n. At any rate, names in
the hostlist in the stonith conf should match your node names.
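
As a rough sketch of that check: the snippet below compares the output
of uname -n against a hostlist. The hostlist value and the helper name
in_hostlist are made up for illustration, and it assumes the hostlist
is whitespace-separated, so adjust it to your own stonith resource.

```shell
#!/bin/sh
# Sketch only: check that this node's name (uname -n) appears,
# as an exact match, in a STONITH hostlist.

# in_hostlist NAME "HOST1 HOST2 ..."
# Returns 0 if NAME is one of the whitespace-separated hosts, 1 otherwise.
in_hostlist() {
    name=$1
    for h in $2; do
        if [ "$h" = "$name" ]; then
            return 0
        fi
    done
    return 1
}

node=$(uname -n)
# Hypothetical value; copy the real one from your stonith resource.
hostlist="db-sql1 db-sql2"

if in_hostlist "$node" "$hostlist"; then
    echo "OK: $node is in the hostlist"
else
    echo "MISMATCH: $node is not in the hostlist ($hostlist)"
fi
```

The match is deliberately exact, so a short name in the hostlist will
not match a fully qualified node name or the other way around.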

> I still have the warnings and errors, and they won't go away even with
> "crm_resource -C". I believe that this is connected to the fact that
> there's no eligible candidate to run the STONITH resource for the node
> we just STONITH'ed  (due to the recommended constraints).

Most probably.

> That causes
> errors like
> 
> crm_verify[24809]: 2008/11/25_13:54:57 ERROR: unpack_rsc_op: Remapping
> db-sql1-shooter_start_0 (rc=1) on hostname.domainname to an ERROR
> crm_verify[24809]: 2008/11/25_13:54:57 WARN: unpack_rsc_op: Processing
> failed op db-sql1-shooter_start_0 on hostname.domainame: Error
> crm_verify[24809]: 2008/11/25_13:54:57 WARN: unpack_rsc_op:
> Compatability handling for failed op db-sql1-shooter_start_0 on
> hostname.domainname

Now I understand that at this time there's only one node up.

> I am posting this here hoping that someone else searching through the
> list archives finds this useful.
> 
> I am still testing to see if the advertised functionality is there, so if
> anyone here spots something I am doing wrong, please help by pointing it
> out to me.
> 
> Once more, thanks to everybody that helped me with that.

You're welcome.

Dejan

> Regards
> -- 
> Luis Motta Campos is a software engineer,
> Perl Programmer, foodie and photographer.
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems