Hi David,

On Wed, Jun 09, 2010 at 04:37:28PM -0700, David Birdsong wrote:
> I'm pretty excited to start using halog, but dumping out the usage is
> about the only documentation I can turn up -which is not explaining
> anything to me.  Is there anything more substantial on how to use
> halog?

You're right. At the beginning it was just a tool to help me spot
production issues; then I added features and explained to a few
people how to use it. But obviously some documentation is missing.

I'll be quick here, but I hope it will help you get started. First, you
should see it as a haproxy-specific grep with a few enhanced filters
and outputs. It can only produce one output format at a time, but you
can combine several input filters.

Input filters :
   -e : only consider lines which don't report an error (timeout, connect,
        5xx, ...)

   -E : only consider lines which do report an error (timeout, connect,
        5xx, ...)

   -rt XXX : only consider lines with server response times higher than XXX ms

   -RT XXX : only consider lines with server response times lower than XXX ms

   -ad XXX : only consider lines which indicate an accept time after a
             silence of XXX ms

   -ac XXX : to be used with -ad; only consider those lines if at least
             XXX lines are grouped after the silence

   -v : invert the selection

Some filters are incompatible. You can have only one of -e and -E, and you can
have only one of -rt and -RT.

Since some syslogs add a field for the sender's host and others don't, you can
adjust the field offset with -s. By default, "-s 1" is assumed, to skip one
field for the origin host. You can use -s 0 if your syslog does not add it (or
if you use netcat to log), or -s 2 if your syslog adds other fields. Negative
values are also permitted if that helps.
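For example, filters can be combined on one command line. A quick sketch
(the log path below is just an illustration; adjust -s to match what your
syslog actually emits):

```shell
# Error lines only, with a server response time above 500 ms,
# assuming the syslog added one field for the origin host (the default):
halog -E -rt 500 -s 1 < /var/log/haproxy.log

# Same thing on a log captured with netcat (no host field added):
halog -E -rt 500 -s 0 < /var/log/haproxy.log
```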

The output format can be selected with the following flags :

   -q  : don't show a warning for unparsable lines (eg: "server XXX is UP")

   -c  : only report the number of lines which match

   -gt : output a list of x,y values to feed to gnuplot to visually
         check that everything's OK. This was the tool's first use, but
         it's not used much anymore, as it was not very convenient for
         exporting values.

   -pct: report a percentile table for each of the four timers (request
         time, connect time, response time, data time). The output gives
         the percentage and absolute number of requests served in less
         than XXX ms for each field. It's very helpful for quickly
         spotting TCP retransmits, because you can see whether you have
         large 3-second steps. It's also convenient in production when
         you suspect a site is slow: a quick check tells you whether your
         timers are slower than on other days.

   -st : report the distribution of the status codes (200, 302, ...).
         Again, this is meant as a quick help. You run it when you suspect
         an issue and you immediately see whether some files are missing
         (404) or some errors are reported.

   -srv: enumerate all servers found in the logs with their respective
         status codes distribution (2xx, 3xx, 4xx, 5xx), the number of
         errors (-1 anywhere in a timer), the error ratio, the average
         response time (without data) and the average connect time.


   -ad and -ac provide a special output. I don't remember the exact format;
         they were developed to track an issue with huge packet losses. As
         far as I remember, they only report the accept time of requests
         matching the criteria, the length of the silence, and the number
         of requests accepted at once. The goal was to find abnormally
         long silences. For instance, if you have a load between 500 and
         2000 hits/s 24 hours a day, you're almost certain that a
         one-second silence indicates an issue. Being able to spot the end
         of silences and compare them across several machines helps find
         the origin of the trouble (switch, machine swapping, etc.).
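To make the output flags above concrete, here is how they typically look on
the command line (the log path is only an example):

```shell
# Status code distribution -- the quick first check:
halog -st < /var/log/haproxy.log

# Per-server breakdown when -st shows an abnormal distribution:
halog -srv < /var/log/haproxy.log

# Percentile table of the four timers, e.g. over a full day's log:
halog -pct < /var/log/haproxy.log
```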

In practice, you generally just want to run -st when you think you may be
encountering trouble. If you see an abnormal error distribution, you'll
rerun with -srv to find which server is the culprit (if any). I know some
people who run that continuously, coupled with a tail -5000. That way they
get a near-realtime stats distribution for their servers.
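One way to sketch that continuous setup (using watch(1) is my assumption
here, and the log path is illustrative; -q just silences warnings about
unparsable lines):

```shell
# Refresh a per-server view of the last 5000 log lines every 2 seconds:
watch -n 2 'tail -5000 /var/log/haproxy.log | halog -srv -q'
```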

The percentile output is more useful on full-day logs; it helps compare
heavy days with calm ones in terms of response times. But it can also be
used by prod people to quickly check for errors. At least from what I have
observed, sometimes people are not sure about the fields, but they're quite
sure that two outputs don't look similar and that one of them indicates a
problem. That's already a good thing, because they can say in one second
"everything looks OK to me".

Last point: I found that -rt/-RT can be used for debugging, as they help
spot abnormally long requests. In this case, you'll end up running the
tool several times in a row. I found it very convenient to first do a
"halog -e < file > /dev/shm/file", then run all further queries against
/dev/shm/file to ensure there's no more disk activity. It requires that
your file fits in /dev/shm though, which is often not the case.
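That workflow might look like this (file names are just illustrations):

```shell
# Filter once: keep only the error-free lines, in RAM-backed storage:
halog -e < /var/log/haproxy.log > /dev/shm/clean.log

# Then iterate cheaply on the filtered copy, narrowing the threshold:
halog -rt 1000 -pct < /dev/shm/clean.log
halog -rt 2000 -srv < /dev/shm/clean.log
```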

Hoping this helps,
Willy

