Hi David,
On Wed, Jun 09, 2010 at 04:37:28PM -0700, David Birdsong wrote:
> I'm pretty excited to start using halog, but dumping out the usage is
> about the only documentation I can turn up -which is not explaining
> anything to me. Is there anything more substantial on how to use
> halog?
You're right. At the beginning, it was just a tool to help me spot
production issues; then I added features and explained to a few
people how to use it. But obviously some documentation is missing.
I'll be quick here, but I hope it will help you get started. First, you
should see it as an haproxy-specific grep with a few enhanced filters
and output formats. It can only produce one output format at a time, but
you can combine several input filters.
Input filters:
  -e      : only consider lines which don't report an error (timeout,
            connect, 5xx, ...)
  -E      : only consider lines which do report an error (timeout,
            connect, 5xx, ...)
  -rt XXX : only consider lines with server response times higher than XXX ms
  -RT XXX : only consider lines with server response times lower than XXX ms
  -ad XXX : only consider lines which indicate an accept time after a
            silence of XXX ms
  -ac XXX : to be used with -ad; only consider those lines if at least
            XXX lines are grouped after the silence
  -v      : invert the selection
Some filters are incompatible: you can have only one of -e and -E, and
only one of -rt and -RT.
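For example, to keep only the error lines whose server response time
exceeded 500 ms (the file name below is just a placeholder for your own
haproxy log):

```shell
# combine two input filters: errors only, and response time > 500 ms
halog -E -rt 500 < haproxy.log
```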
Since some syslog daemons add a field for the sender's host and others
don't, you can adjust the field offsets with -s. By default, "-s 1" is
assumed, to skip one field for the origin host. You can use -s 0 if your
syslog does not add it (or if you use netcat to log), or -s 2 if your
syslog adds other fields. Negative values are also permitted if that helps.
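For instance, if your logs were captured with netcat and therefore have
no sender host field (file name assumed):

```shell
# skip zero fields before the haproxy part of each line
halog -s 0 -e < haproxy.log
```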
The output format can be selected with the following flags:
-q : don't show a warning for unparsable lines (eg: "server XXX is UP")
-c : only report the number of lines which match
-gt : output a list of x,y values to be used with gnuplot to visually
      check that everything's OK. This was its first use, but it's not
      used much anymore, as it was not very convenient to export values.
-pct: report a percentile table of request time, connect time, response
      time and data time. The output contains the percentage and absolute
      number of requests served in less than XXX ms for each field. It's
      very helpful to quickly spot TCP retransmits, because you can see
      whether you have large 3-second steps. It is also convenient to use
      on production when you suspect a site is slow: a quick check tells
      you whether your timers are slower than on other days.
-st : report the distribution of the status codes (200, 302, ...). Again,
this is meant as a quick help. You run that when you suspect an
issue and you immediately see if some files are missing (404) or
some errors are reported.
-srv: enumerate all servers found in the logs with their respective
      status code distribution (2xx, 3xx, 4xx, 5xx), the number of
      errors (-1 anywhere in a timer), the error ratio, the average
      response time (without data) and the average connect time.
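As a quick illustration of the output flags above (file name assumed),
each of these reads the whole log and prints one report:

```shell
halog -pct < haproxy.log   # percentile table of the four timers
halog -st  < haproxy.log   # distribution of status codes
halog -srv < haproxy.log   # per-server status codes and error ratios
```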
-ad and -ac provide a special output. I don't remember the exact format;
      they were developed to track an issue with huge packet losses. I
      seem to remember they only report the accept time of the requests
      matching the criteria, the length of the silence, as well as the
      number of requests accepted at once. The goal was to find abnormally
      long silences. For instance, if you have a load between 500 and
      2000 hits/s 24 hours a day, you're almost certain that a one-second
      silence indicates an issue. Being able to spot the end of silences
      and compare them across several machines helps find the origin of
      the trouble (switch, machine swapping, etc.).
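So an invocation to hunt for such silences might look like this (file
name and thresholds are just examples):

```shell
# report accepts that follow a silence of 1000 ms or more,
# but only when at least 10 requests were grouped after it
halog -ad 1000 -ac 10 < haproxy.log
```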
In practice, you generally just want to run -st when you think you may be
encountering trouble. If you see an abnormal error distribution, then
you rerun with -srv to find which server is the culprit (if any). I know
some people who run that continuously, coupled with a "tail -5000". That
way they get a real-time stats distribution for their servers.
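That two-step workflow, plus the continuous variant, would look like
this (file name assumed):

```shell
halog -st < haproxy.log                 # 1) check the status distribution
halog -srv < haproxy.log                # 2) if it looks bad, find the server
tail -5000 haproxy.log | halog -srv     # continuous check on recent traffic
```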
The percentile output is meant more for full-day logs; it helps compare
heavy days with calm ones in terms of response times. But it can also be
used by production people to quickly check whether there are any errors.
At least from what I have observed, sometimes people are not sure about
the fields, but they're quite sure that two outputs don't look similar
and that one of them indicates a problem. That's already a good thing,
because they can say in one second "everything looks OK to me".
Last point: I found that -rt/-RT can be used for debugging, as they help
spot abnormally long requests. In this case, you'll end up running the
tool several times in a row. I found it very convenient to first do a
"halog -e < file > /dev/shm/file", then run all further searches from
/dev/shm/file to ensure there's no disk activity anymore. It requires
that your file fits in /dev/shm though, which is not often the case.
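Concretely, that sequence could be (paths and thresholds assumed):

```shell
# one pass over the slow disk: keep only error-free lines in tmpfs
halog -e < /var/log/haproxy.log > /dev/shm/haproxy.clean

# then iterate cheaply on response-time thresholds from memory
halog -rt 1000 < /dev/shm/haproxy.clean
halog -rt 5000 < /dev/shm/haproxy.clean
```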
Hoping this helps,
Willy