On Wed, Jun 9, 2010 at 10:09 PM, Willy Tarreau <w...@1wt.eu> wrote:
> Hi David,
>
> On Wed, Jun 09, 2010 at 04:37:28PM -0700, David Birdsong wrote:
>> I'm pretty excited to start using halog, but dumping out the usage is
>> about the only documentation I can turn up -which is not explaining
>> anything to me.  Is there anything more substantial on how to use
>> halog?
>
> you're right. At the beginning, it was just a tool to help me spot
> production issues, then I added features and explained to a few
> people how to use it. But obviously some doc is missing.
>
> I'll be quick here but I hope it will help you to start. First, you
> should see it as a haproxy-specific grep with a few enhanced filters
> and outputs. It can only do one output format at a time, but you can
> combine several input filters.
>
> Input filters :
>   -e : only consider lines which don't report an error (timeout, connect,
>        5xx, ...)
>
>   -E : only consider lines which do report an error (timeout, connect,
>        5xx, ...)
>
>   -rt XXX : only consider lines with server response times higher than XXX ms
>
>   -RT XXX : only consider lines with server response times lower than XXX ms
>
>   -ad XXX : only consider lines which indicate an accept time after a
>             silence of XXX ms
>
>   -ac XXX : to be used with -ad, only consider those lines if at least
>             XXX lines are grouped after the silence.
>
>   -v : invert the selection
>
> Some filters are incompatible. You can have only one of -e and -E, and you can
> have only one of -rt and -RT.
>
> Since some syslogs add a field for the sender's host and others don't, you can
> adjust the field offset with -s. By default, "-s 1" is assumed, to skip one
> field for the origin host. You can use -s 0 if your syslog does not add it (or
> if you use netcat to log). Or you can use -s 2 if your syslog adds other
> fields. Negative values are also permitted if that helps.
>
> The output format can be selected with the following flags :
>
>   -q  : don't show a warning for unparsable lines (eg: "server XXX is UP")
>
>   -c  : only report the number of lines which match
>
>   -gt : outputs a list of x,y values to be used with gnuplot to visually
>         check if everything's OK. It was its first use, but it's not used
>         anymore, as it was not very convenient to export values.
>
>   -pct: report a percentile table of request time, connect time, response
>         time, data time. The output contains the percent and absolute number
>         of requests served in less than XXX ms for each field. It's very
>         helpful to quickly spot TCP retransmits because you can see if you
>         have large 3 seconds steps. Also, it is convenient to use on prod
>         when you suspect a site is slow. Just a quick check and you can
>         tell if your timers are slower than other days.
>
>   -st : report the distribution of the status codes (200, 302, ...). Again,
>         this is meant as a quick help. You run that when you suspect an
>         issue and you immediately see if some files are missing (404) or
>         some errors are reported.
>
>   -srv: enumerate all servers found in the logs with their respective
>         status codes distribution (2xx, 3xx, 4xx, 5xx), the number of
>         errors (-1 anywhere in a timer), the error ratio, the average
>         response time (without data) and the average connect time.
>
>
>   -ad and -ac provide a special output. I don't remember the format; they
>         were developed to track an issue with huge packet losses. I seem
>         to remember they only report the time of the accept of requests
>         matching the criteria, the length of the silence as well as the
>         number of requests accepted at once. The goal was to find abnormally
>         long silences. For instance, if you have a load between 500 and
>         2000 hits/s 24h a day, you're almost certain that a one second
>         silence indicates an issue. Being able to spot the end of silences
>         and compare them on several machines helps find the origin of the
>         trouble (switch, machine swapping, etc...)
>
wow, thanks for the run-down.  there's a lot here; plenty to get me started.
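
For readers trying to picture what -ad and -ac select, here is a rough
Python sketch of the idea as described above. It is not halog's code or
its output format. The function name find_silences, the use of ">=" for
the threshold, and the reading of "accepted at once" as "sharing the same
accept timestamp" are all assumptions made for illustration.

```python
from typing import List, Tuple


def find_silences(accept_ms: List[int], min_silence_ms: int,
                  min_burst: int = 1) -> List[Tuple[int, int, int]]:
    """Return (accept_time, silence_length, burst_size) for each gap found.

    accept_ms must be a sorted list of accept timestamps in milliseconds.
    min_silence_ms plays the role of the -ad value, min_burst that of -ac.
    """
    results = []
    n = len(accept_ms)
    i = 1
    while i < n:
        gap = accept_ms[i] - accept_ms[i - 1]
        if gap >= min_silence_ms:
            # Assumption: requests "accepted at once" share one timestamp.
            j = i
            while j < n and accept_ms[j] == accept_ms[i]:
                j += 1
            burst = j - i
            if burst >= min_burst:
                results.append((accept_ms[i], gap, burst))
            i = j  # continue after the burst
        else:
            i += 1
    return results


if __name__ == "__main__":
    # Steady traffic, a 1.5 s silence, then three requests accepted together.
    stamps = [0, 1, 2, 1502, 1502, 1502, 1503, 4000]
    for when, silence, burst in find_silences(stamps, 1000, min_burst=2):
        print(when, silence, burst)
```

Running the same input on timestamps from several machines and comparing
where the silences end is the kind of cross-check described in the quoted
text.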

> In practice, you generally just want to run -st when you think you may be
> running into trouble. If you see an abnormal error distribution, then
> you'll rerun with -srv to find which server is the culprit (if any). I know
> some people who run that continuously coupled with a tail -5000. That way
> they get a realtime stats distribution for their servers.
>
thanks, -srv i think is what i've been hoping for to track down bad
backends in a backend section that has roughly 400 servers.
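
To make the -srv columns concrete, here is a small Python sketch of the
kind of per-server summary described above: status class counts, errors
(a -1 in a timer), error ratio, and average connect and response times.
It is only an illustration, not halog's implementation. The Entry type
and the summarize function are hypothetical, and the sketch assumes the
log lines have already been parsed into those fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple


class Entry(NamedTuple):
    """Hypothetical, already-parsed log entry (not a halog structure)."""
    server: str
    status: int
    connect_ms: int   # -1 means the timer reported a failure
    response_ms: int  # -1 means the timer reported a failure


def summarize(entries: Iterable[Entry]) -> Dict[str, dict]:
    """Aggregate per-server status classes, error ratio and mean timers."""
    acc = defaultdict(lambda: {"2xx": 0, "3xx": 0, "4xx": 0, "5xx": 0,
                               "other": 0, "errors": 0, "total": 0,
                               "c_sum": 0, "c_n": 0, "r_sum": 0, "r_n": 0})
    for e in entries:
        s = acc[e.server]
        s["total"] += 1
        cls = "%dxx" % (e.status // 100)
        s[cls if cls in ("2xx", "3xx", "4xx", "5xx") else "other"] += 1
        if e.connect_ms < 0 or e.response_ms < 0:
            s["errors"] += 1
        # Only valid timers contribute to the averages.
        if e.connect_ms >= 0:
            s["c_sum"] += e.connect_ms
            s["c_n"] += 1
        if e.response_ms >= 0:
            s["r_sum"] += e.response_ms
            s["r_n"] += 1

    out = {}
    for name, s in acc.items():
        out[name] = {
            "2xx": s["2xx"], "3xx": s["3xx"], "4xx": s["4xx"],
            "5xx": s["5xx"], "other": s["other"], "errors": s["errors"],
            "error_ratio": s["errors"] / s["total"],
            "avg_connect_ms": s["c_sum"] / s["c_n"] if s["c_n"] else None,
            "avg_response_ms": s["r_sum"] / s["r_n"] if s["r_n"] else None,
        }
    return out


if __name__ == "__main__":
    sample = [Entry("srv1", 200, 2, 10), Entry("srv1", 503, -1, -1),
              Entry("srv2", 404, 1, 5)]
    for name, stats in sorted(summarize(sample).items()):
        print(name, stats)
```

With hundreds of servers, sorting such a summary by error ratio is one
way to bring the bad backends to the top.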

> The percentile output is more to be used on full day logs, it helps check
> how heavy days compare with calm ones in terms of response times. But it
> can be used by prod people to quickly check if there are any errors. At
> least from what I have observed, sometimes people are not sure about the
> fields, but they're quite sure that two outputs don't look similar and
> that one of them indicates a problem. That's already a good thing because
> they can say in one second "everything looks OK to me".
>
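
As an aside on reading a percentile table, the Python sketch below shows
the general idea: for each percentile, how many requests fall under it
and the time below which they were served. It is an illustration under
assumptions, not halog's algorithm or output layout; percentile_table is
a hypothetical name and the sample timings are made up.

```python
from typing import List, Sequence, Tuple


def percentile_table(times_ms: Sequence[int],
                     percents: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Return (percent, request_count, time_ms) rows.

    Each row says: `request_count` requests (that `percent` of the total,
    rounded up) were served in at most `time_ms` milliseconds.
    """
    ordered = sorted(times_ms)
    n = len(ordered)
    if n == 0:
        return []
    rows = []
    for pct in percents:
        # Integer ceiling of n * pct / 100, and at least one request.
        count = max(1, (n * pct + 99) // 100)
        rows.append((pct, count, ordered[count - 1]))
    return rows


if __name__ == "__main__":
    # Made-up connect times: nine fast requests and one 3-second outlier,
    # the kind of step that hints at a TCP retransmit.
    connect_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 3000]
    for pct, count, t in percentile_table(connect_ms, [50, 90, 100]):
        print(pct, count, t)
```

Comparing two such tables from a heavy day and a calm day side by side
is enough to see whether the timers have shifted.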
> Last point, I found that -rt/-RT can be used for debugging, as they help
> spot abnormally long requests. In this case, you'll end up running the
> tool several times in a row. I found it very convenient to first do a
> "halog -e < file > /dev/shm/file" then run all searches from /dev/shm/file
> to ensure there's no disk activity anymore. It requires that your file
> fits in /dev/shm though, which is not often the case.
>
actually, we've configured syslog-ng to log to /dev/shm already ;)

> Hoping this helps,
> Willy
>
>
