Hi David,

On Wed, Jun 09, 2010 at 04:37:28PM -0700, David Birdsong wrote:
> I'm pretty excited to start using halog, but dumping out the usage is
> about the only documentation I can turn up -which is not explaining
> anything to me. Is there anything more substantial on how to use
> halog?
you're right. At the beginning it was just a tool to help me spot production
issues, then I added features and explained to a few people how to use it.
But obviously some doc is missing. I'll be quick here but I hope it will help
you get started.

First, you should see it as a haproxy-specific grep with a few enhanced
filters and outputs. It can only produce one output format at a time, but you
can combine several input filters.

Input filters :
  -e      : only consider lines which don't report an error (timeout,
            connect, 5xx, ...)
  -E      : only consider lines which do report an error (timeout, connect,
            5xx, ...)
  -rt XXX : only consider lines with server response times higher than XXX ms
  -RT XXX : only consider lines with server response times lower than XXX ms
  -ad XXX : only consider lines which indicate an accept time after a silence
            of XXX ms
  -ac XXX : to be used with -ad; only consider those lines if at least XXX
            lines are grouped after the silence
  -v      : invert the selection

Some filters are incompatible : you can have only one of -e and -E, and only
one of -rt and -RT.

Since some syslogs add a field for the sender's host and others don't, you
can adjust the field offsets with -s. By default, "-s 1" is assumed, to skip
one field for the origin host. You can use -s 0 if your syslog does not add
it (or if you use netcat to log), or -s 2 if your syslog adds other fields.
Negative values are also permitted if that helps.

The output format can be selected with the following flags :
  -q   : don't show a warning for unparsable lines (eg: "server XXX is UP")
  -c   : only report the number of lines which match
  -gt  : output a list of x,y values to be used with gnuplot to visually
         check that everything is OK. This was its first use, but it is not
         used anymore, as it was not very convenient for exporting values.
  -pct : report a percentile table of request time, connect time, response
         time and data time. The output contains the percentage and absolute
         number of requests served in less than XXX ms for each field. It is
         very helpful to quickly spot TCP retransmits, because you can see
         whether you have large 3-second steps. It is also convenient to use
         in production when you suspect a site is slow : a quick check tells
         you whether your timers are slower than on other days.
  -st  : report the distribution of the status codes (200, 302, ...). Again,
         this is meant as a quick help. You run it when you suspect an issue
         and you immediately see whether some files are missing (404) or some
         errors are reported.
  -srv : enumerate all servers found in the logs with their respective status
         code distribution (2xx, 3xx, 4xx, 5xx), the number of errors (-1
         anywhere in a timer), the error ratio, the average response time
         (without data) and the average connect time.

-ad and -ac provide a special output. I don't remember the exact format; they
were developed to track an issue with huge packet losses. I seem to remember
they only report the accept time of the requests matching the criteria, the
length of the silence, as well as the number of requests accepted at once.
The goal was to find abnormally long silences. For instance, if you have a
load between 500 and 2000 hits/s 24 hours a day, you are almost certain that
a one-second silence indicates an issue. Being able to spot the end of
silences and compare them on several machines helps find the origin of the
trouble (switch, machine swapping, etc.).

In practice, you generally just want to run -st when you think you may be
encountering trouble.
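For example, a quick check could look like this (the log file name is only a
placeholder, adjust it to your own setup) :

  # status code distribution for the whole log
  halog -st < /var/log/haproxy.log

  # count the error-free requests whose server response time exceeds 1000 ms
  halog -e -rt 1000 -c < /var/log/haproxy.log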
If you see an abnormal error distribution, you'll then rerun with -srv to
find which server is the culprit (if any). I know some people who run that
continuously, coupled with a tail -5000. That way they get a realtime stats
distribution for their servers.

The percentile output is more intended for full-day logs; it helps check how
heavy days compare with calm ones in terms of response times. But it can also
be used by prod people to quickly check whether there are any errors. At
least from what I have observed, sometimes people are not sure about the
fields, but they are quite sure that two outputs don't look similar and that
one of them indicates a problem. That's already a good thing, because they
can say in one second "everything looks OK to me".

Last point, I found that -rt/-RT can be useful for debugging, as they help
spot abnormally long requests. In this case, you'll end up running the tool
several times in a row. I found it very convenient to first do a
"halog -e < file > /dev/shm/file", then run all subsequent searches on
/dev/shm/file to make sure there is no disk activity anymore. It requires
that your file fits in /dev/shm though, which is not often the case.

Hoping this helps,
Willy
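PS: to make the two workflows above a bit more concrete, here is roughly what
they look like in a shell (file names and the 500 ms threshold are only
examples) :

  # per-server stats over the last 5000 log lines, re-run as often as needed
  tail -n 5000 /var/log/haproxy.log | halog -srv

  # keep only the error-free lines in a RAM-backed copy, then iterate on it
  halog -e < /var/log/haproxy.log > /dev/shm/haproxy.clean
  halog -pct < /dev/shm/haproxy.clean
  halog -rt 500 -c < /dev/shm/haproxy.clean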