Hi Krzysztof,

On Mon, Aug 31, 2009 at 09:27:40PM +0200, Krzysztof Piotr Oledzki wrote:
> >From cbfda2a0808cd5d5cbf17ec7f0d04f7091bec9cb Mon Sep 17 00:00:00 2001
> From: Krzysztof Piotr Oledzki <[email protected]>
> Date: Mon, 31 Aug 2009 21:04:02 +0200
> Subject: [MEDIUM] Collect & show information about last health check
> 
> Collect information about last health check result,
> including L5-7 code if possible (for example http or smtp
> return code) and time took to finish last check.
> 
> Health check info is provided on both stats pages (html & csv)
> and logged when a server is marekd UP or DOWN. Currently active
> check are marked with an asterisk, but only in html mode.

I have tested this a bit, and I must say this is awesome. The idea
of marking a currently active check is really smart too !

However, I found that it was hard to understand the status codes in
the HTML stats page. Some people are already complaining about columns
they don't trivially understand, but here I think that status codes
are close to cryptic. Also, while it is not *that* hard to tell
which one means what when you can compare all of them, they must be
unambiguous when found individually.

I could propose some adjustments, but it's just a proposal, I'm open
to other suggestions :

> Currently there are 9 status:
>  UNK       -> unknown

>From my tests, this means this one is not tested. Maybe we should
not indicate anything at all then, or just report "NONE" ? Also we
would not want the "in 0 ms" indicator either for this case.

>  SOCKERR   -> socket error

Good idea to detect that one.

>  L14OK     -> check passed on layer 1-4, no upper layers testing enabled
>  L14TMOUT  -> layer 1-4 timeout
>  L14UNR    -> layer 1-4 unreachable, for example
>          "Connection refused" (tcp rst) or "No route to host" (icmp)

I'd suggest "L4" instead of "L14". And also after all, layer 4 is what
we're aware of, and people are used to associate it to TCP (eventhough
TCP is rather layer 5 if we're purist). I'd suggest shortening the
timeout label, something like L4TIM, L4TOUT, L4TMO, ... I think the
former is more intuitive.

Would you agree to rename L14UNR to L4CON, to indicate a failure to
connect ? In fact, "unreachable" is a bit too precise and not necessarily
right. It's common to speak about unreachability in terms of routing,
and I find it confusing to read UNR when a check fails on my own machine.
Also, if one day we get more precise reports by exploiting errno a bit
better, we will then be able to report a real unreachable error.

>  L57OK     -> check passed on layer 5-7
>  L57TMOUT  -> layer 5-7 (HTTP/SMTP/SSL) timeout
>  L57INVRSP -> layer 5-7 invalid response - protocol error
>  L57RSPERR -> layer 5-7 response error, for example HTTP 5xx

Same principle here, I'd suggest L7 instead of L57 for the same reasons
as above. Let's shorten the TMOUT one too. We could also turn L57INVRSP
into L7INV.

The response error indicates that the server was reached but that its
status code was not accepted. Would you be OK with L7STS then ?

I also suggest that we reduce the column name "Last check" to
"LastChk" to prevent the browser from wrapping it over two lines.

During my tests, I got a SOCKERR while the server was stopped. I
think it's an asynchronous "connection refused" which triggered
this error, probably here (but I may be wrong) :

@@ -370,8 +468,10 @@ static int event_srv_chk_w(int fd)
                        }
                        else if (ret == 0 || errno == EAGAIN)
                                goto out_poll;
-                       else
+                       else {
+                               set_server_check_status(s, HCHK_STATUS_SOCKERR);
                                goto out_error;
+                       }
                }

I also found a missing comma at the end of the CSV title (you
remember, we always get this wrong) :

> @@ -222,6 +224,7 @@ int print_csv_header(struct chunk *msg, int size)
>                           "chkfail,chkdown,lastchg,downtime,qlimit,"
>                           "pid,iid,sid,throttle,lbtot,tracked,type,"
>                           "rate,rate_lim,rate_max,"
> +                         "check_status,check_code,check_duration"
>                           "\n");

You should add one after 'check_duration', as all lines in the CSV are
terminated with a comma.

So please tell me what you think about the comments above, if you want
to repost an updated patch or if you prefer that I adjust yours, etc...

Thanks!
Willy



Reply via email to