Re: [PATCH] [RFC] Decrease server health based on http responses / events, version 2

2009-12-15 Thread Willy Tarreau
On Tue, Dec 15, 2009 at 11:29:33PM +0100, Krzysztof Ol?dzki wrote:
> >>+error_limit 
> >
> >"error-limit" please :-)
> 
> Sure. Would you like a new patch with that change?

No, don't waste your time on cosmetics. I've noticed you sent v3 while
I was replying, I'll sed on v3.

(...)
> >Maybe I'm not very clear, please tell me if so.
> 
> Yes, it is exacly how you described it - we'll be able to change 
> defaults without enabling the feature itself.

Perfect then.
> 
> Currently I'm thinking about something like:
> >server default [error-limit XX] [on-error XX] [inter XX] [fastinter XX] 
> >[downinter XX] [rise XX] [fail XX]
> + maybe some other options we might find useful.

OK that would be cool. We'd just have to use a specific keyword such
as "server-default" instead of "server default" in order not to create
an ambiguous config setting (not creating a server called "default"
and start to parse it). Hmmm please also add weight, minconn and
maxconn that some people use a lot too !

(...)
> This patch should not make healht checks more CPU hungry, however 
> failed requests cost now little more. Hopefully, if health analyses 
> feature is not enabled, it is one more call to health_adjust() which 
> simply returns in the first condition.

OK that's fine. Once again, I was more bringing information from the
field related to the area being touched than expressing concerns with
this specific patch.

> >Feel free to tell me when you'd like your patch to be merged.
> >In my opinion it's already in good shape.
> 
> I have just sent the v3 version, if you accept the changes I think we 
> can merge it now, of course with respect to s/error_limit/error-limit/g.

OK I'm doing so then.

Best regards,
Willy




Re: [PATCH] [RFC] Decrease server health based on http responses / events, version 2

2009-12-15 Thread Krzysztof Olędzki

On 2009-12-15 23:07, Willy Tarreau wrote:

Hi Krzysztof,

Hi Willy,


I forgot to check the list and did not see your mail.

On Fri, Dec 11, 2009 at 10:33:56AM +0100, Krzysztof Piotr Oledzki wrote:

Changes in this version:
 - documentation
 - close race between a started check and health analysis event
 - don't force fastinter if it is not set
 - better names for options
 - layer4 support

Possibility to set/change default options is going to be implemented in
a different patch.


OK I'm fine with this. Let's keep your work in small chunks for
easier handling.


diff --git a/doc/configuration.txt b/doc/configuration.txt
index ef768f8..bdb5c78 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -4642,6 +4642,13 @@ cookie 
   the same cookie value, and it is in fact somewhat common between normal and
   backup servers. See also the "cookie" keyword in backend section.
 
+error_limit 


"error-limit" please :-)


Sure. Would you like a new patch with that change?


Well, overall this looks pretty good. I have two new questions.
You said we'd have default settings for the action and consecutive
errors so that we could enable a default behaviour by just setting
the "observe XXX" statement. I still have in mind my stupid choice
of the "stats" keyword where we have lots of default settings that
are all used once we enable any of them, making it difficult to
change them once set in a defaults instance. For instance, if you
put "stats uri /stats" in a defaults intance, you have stats enabled
everywhere.

I don't think the same issue could happen here because from my
understanding we still have to enable the behaviour by using
"observe XXX". But I'd like that you take some time to check for
a few corner cases we could fall into in case we've overseen
something. For instance, when supporting settings in defaults
instance, we don't want a simple default setting of the number
of consecutive errors to enable response checking for all servers
of all backends. We just want this to set the default value.

Maybe I'm not very clear, please tell me if so.


Yes, it is exacly how you described it - we'll be able to change 
defaults without enabling the feature itself.


Currently I'm thinking about something like:

server default [error-limit XX] [on-error XX] [inter XX] [fastinter XX] 
[downinter XX] [rise XX] [fail XX]

+ maybe some other options we might find useful.


The second point is that I've got reports of people checking
very large numbers of servers (say 500) at a very high rate
(10-30 ms) because they prefer to fail a request quickly than
to send a reqeust to a server which might return bad contents.
(For this reason, I think your patch will interest them :-))
But those people are generally experiencing high CPU usages
because of the checks alone. I'm realizing that from the
start, I've never considered the checks to be on the critical
path. In such situations it can really be a critical path, and
we must keep in mind that we should keep processing low. I've
not seen anything expensive in your patch, but it's just a
reminder to help you make choices in the future if required.


This patch should not make healht checks more CPU hungry, however 
failed requests cost now little more. Hopefully, if health analyses 
feature is not enabled, it is one more call to health_adjust() which 
simply returns in the first condition.



Feel free to tell me when you'd like your patch to be merged.
In my opinion it's already in good shape.


I have just sent the v3 version, if you accept the changes I think we 
can merge it now, of course with respect to s/error_limit/error-limit/g.


Best regards,

Krzysztof Olędzki



Re: [PATCH] [RFC] Decrease server health based on http responses / events, version 2

2009-12-15 Thread Willy Tarreau
Hi Krzysztof,

I forgot to check the list and did not see your mail.

On Fri, Dec 11, 2009 at 10:33:56AM +0100, Krzysztof Piotr Oledzki wrote:
> Changes in this version:
>  - documentation
>  - close race between a started check and health analysis event
>  - don't force fastinter if it is not set
>  - better names for options
>  - layer4 support
> 
> Possibility to set/change default options is going to be implemented in
> a different patch.

OK I'm fine with this. Let's keep your work in small chunks for
easier handling.

> diff --git a/doc/configuration.txt b/doc/configuration.txt
> index ef768f8..bdb5c78 100644
> --- a/doc/configuration.txt
> +++ b/doc/configuration.txt
> @@ -4642,6 +4642,13 @@ cookie 
>the same cookie value, and it is in fact somewhat common between normal and
>backup servers. See also the "cookie" keyword in backend section.
>  
> +error_limit 

"error-limit" please :-)

Well, overall this looks pretty good. I have two new questions.
You said we'd have default settings for the action and consecutive
errors so that we could enable a default behaviour by just setting
the "observe XXX" statement. I still have in mind my stupid choice
of the "stats" keyword where we have lots of default settings that
are all used once we enable any of them, making it difficult to
change them once set in a defaults instance. For instance, if you
put "stats uri /stats" in a defaults intance, you have stats enabled
everywhere.

I don't think the same issue could happen here because from my
understanding we still have to enable the behaviour by using
"observe XXX". But I'd like that you take some time to check for
a few corner cases we could fall into in case we've overseen
something. For instance, when supporting settings in defaults
instance, we don't want a simple default setting of the number
of consecutive errors to enable response checking for all servers
of all backends. We just want this to set the default value.

Maybe I'm not very clear, please tell me if so.

The second point is that I've got reports of people checking
very large numbers of servers (say 500) at a very high rate
(10-30 ms) because they prefer to fail a request quickly than
to send a reqeust to a server which might return bad contents.
(For this reason, I think your patch will interest them :-))
But those people are generally experiencing high CPU usages
because of the checks alone. I'm realizing that from the
start, I've never considered the checks to be on the critical
path. In such situations it can really be a critical path, and
we must keep in mind that we should keep processing low. I've
not seen anything expensive in your patch, but it's just a
reminder to help you make choices in the future if required.

Feel free to tell me when you'd like your patch to be merged.
In my opinion it's already in good shape.

Regards,
Willy




[PATCH] [RFC] Decrease server health based on http responses / events, version 2

2009-12-11 Thread Krzysztof Piotr Oledzki
Subject: [RFC] Decrease server health based on http responses / events, version 
2

This RFC quality patch implements decreasing server health based on
observing communication between HAProxy and servers.

Changes in this version:
 - documentation
 - close race between a started check and health analysis event
 - don't force fastinter if it is not set
 - better names for options
 - layer4 support

Possibility to set/change default options is going to be implemented in
a different patch.

diff --git a/doc/configuration.txt b/doc/configuration.txt
index ef768f8..bdb5c78 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -4642,6 +4642,13 @@ cookie 
   the same cookie value, and it is in fact somewhat common between normal and
   backup servers. See also the "cookie" keyword in backend section.
 
+error_limit 
+  If health observing is enabled, the "error_limit" parameter specifies the 
number
+  of consecutive errors that triggers event selected by the "on-error" option.
+  By default it is set to 10 consecutive errors.
+
+ See also the "check", "error_limit" and "on-error".
+
 fall 
   The "fall" parameter states that a server will be considered as dead after
consecutive unsuccessful health checks. This value defaults to 3 if
@@ -4707,9 +4714,31 @@ minconn 
   the ramp between both values when the backend has less than 
   concurrent connections. This makes it possible to limit the load on the
   server during normal loads, but push it further for important loads without
-  overloading the server during exceptionnal loads. See also the "maxconn"
+  overloading the server during exceptional loads. See also the "maxconn"
   and "maxqueue" parameters, as well as the "fullconn" backend keyword.
-  
+
+observe 
+  This option enables health adjusting based on observing communication with
+  the server. By default this functionality is disabled and enabling it also
+  requires to enable healt checks. There are two supported modes: "layer4" and
+  "layer7". In layer4 mode, only successful/unsuccessful tcp connections are
+  significant. In layer7, which is only allowed for http proxies, responses
+  received from server are verified, like valid/wrong http code, unparsable
+  headers, a timeout, etc.
+
+  See also the "check", "on-error" and "error_limit".
+
+on-error 
+  Select what should happen when enough consecutive errors are detected.
+  Currently, four modes are available:
+  - fastinter: force fastinter
+  - fail-check: simulate a failed check, also forces fastinter (default)
+  - sudden-death: simulate a pre-fatal failed health check, one more failed
+check will mark a server down, forces fastinter
+  - mark-down: mark the server immediately down and force fastinter
+
+  See also the "check", "observe" and "error_limit".
+
 port 
   Using the "port" parameter, it becomes possible to use a different port to
   send health-checks. On some servers, it may be desirable to dedicate a port
diff --git a/include/common/defaults.h b/include/common/defaults.h
index b0aee86..cada729 100644
--- a/include/common/defaults.h
+++ b/include/common/defaults.h
@@ -120,6 +120,9 @@
 #define DEF_CHECK_REQ   "OPTIONS / HTTP/1.0\r\n\r\n"
 #define DEF_SMTP_CHECK_REQ   "HELO localhost\r\n"
 
+#define DEF_HANA_ONERR HANA_ONERR_FAILCHK
+#define DEF_HANA_ERRLIMIT  10
+
 // X-Forwarded-For header default
 #define DEF_XFORWARDFOR_HDR"X-Forwarded-For"
 
diff --git a/include/proto/checks.h b/include/proto/checks.h
index bd70164..e3fe7ac 100644
--- a/include/proto/checks.h
+++ b/include/proto/checks.h
@@ -29,6 +29,7 @@ const char *get_check_status_description(short check_status);
 const char *get_check_status_info(short check_status);
 struct task *process_chk(struct task *t);
 int start_checks();
+void health_adjust(struct server *s, short status);
 
 #endif /* _PROTO_CHECKS_H */
 
diff --git a/include/types/checks.h b/include/types/checks.h
index 1b04608..ac784dd 100644
--- a/include/types/checks.h
+++ b/include/types/checks.h
@@ -18,6 +18,9 @@ enum {
 
/* Below we have finished checks */
HCHK_STATUS_CHECKED,/* DUMMY STATUS */
+
+   HCHK_STATUS_HANA,   /* Healt analyze detected enough 
consecutive errors */
+
HCHK_STATUS_SOCKERR,/* Socket error */
 
HCHK_STATUS_L4OK,   /* L4 check passed, for example tcp 
connect */
@@ -41,8 +44,51 @@ enum {
HCHK_STATUS_SIZE
 };
 
+
+/* healt status for response tracking */
+enum {
+   HANA_STATUS_UNKNOWN = 0,
+
+   HANA_STATUS_L4_OK,  /* L4 successful connection */
+   HANA_STATUS_L4_ERR, /* L4 unsuccessful connection */
+
+   HANA_STATUS_HTTP_OK,/* Correct http response */
+   HANA_STATUS_HTTP_STS,   /* Wrong http response, for example 
HTTP 5xx */
+   HANA_STATUS_HTTP_HDRRSP,/* Invalid http response (headers) */
+   HANA_STATUS_HTTP_RSP,   /* Invalid http response */
+
+   HANA_

Re: [PATCH] [RFC] Decrease server health based on http responses / events

2009-12-10 Thread Willy Tarreau
On Thu, Dec 10, 2009 at 07:17:30PM +0100, Krzysztof Ol?dzki wrote:
(...)
> >"events" by itself might not be the proper name then. For instance, a
> >timeout is precisely a lack of event.
> 
> It's a matter of definition, like if 0 is or is not a natural number. 
> For me timeout *is* one of events, but no problem - we can choose a name 
> that is intuitive for everyone.
> 
> >Maybe simply "errors" ? The
> >other ones are not errors, just plain valid status codes after all.
> 
> OK, but we are not only going to observing errors, but also succesfull 
> connections to clear error counter. So maybe simple "layer4" (for tcp) 
> and "layer7" (for http) might be OK?

If you see various uses of these levels, then yes, I agree.

> >yes, precisely. Another advantage would be that we could also
> >allow the statement on regular backend config (even defaults)
> >when it's supposed to be the same for all servers. It would
> >then be handled just like the "source" keyword : per-server,
> >then per-backend.
> 
> Right. A backend/default value is important. However, I think we:
>  - must use the same syntax for both backend's and server's configuration,
>  - don't need to require to always specify all parameters
>  - should keep current type of names: "type value" (not "to", "by", 
> "after".
> 
> So, I'll be more happy with "on-error fail-check" than with "by 
> fail-check", etc.

OK, that's fine by me. I trust you to find an intuitive language :-)

Best regards,
Willy




Re: [PATCH] [RFC] Decrease server health based on http responses / events

2009-12-10 Thread Krzysztof Olędzki

On 2009-12-10 08:39, Willy Tarreau wrote:

Hi Krzysztof,

Hi Willy,


(...)
What are those "events" supposed to check for ? I've not found them 
anywhere else.
Ineed. It is for pure TCP, where we can only track timeouts, resets, 
etc. However, I'm looking for a good place where to attach those checks, 
and for a better name - maybe "l3events"?


"events" by itself might not be the proper name then. For instance, a
timeout is precisely a lack of event.


It's a matter of definition, like if 0 is or is not a natural number. 
For me timeout *is* one of events, but no problem - we can choose a name 
that is intuitive for everyone.



Maybe simply "errors" ? The
other ones are not errors, just plain valid status codes after all.


OK, but we are not only going to observing errors, but also succesfull 
connections to clear error counter. So maybe simple "layer4" (for tcp) 
and "layer7" (for http) might be OK?



I mean, maybe we could have sort of an error-react prefix
with its few parameters afterwards. Maybe something in that spirit :

  error-react to  by  after 

It's just an idea, not necessarily something to follow.

Something like "error-react to http-response by failcheck after 10"?


yes, precisely. Another advantage would be that we could also
allow the statement on regular backend config (even defaults)
when it's supposed to be the same for all servers. It would
then be handled just like the "source" keyword : per-server,
then per-backend.


Right. A backend/default value is important. However, I think we:
 - must use the same syntax for both backend's and server's configuration,
 - don't need to require to always specify all parameters
 - should keep current type of names: "type value" (not "to", "by", 
"after".


So, I'll be more happy with "on-error fail-check" than with "by 
fail-check", etc.




Does this mean that we're now forced to at least switch to fast
inter in case of error or can we still use the current behaviour ?

Yes, by simply not enabling the functionality. By default it is not enabled.


OK that's fine. You know how I'm attached to keep backwards
compatibility :-)


Sure thing. ;)


Best regards,

Krzysztof Olędzki



Re: [PATCH] [RFC] Decrease server health based on http responses / events

2009-12-09 Thread Willy Tarreau
Hi Krzysztof,

On Wed, Dec 09, 2009 at 01:23:57PM +0100, Krzysztof Ol?dzki wrote:
> There are four modes:
> 
>  - fastinter: force fastinter
> 
>  - failchk: simlate a failed check -> force fastinter

 OK so that one should decrease the health each time we get a
 series of errors, that's it ?
> 
>  - suddth (sudden death): simulate a pre-fatal failed health check, one
> more failed check will marke a server down
> 
>  - markdwn: mark a server down, immediately

OK, I think that covers a wide range of usage patterns.

(...)
> >What are those "events" supposed to check for ? I've not found them 
> >anywhere else.
> 
> Ineed. It is for pure TCP, where we can only track timeouts, resets, 
> etc. However, I'm looking for a good place where to attach those checks, 
> and for a better name - maybe "l3events"?

"events" by itself might not be the proper name then. For instance, a
timeout is precisely a lack of event. Maybe simply "errors" ? The
other ones are not errors, just plain valid status codes after all.

(...)
> >I'm just wondering about those options. Are we supposed to use only some 
> >of them
> >without the other ones ?
> 
> Yes, because each option has its default value, so both:
>  observe http-response onerror failcheck errors-limit 10
> and
>  observe http-response
> are identical - after 10 consecutive errors haproxy simulates a filed check.

OK.

> >I mean, maybe we could have sort of an error-react prefix
> >with its few parameters afterwards. Maybe something in that spirit :
> >
> >   error-react to  by  after 
> > 
> >It's just an idea, not necessarily something to follow.
> 
> Something like "error-react to http-response by failcheck after 10"?

yes, precisely. Another advantage would be that we could also
allow the statement on regular backend config (even defaults)
when it's supposed to be the same for all servers. It would
then be handled just like the "source" keyword : per-server,
then per-backend.

(...)
> >Are these three isolated changes on purpose, are they a mistake, or are 
> >they a
> >fix for something we want to backport or at least merge separately ? I'm 
> >asking
> >because at first glance it seems imbalanced with other changes. There is a
> >fourth one further about check_duration.
> 
> These three isolated changes are on purpose, bacuase now we can call 
> server_status_printf() when we simulate a failed halth check. Before 
> that we used to first start a check and set s->check_start and s->result 
> and then we called server_status_printf(), but it is no longer true.

OK.

(...)
> >Does this mean that we're now forced to at least switch to fast
> >inter in case of error or can we still use the current behaviour ?
> 
> Yes, by simply not enabling the functionality. By default it is not enabled.

OK that's fine. You know how I'm attached to keep backwards
compatibility :-)

Regards,
Willy




Re: [PATCH] [RFC] Decrease server health based on http responses / events

2009-12-09 Thread Krzysztof Olędzki

On 2009-12-08 22:37, Willy Tarreau wrote:

Hi Krzysztof,

Hi Willy,


it was fortunate that you reminded me about this mail because
I had lost it in the middle of a few others.


No problem. Like I told you in the private mail - it is doubtful I would 
have been able to get to it earlier, anyway.



On Sun, Oct 25, 2009 at 01:35:35AM +0200, Krzysztof Piotr Oledzki wrote:

Subject: [RFC] Decrease server health based on http responses / events

This RFC quality patch implements decreasing server health based on
observing communication between HAProxy and servers.

I have had a working patch for this for a long time, however I needed to
rewrite nearly everything to remove hardcoded values, add more modes and
to port it into 1.4. So after the rework there is nearly nothing left from
the old code. :| In the current status the code is expected to work but it
definitely needs more testing.


OK.


BTW: I'm not very happy with names of both functions and parameters,
If you have a better idea please don't hesitate to propose it. ;)


well, first, s/halth/health/g :-)


Eh. ;) I feel ashamed.


TODO: documentation, comments, pure tcp support.


Indeed, comments at least on the enums and struct members would make
the review *much* easier. Even small abbreviated ones :-/


I assumed that everithing should be clear but it seems it is true only 
for the author of this code. :| Sorry for that.



diff --git a/include/common/defaults.h b/include/common/defaults.h
index b0aee86..ae2f65c 100644
--- a/include/common/defaults.h
+++ b/include/common/defaults.h
@@ -120,6 +120,9 @@
 #define DEF_CHECK_REQ   "OPTIONS / HTTP/1.0\r\n\r\n"
 #define DEF_SMTP_CHECK_REQ   "HELO localhost\r\n"
 
+#define DEF_HANA_ONERR	HANA_ONERR_FAILCHK

+#define DEF_CELIMIT10
+
 // X-Forwarded-For header default
 #define DEF_XFORWARDFOR_HDR"X-Forwarded-For"
 
diff --git a/include/proto/checks.h b/include/proto/checks.h

index bd70164..2d16976 100644
--- a/include/proto/checks.h
+++ b/include/proto/checks.h
@@ -29,6 +29,7 @@ const char *get_check_status_description(short check_status);
 const char *get_check_status_info(short check_status);
 struct task *process_chk(struct task *t);
 int start_checks();
+void halth_analyze(struct server *s, short status);
 
 #endif /* _PROTO_CHECKS_H */
 
diff --git a/include/types/checks.h b/include/types/checks.h

index 1b04608..3690aa5 100644
--- a/include/types/checks.h
+++ b/include/types/checks.h
@@ -18,6 +18,9 @@ enum {
 
 	/* Below we have finished checks */

HCHK_STATUS_CHECKED,/* DUMMY STATUS */
+
+   HCHK_STATUS_HANA,   /* Detected enough consecutive errors */
+
HCHK_STATUS_SOCKERR,/* Socket error */
 
 	HCHK_STATUS_L4OK,		/* L4 check passed, for example tcp connect */

@@ -41,6 +44,39 @@ enum {
HCHK_STATUS_SIZE
 };
 
+enum {

+   HANA_UNKNOWN= 0,
+
+   HANA_TCP_OK,
+
+   HANA_HTTP_OK,
+   HANA_HTTP_STS,
+   HANA_HTTP_HDRRSP,
+   HANA_HTTP_RSP,
+
+   HANA_READ_ERROR,
+   HANA_READ_TIMEOUT,
+   HANA_BROKEN_PIPE,
+
+   HANA_SIZE
+};
+
+enum {
+   HANA_ONERR_UNKNOWN  = 0,
+
+   HANA_ONERR_FASTINTER,
+   HANA_ONERR_FAILCHK,
+   HANA_ONERR_SUDDTH,
+   HANA_ONERR_MARKDWN,


After reading the code, I'm not sure I got the difference between
the last two modes (sudden death and mark down).


There are four modes:

 - fastinter: force fastinter

 - failchk: simlate a failed check -> force fastinter

 - suddth (sudden death): simulate a pre-fatal failed health check, one
more failed check will marke a server down

 - markdwn: mark a server down, immediately


+};
+
+enum {
+   HANA_OBS_NONE   = 0,
+
+   HANA_OBS_EVENTS,
+   HANA_OBS_HTTP_RSPS,
+};
+
 struct check_status {
short result;   /* one of SRV_CHK_* */
char *info; /* human readable short info */
diff --git a/include/types/server.h b/include/types/server.h
index b3fe83d..b163190 100644
--- a/include/types/server.h
+++ b/include/types/server.h
@@ -115,7 +115,10 @@ struct server {
struct sockaddr_in check_addr;  /* the address to check, if different 
from  */
short check_port;   /* the port to use for the 
health checks */
int health; /* 0->rise-1 = bad; 
rise->rise+fall-1 = good */
+   int consecutive_errors; /* */
int rise, fall; /* time in iterations */
+   int consecutive_errors_limit;   /* */
+   short observe, onerror; /* */
int inter, fastinter, downinter;/* checks: time in milliseconds 
*/
int slowstart;  /* slowstart time in seconds 
(ms in the conf) */
int result; /* health-check result : 
SRV_CHK_* */
@@ -137,7 +140,7 @@ struct server {
unsigned down_time; /* total time the 

Re: [PATCH] [RFC] Decrease server health based on http responses / events

2009-12-08 Thread Willy Tarreau
Hi Krzysztof,

it was fortunate that you reminded me about this mail because
I had lost it in the middle of a few others.

On Sun, Oct 25, 2009 at 01:35:35AM +0200, Krzysztof Piotr Oledzki wrote:
> Subject: [RFC] Decrease server health based on http responses / events
> 
> This RFC quality patch implements decreasing server health based on
> observing communication between HAProxy and servers.
> 
> I have had a working patch for this for a long time, however I needed to
> rewrite nearly everything to remove hardcoded values, add more modes and
> to port it into 1.4. So after the rework there is nearly nothing left from
> the old code. :| In the current status the code is expected to work but it
> definitely needs more testing.

OK.

> BTW: I'm not very happy with names of both functions and parameters,
> If you have a better idea please don't hesitate to propose it. ;)

well, first, s/halth/health/g :-)

> TODO: documentation, comments, pure tcp support.

Indeed, comments at least on the enums and struct members would make
the review *much* easier. Even small abbreviated ones :-/

> diff --git a/include/common/defaults.h b/include/common/defaults.h
> index b0aee86..ae2f65c 100644
> --- a/include/common/defaults.h
> +++ b/include/common/defaults.h
> @@ -120,6 +120,9 @@
>  #define DEF_CHECK_REQ   "OPTIONS / HTTP/1.0\r\n\r\n"
>  #define DEF_SMTP_CHECK_REQ   "HELO localhost\r\n"
>  
> +#define DEF_HANA_ONERR   HANA_ONERR_FAILCHK
> +#define DEF_CELIMIT  10
> +
>  // X-Forwarded-For header default
>  #define DEF_XFORWARDFOR_HDR  "X-Forwarded-For"
>  
> diff --git a/include/proto/checks.h b/include/proto/checks.h
> index bd70164..2d16976 100644
> --- a/include/proto/checks.h
> +++ b/include/proto/checks.h
> @@ -29,6 +29,7 @@ const char *get_check_status_description(short 
> check_status);
>  const char *get_check_status_info(short check_status);
>  struct task *process_chk(struct task *t);
>  int start_checks();
> +void halth_analyze(struct server *s, short status);
>  
>  #endif /* _PROTO_CHECKS_H */
>  
> diff --git a/include/types/checks.h b/include/types/checks.h
> index 1b04608..3690aa5 100644
> --- a/include/types/checks.h
> +++ b/include/types/checks.h
> @@ -18,6 +18,9 @@ enum {
>  
>   /* Below we have finished checks */
>   HCHK_STATUS_CHECKED,/* DUMMY STATUS */
> +
> + HCHK_STATUS_HANA,   /* Detected enough consecutive errors */
> +
>   HCHK_STATUS_SOCKERR,/* Socket error */
>  
>   HCHK_STATUS_L4OK,   /* L4 check passed, for example tcp 
> connect */
> @@ -41,6 +44,39 @@ enum {
>   HCHK_STATUS_SIZE
>  };
>  
> +enum {
> + HANA_UNKNOWN= 0,
> +
> + HANA_TCP_OK,
> +
> + HANA_HTTP_OK,
> + HANA_HTTP_STS,
> + HANA_HTTP_HDRRSP,
> + HANA_HTTP_RSP,
> +
> + HANA_READ_ERROR,
> + HANA_READ_TIMEOUT,
> + HANA_BROKEN_PIPE,
> +
> + HANA_SIZE
> +};
> +
> +enum {
> + HANA_ONERR_UNKNOWN  = 0,
> +
> + HANA_ONERR_FASTINTER,
> + HANA_ONERR_FAILCHK,
> + HANA_ONERR_SUDDTH,
> + HANA_ONERR_MARKDWN,

After reading the code, I'm not sure I got the difference between
the last two modes (sudden death and mark down).

> +};
> +
> +enum {
> + HANA_OBS_NONE   = 0,
> +
> + HANA_OBS_EVENTS,
> + HANA_OBS_HTTP_RSPS,
> +};
> +
>  struct check_status {
>   short result;   /* one of SRV_CHK_* */
>   char *info; /* human readable short info */
> diff --git a/include/types/server.h b/include/types/server.h
> index b3fe83d..b163190 100644
> --- a/include/types/server.h
> +++ b/include/types/server.h
> @@ -115,7 +115,10 @@ struct server {
>   struct sockaddr_in check_addr;  /* the address to check, if 
> different from  */
>   short check_port;   /* the port to use for the 
> health checks */
>   int health; /* 0->rise-1 = bad; 
> rise->rise+fall-1 = good */
> + int consecutive_errors; /* */
>   int rise, fall; /* time in iterations */
> + int consecutive_errors_limit;   /* */
> + short observe, onerror; /* */
>   int inter, fastinter, downinter;/* checks: time in milliseconds 
> */
>   int slowstart;  /* slowstart time in seconds 
> (ms in the conf) */
>   int result; /* health-check result : 
> SRV_CHK_* */
> @@ -137,7 +140,7 @@ struct server {
>   unsigned down_time; /* total time the server was 
> down */
>   time_t last_change; /* last time, when the state 
> was changed */
>   struct timeval check_start; /* last health check start time 
> */
> - unsigned long check_duration;   /* time in ms took to finish 
> last health check */
> + long check_duration;/* time in ms took to finish 
> last health check */
>   short che

[PATCH] [RFC] Decrease server health based on http responses / events

2009-10-24 Thread Krzysztof Piotr Oledzki
Subject: [RFC] Decrease server health based on http responses / events

This RFC quality patch implements decreasing server health based on
observing communication between HAProxy and servers.

I have had a working patch for this for a long time, however I needed to
rewrite nearly everything to remove hardcoded values, add more modes and
to port it into 1.4. So after the rework there is nearly nothing left from
the old code. :| In the current status the code is expected to work but it
definitely needs more testing.

BTW: I'm not very happy with names of both functions and parameters,
If you have a better idea please don't hesitate to propose it. ;)

TODO: documentation, comments, pure tcp support.

diff --git a/include/common/defaults.h b/include/common/defaults.h
index b0aee86..ae2f65c 100644
--- a/include/common/defaults.h
+++ b/include/common/defaults.h
@@ -120,6 +120,9 @@
 #define DEF_CHECK_REQ   "OPTIONS / HTTP/1.0\r\n\r\n"
 #define DEF_SMTP_CHECK_REQ   "HELO localhost\r\n"
 
+#define DEF_HANA_ONERR HANA_ONERR_FAILCHK
+#define DEF_CELIMIT10
+
 // X-Forwarded-For header default
 #define DEF_XFORWARDFOR_HDR"X-Forwarded-For"
 
diff --git a/include/proto/checks.h b/include/proto/checks.h
index bd70164..2d16976 100644
--- a/include/proto/checks.h
+++ b/include/proto/checks.h
@@ -29,6 +29,7 @@ const char *get_check_status_description(short check_status);
 const char *get_check_status_info(short check_status);
 struct task *process_chk(struct task *t);
 int start_checks();
+void halth_analyze(struct server *s, short status);
 
 #endif /* _PROTO_CHECKS_H */
 
diff --git a/include/types/checks.h b/include/types/checks.h
index 1b04608..3690aa5 100644
--- a/include/types/checks.h
+++ b/include/types/checks.h
@@ -18,6 +18,9 @@ enum {
 
/* Below we have finished checks */
HCHK_STATUS_CHECKED,/* DUMMY STATUS */
+
+   HCHK_STATUS_HANA,   /* Detected enough consecutive errors */
+
HCHK_STATUS_SOCKERR,/* Socket error */
 
HCHK_STATUS_L4OK,   /* L4 check passed, for example tcp 
connect */
@@ -41,6 +44,39 @@ enum {
HCHK_STATUS_SIZE
 };
 
+enum {
+   HANA_UNKNOWN= 0,
+
+   HANA_TCP_OK,
+
+   HANA_HTTP_OK,
+   HANA_HTTP_STS,
+   HANA_HTTP_HDRRSP,
+   HANA_HTTP_RSP,
+
+   HANA_READ_ERROR,
+   HANA_READ_TIMEOUT,
+   HANA_BROKEN_PIPE,
+
+   HANA_SIZE
+};
+
+enum {
+   HANA_ONERR_UNKNOWN  = 0,
+
+   HANA_ONERR_FASTINTER,
+   HANA_ONERR_FAILCHK,
+   HANA_ONERR_SUDDTH,
+   HANA_ONERR_MARKDWN,
+};
+
+enum {
+   HANA_OBS_NONE   = 0,
+
+   HANA_OBS_EVENTS,
+   HANA_OBS_HTTP_RSPS,
+};
+
 struct check_status {
short result;   /* one of SRV_CHK_* */
char *info; /* human readable short info */
diff --git a/include/types/server.h b/include/types/server.h
index b3fe83d..b163190 100644
--- a/include/types/server.h
+++ b/include/types/server.h
@@ -115,7 +115,10 @@ struct server {
struct sockaddr_in check_addr;  /* the address to check, if 
different from  */
short check_port;   /* the port to use for the 
health checks */
int health; /* 0->rise-1 = bad; 
rise->rise+fall-1 = good */
+   int consecutive_errors; /* */
int rise, fall; /* time in iterations */
+   int consecutive_errors_limit;   /* */
+   short observe, onerror; /* */
int inter, fastinter, downinter;/* checks: time in milliseconds 
*/
int slowstart;  /* slowstart time in seconds 
(ms in the conf) */
int result; /* health-check result : 
SRV_CHK_* */
@@ -137,7 +140,7 @@ struct server {
unsigned down_time; /* total time the server was 
down */
time_t last_change; /* last time, when the state 
was changed */
struct timeval check_start; /* last health check start time 
*/
-   unsigned long check_duration;   /* time in ms took to finish 
last health check */
+   long check_duration;/* time in ms took to finish 
last health check */
short check_status, check_code; /* check result, check code */
char check_desc[HCHK_DESC_LEN]; /* healt check descritpion */
 
diff --git a/src/cfgparse.c b/src/cfgparse.c
index 428d7b9..889fc74 100644
--- a/src/cfgparse.c
+++ b/src/cfgparse.c
@@ -2541,6 +2541,8 @@ int cfg_parse_listen(const char *file, int linenum, char 
**args, int kwm)
newsrv->uweight = newsrv->iweight = 1;
newsrv->maxqueue = 0;
newsrv->slowstart = 0;
+   newsrv->onerror = DEF_HANA_ONERR;
+   newsrv->consecutive_errors_limit = DEF_CELIMIT;
 
cur_arg = 3;
while (