On Tue, Oct 07, 2014 at 03:18:28PM +0200, Cornelius Riemenschneider wrote:
> Hello,
> 
> I investigated the issue a bit further.
> 
> We use both health checks and agent-checks: the health check reports the
> usual up/down/connection failed, and the agent-check provides us with a
> dynamic weight.

OK.
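
For readers unfamiliar with the dynamic-weight side, here is a minimal
sketch of what an agent could send back on each agent-check connection.
It assumes the agent replies with a single ASCII line holding a percentage,
which haproxy applies relative to the server's configured weight. The
function name and the `load` / `max_load` metrics are hypothetical and not
taken from the setup discussed in this thread.

```python
def agent_reply(load: float, max_load: float = 1.0) -> str:
    """Build a one-line agent-check reply carrying a relative weight.

    `load` and `max_load` are hypothetical application metrics; the
    returned string is a percentage followed by a newline, e.g. "75%\n".
    """
    if max_load <= 0:
        raise ValueError("max_load must be positive")
    # Clamp the load ratio to [0, 1] so the percentage stays within 0..100.
    ratio = min(max(load / max_load, 0.0), 1.0)
    # The busier the server, the lower the advertised weight.
    percent = round((1.0 - ratio) * 100)
    return f"{percent}%\n"
```

With this sketch, a fully loaded server would advertise "0%", which is the
weight-0 case discussed further down for the stats page.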

> We enter the DOWN (agent) state when our (Java-based) server goes into a
> long garbage collection, which stalls the server for anywhere from 40 s to
> 5 min. This is a bug in itself, but it has been dealt with.

OK.

> This causes the JVM to not answer connection requests (how exactly, I do
> not know), and during that time both the health check and the agent check
> fail because they cannot connect to the server (which is expected).

That can happen when the SYN queue of the socket fills up.

> The issue now seems to be that the down state is somehow not reset after
> the health check comes back up: the LastChk column says "Checked", and by
> manual verification the health check is back to 200, but I think haproxy
> might be stuck in the agent-down state and expect an "up" from the agent,
> which will never come because the agent did not cause the down state
> initially.
> 
> Could that be a possibility?

Sure it's a possibility. Both check methods are stepping on each other's
toes, so even after careful checks, such bugs are very possible. If you
find a reliable way to reproduce it, that could significantly help! Another
option is to look through the code and check the conditions to get up and
down, but that could be more complicated since there are still a number...
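
To make the suspected interaction concrete, here is a toy model of the
hypothesis from the quoted message. It is not haproxy's actual code; the
class, its fields and the transition rules are all assumptions made for
illustration. The flaw assumed here is that a failed agent connection sets
the same sticky flag as an explicit "down" reply, and that only an explicit
"up" clears it.

```python
from dataclasses import dataclass


@dataclass
class ServerState:
    """Hypothetical model of a server tracked by two check sources."""

    health_ok: bool = True    # result of the last health check
    agent_down: bool = False  # sticky flag owned by the agent side

    def on_health(self, ok: bool) -> None:
        self.health_ok = ok

    def on_agent_failure(self) -> None:
        # Assumed flaw: a connection failure is handled like a "down" reply.
        self.agent_down = True

    def on_agent_reply(self, reply: str) -> None:
        word = reply.strip().lower()
        if word == "down":
            self.agent_down = True
        elif word == "up":
            self.agent_down = False
        # Weight replies such as "75%" leave the flag untouched.

    @property
    def usable(self) -> bool:
        return self.health_ok and not self.agent_down


# Replay the scenario: both checks fail during the GC pause, then recover.
state = ServerState()
state.on_health(False)
state.on_agent_failure()
state.on_health(True)          # health check is back to 200
state.on_agent_reply("50%\n")  # agent only ever reports a weight
stuck = not state.usable       # True in this model: nothing cleared the flag
```

In this model the server only becomes usable again after an explicit "up"
reply, which matches the symptom described above. A reliable reproducer
would be needed to tell whether the real code behaves the same way.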

> We also do not see the issue with 1.5-dev22, which has been stable for us for
> some months.

We had a lot of exceptions everywhere in the code to deal with the interaction
between agent and checks, and some were later found to still be missing. The
changes in dev25 (or so) were made precisely to address these difficulties.

> > Do you think it would be enough if we add in the doc that the stats page 
> > also reports weight 0 as "DRAIN" ?
> 
> Yep, that sounds good :)

Great, will do so.

Thanks,
Willy

