Hi Andrew,

On Wed, Sep 14, 2016 at 02:44:26PM -0400, Andrew Rodland wrote:
> On Sunday, September 11, 2016 7:57:41 PM EDT Willy Tarreau wrote:
> > > > Also I've been thinking about this issue of the infinite loop that you
> > > > solved already. As long as c > 1 I don't think it can happen at all,
> > > > because for any server having a load strictly greater than the average
> > > > load, it means there exists at least one server with a load smaller than
> > > > or equal to the average. Otherwise it means there's no more server in
> > > > the ring because all servers are down, and then the initial lookup will
> > > > simply return NULL. Maybe there's an issue with the current lookup
> > > > method, we'll have to study this.
> > > 
> > > Agreed again, it should be impossible as long as c > 1, but I ran into it.
> > > I assumed it was some problem or misunderstanding in my code.
> > 
> > Don't worry I trust you, I was trying to figure what exact case could
> > cause this and couldn't find a single possible case :-/
> 
> I've encountered this again in my re-written branch. I think it has to do
> with the case where all servers are draining for shutdown. What I see is
> that whenever I do a restart (haproxy -sf oldpid) under load, the new
> process starts up, but the old process never exits, and perf shows it using
> 100% CPU in chash_server_is_eligible, so it's got to be looping and deciding
> nothing is eligible. Can you think of anything special that needs to be done
> to handle graceful shutdown?

No, that's very strange. We may have a bug somewhere else which has never
struck until now. When you talk about a shutdown, you in fact mean the
shutdown of the haproxy process being replaced by another one, is that
right? If so, health checks are disabled during that period, so servers
should be neither added to nor removed from the ring.

However, if for any reason there's a graceful shutdown on the servers,
their weight can be set to zero while they're still active. In this
case they don't appear in the tree, and that may be where the issue
starts. It would be nice to get a 100% reproducible case so we can debug
it and dump all weights and capacities. I think that would help.
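
For the dump, something along these lines could be a starting point. It is
only a rough sketch, not the actual haproxy code: struct srv, totals(),
capacity(), pick() and dump() are made-up names, the factor c is passed in
percent (150 means c = 1.5), and the capacity formula is an assumption about
how a bounded-load cap might be computed. It also limits the ring walk to
one lap, so that a case where nothing is eligible returns NULL instead of
spinning:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical model of a server on the ring; not haproxy's struct server. */
struct srv {
    const char *name;
    unsigned weight;   /* effective weight, 0 while draining */
    unsigned load;     /* connections currently served */
};

/* Sum loads and weights. The load total counts the request being placed. */
void totals(const struct srv *ring, size_t n, unsigned *load, unsigned *weight)
{
    *load = 1;
    *weight = 0;
    for (size_t i = 0; i < n; i++) {
        *load += ring[i].load;
        *weight += ring[i].weight;
    }
}

/* Assumed cap: ceil(c * total_load * weight / total_weight).
 * A zero total weight gives a capacity of 0 for everyone. */
unsigned capacity(const struct srv *s, unsigned total_load,
                  unsigned total_weight, unsigned c_pct)
{
    if (total_weight == 0)
        return 0;
    unsigned long long num = (unsigned long long)c_pct * total_load * s->weight;
    unsigned long long den = 100ULL * total_weight;
    return (unsigned)((num + den - 1) / den);
}

/* Walk the ring from 'start' for at most one full lap. Returns the first
 * server whose load is below its capacity, or NULL if none is eligible. */
const struct srv *pick(const struct srv *ring, size_t n, size_t start,
                       unsigned c_pct)
{
    unsigned tl, tw;

    totals(ring, n, &tl, &tw);
    for (size_t step = 0; step < n; step++) {
        const struct srv *s = &ring[(start + step) % n];
        if (s->weight && s->load < capacity(s, tl, tw, c_pct))
            return s;
    }
    return NULL;
}

/* Print weight, load and capacity for every server to stderr. */
void dump(const struct srv *ring, size_t n, unsigned c_pct)
{
    unsigned tl, tw;

    totals(ring, n, &tl, &tw);
    for (size_t i = 0; i < n; i++)
        fprintf(stderr, "%s weight=%u load=%u cap=%u\n", ring[i].name,
                ring[i].weight, ring[i].load,
                capacity(&ring[i], tl, tw, c_pct));
}
```

With all weights at zero, every capacity in this model is 0 and pick()
returns NULL after one lap. Calling dump() from the spinning process would
show whether the real case looks like that or like something else.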

Willy
