On Thursday, September 15, 2016 4:06:15 AM EDT Willy Tarreau wrote:
> Hi Andrew,
> On Wed, Sep 14, 2016 at 02:44:26PM -0400, Andrew Rodland wrote:
> > On Sunday, September 11, 2016 7:57:41 PM EDT Willy Tarreau wrote:
> > > > > Also I've been thinking about this issue of the infinite loop
> > > > > that you solved already. As long as c > 1 I don't think it can
> > > > > happen at all, because for any server having a load strictly
> > > > > greater than the average load, it means there exists at least
> > > > > one server with a load smaller than or equal to the average.
> > > > > Otherwise it means there's no more server in the ring because
> > > > > all servers are down, and then the initial lookup will simply
> > > > > return NULL. Maybe there's an issue with the current lookup
> > > > > method, we'll have to study this.
> > > >
> > > > Agreed again, it should be impossible as long as c > 1, but I ran
> > > > into it. I assumed it was some problem or misunderstanding in my
> > > > code.
> > >
> > > Don't worry I trust you, I was trying to figure what exact case could
> > > cause this and couldn't find a single possible case :-/
> >
> > I've encountered this again in my re-written branch. I think it has to do
> > with the case where all servers are draining for shutdown. What I see is
> > that whenever I do a restart (haproxy -sf oldpid) under load, the new
> > process starts up, but the old process never exits, and perf shows it
> > using 100% CPU in chash_server_is_eligible, so it's got to be looping and
> > deciding nothing is eligible. Can you think of anything special that
> > needs to be done to handle graceful shutdown?
>
> No, that's very strange. We may have a bug somewhere else which never
> struck until now. When you talk about a shutdown, you in fact mean the
> shutdown of the haproxy process being replaced by another one, is that
> right? If so, health checks are disabled during that period so servers
> should not be added to nor removed from the ring.
>
> However if for any reason there's a graceful shutdown on the servers,
> their weight can be set to zero while they're still active. In this
> case they don't appear in the tree and that may be where the issue
> starts. It would be nice to get a 100% reproducible case to try to
> debug it and dump all weights and capacities, I think it would help.
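
For reference, the argument quoted above (with c > 1 at least one server in the ring is always at or below the average load, so the lookup must terminate) can be sketched roughly as follows. This is only an illustration and not the code from the branch: the struct, the function names, the `c_pct` parameter (c expressed as a percentage) and the `in_ring` flag are all made up for the example, weights are assumed equal, and the step-bounded walk is just one possible way to guarantee termination when nothing is selectable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical server record for this sketch; not HAProxy's struct server. */
struct sketch_server {
    unsigned load;   /* connections currently assigned */
    int in_ring;     /* 0 if the server is not selectable (e.g. weight 0) */
};

/* A server is eligible if one more connection keeps it within
 * ceil(c * (total_load + 1) / n_in_ring), with c = c_pct / 100. */
static int sketch_is_eligible(const struct sketch_server *s,
                              unsigned total_load, unsigned n_in_ring,
                              unsigned c_pct)
{
    unsigned long long num, den, cap;

    if (!s->in_ring || n_in_ring == 0)
        return 0;
    num = (unsigned long long)c_pct * ((unsigned long long)total_load + 1);
    den = 100ULL * n_in_ring;
    cap = (num + den - 1) / den;            /* integer ceiling */
    return (unsigned long long)s->load + 1 <= cap;
}

/* Walk the ring from 'start', visiting each slot at most once.
 * Returns the index of the first eligible server, or -1 if there is none,
 * so the walk cannot spin forever even when every server is excluded. */
static int sketch_pick(const struct sketch_server *ring, size_t n,
                       size_t start, unsigned c_pct)
{
    unsigned total = 0, active = 0;
    size_t i, step;

    assert(n > 0 && start < n);
    for (i = 0; i < n; i++) {
        if (ring[i].in_ring) {
            total += ring[i].load;
            active++;
        }
    }
    for (step = 0; step < n; step++) {
        i = (start + step) % n;
        if (sketch_is_eligible(&ring[i], total, active, c_pct))
            return (int)i;
    }
    return -1;
}
```

With loads {5, 1, 3} and c = 1.25, the cap is ceil(1.25 * 10 / 3) = 5, so a lookup starting at the first server skips it and lands on the second. If no server is in the ring, the walk returns -1 after one pass; a walk that only stops on finding an eligible server would never return in that situation.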
I haven't found the cause of this, or been able to pin it down much further
than that it happens fairly reliably when doing a "haproxy -sf" restart under
load. Other than that, I think I have things working properly and would
appreciate a bit of review. My changes are on the "bounded-chash" branch of
github.com/arodland/haproxy — or would you prefer a patch series sent to the