Hi Henry,

I found the bug. It happens when hashing a parameter which is not found
(typically an absent URL parameter) and then the connection to the selected
server experiences connection retries, and the server goes down and up
before the redispatching, and is the last server in the farm when the
redispatch happens, then we can loop forever in chash_get_next_server().

I'm surprized nobody else encountered it since the time it has existed.
Most likely people using consistent hashing mostly apply it to URL or
source address, both of which are always present.

I have fixed it with the attached patch which can be applied as-is to
both 1.4 and 1.5.

Thanks a lot for all the elements you provided, they were extremely useful,
I'm deleting them now.

Best regards,
Willy

>From d16a1b2a818359e8c3ade85f789e66ed7ca9488c Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Fri, 12 Apr 2013 14:46:51 +0200
Subject: BUG/MAJOR: backend: consistent hash can loop forever in certain
 circumstances

When the parameter passed to a consistent hash is not found, we fall back to
round-robin using chash_get_next_server(). This one stores the last visited
server in lbprm.chash.last, which can be NULL upon the first invocation or if
the only server was recently brought up.

The loop used to scan for a server is able to skip the previously attempted
server in case of a redispatch, by passing this previous server in srvtoavoid.
For this reason, the loop stops when the currently considered server is
different from srvtoavoid and different from the original chash.last.

A problem happens in a special sequence : if a connection to a server fails,
then all servers are removed from the farm, then the original server is added
again before the redispatch happens, we have chash.last = NULL and srvtoavoid
set to the only server in the farm. Then this server is always equal to
srvtoavoid and never to NULL, and the loop never stops.

The fix consists in assigning the stop point to the first encountered node if
it was not yet set.

This issue cannot happen with the map-based algorithm since it's based on an
index and not a stop point.

This issue was reported by Henry Qian who kindly provided lots of critically
useful information to figure out the conditions to reproduce the issue.

The fix needs to be backported to 1.4 which is also affected.
---
 src/lb_chash.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/src/lb_chash.c b/src/lb_chash.c
index 58f1c9e..d65de74 100644
--- a/src/lb_chash.c
+++ b/src/lb_chash.c
@@ -332,6 +332,13 @@ struct server *chash_get_next_server(struct proxy *p, 
struct server *srvtoavoid)
                        /* no node is available */
                        return NULL;
 
+               /* Note: if we came here after a down/up cycle with no last
+                * pointer, and after a redispatch (srvtoavoid is set), we
+                * must set stop to non-null otherwise we can loop forever.
+                */
+               if (!stop)
+                       stop = node;
+
                /* OK, we have a server. However, it may be saturated, in which
                 * case we don't want to reconsider it for now, so we'll simply
                 * skip it. Same if it's the server we try to avoid, in which
-- 
1.7.12.4.dirty

Reply via email to