Hi Marco,

On Tue, May 19, 2015 at 08:20:05AM +0200, Willy Tarreau wrote:
> Hi Marco,
>
> On Tue, May 19, 2015 at 08:13:28AM +0200, Marco Corte wrote:
> > On 19/05/2015 05:21, Willy Tarreau wrote:
> > > Hi Marco,
> > >
> > > I think the easiest thing to start with is to run "netstat -atn" on the
> > > backup node to verify if the peers connection is always between the same
> > > two ports or if it changes, indicating a reconnection.
> > >
> >
> > Hi, Willy
> >
> > I have not yet found the 1.5.11 package to test the older version, but I
> > hope that the 'netstat' output is interesting.
> >
> > All IPs are owned by 10.64.38.1, which handles all the traffic, while
> > haproxy on 10.64.38.2 does not receive any connections.
> >
> > I rebooted 10.64.38.2, then ran 'netstat -atn' twice, 5s apart. The
> > output is filtered and sorted.
>
> Wow, so there are multiple attempts to synchronize in parallel, resulting
> in connections being killed and restarted! I have no idea what can cause
> this; normally this used to happen only when one node was started with
> nbproc > 1. I would not have expected to see something like this, and I
> don't understand what can cause it. I'll have to try to reproduce. It's
> possible that I'll contact you to get more information.
Just a quick update to tell you that I could reproduce and explain the issue.
It's a bug that we've always had in the peers, and it was made much easier to
trigger when fixing the missing timeout bug.

The principle is that now, if there's no activity on the peers session, the
timeout strikes after 5s. This shows up on machines with low traffic, but on a
regular production setup it goes unnoticed. However, the fact that it can
happen once in a while triggers an issue with the reconnection: both peers
attempt to connect to the other one, both receive a connection from the other
one, both close, and both start again. There's no random delay to prevent this,
so the situation can last several seconds and cause a lot of traffic. The
farther apart the peers are, the more likely it is to happen. I noticed that by
simply starting strace on one of the processes, I unbalanced their latencies
and immediately put an end to the loop.

We've checked with Emeric and we now see how to introduce the delay into the
existing code (we don't want to cause regressions there, as you can expect, so
we're being very careful). I hope to be able to issue the fix tomorrow.

In the end, it was a bug hidden by another one, something which happens once
in a while.

Thanks for being patient,
Willy
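PS: for illustration only, here is a minimal sketch in plain C (not HAProxy
code; the names and values are made up) of the kind of randomized reconnect
delay that breaks this symmetry: each node waits a base delay plus a random
jitter before retrying, so the two peers stop colliding on every attempt.

    /* Sketch: jittered reconnect delay to break a symmetric connect race.
     * Each side waits a fixed base delay plus a random extra amount, so the
     * two peers almost never retry at exactly the same moment. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define RECONNECT_BASE_MS   500   /* minimum wait before retrying */
    #define RECONNECT_JITTER_MS 2000  /* random extra, differs per node */

    /* Delay to apply before the next connect attempt. */
    static int next_reconnect_delay_ms(void)
    {
        return RECONNECT_BASE_MS + rand() % RECONNECT_JITTER_MS;
    }

    int main(void)
    {
        /* Seed differently on each node/process so the jitter diverges. */
        srand((unsigned)time(NULL) ^ (unsigned)getpid());

        for (int i = 0; i < 3; i++)
            printf("retry #%d after %d ms\n", i, next_reconnect_delay_ms());
        return 0;
    }

With such a jitter, even if both peers drop the session at the same instant,
one of them retries noticeably earlier than the other, accepts the incoming
connection instead of racing it, and the loop stops after one round.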

