Hi Marco,

On Tue, May 19, 2015 at 08:20:05AM +0200, Willy Tarreau wrote:
> Hi Marco,
>
> On Tue, May 19, 2015 at 08:13:28AM +0200, Marco Corte wrote:
> > On 19/05/2015 05:21, Willy Tarreau wrote:
> > > Hi Marco,
> > >
> > > I think the easiest thing to start with is to run "netstat -atn" on the
> > > backup node to verify if the peers connection is always between the same
> > > two ports or if it changes, indicating a reconnection.
> > >
> >
> > Hi, Willy
> >
> > I have not yet found the 1.5.11 package to test the older version, but I
> > hope that the 'netstat' output is interesting.
> >
> > All IPs are owned by 10.64.38.1, which handles all the traffic, while
> > haproxy on 10.64.38.2 does not receive any connections.
> >
> > I rebooted 10.64.38.2, then ran 'netstat -atn' twice, 5s apart. The
> > output is filtered and sorted.
>
> Wow, so there are multiple attempts to synchronize in parallel, resulting
> in connections being killed and restarted! I have no idea what can cause
> this; normally this used to happen only when one node was started with
> nbproc > 1. I would not have expected to see something like this, and I
> don't understand what can cause it. I'll have to try to reproduce. It's
> possible that I'll contact you to get more information.
Just a quick update to tell you that I could reproduce and explain the issue.
It's a bug that we've always had in the peers, and it was made much easier to
trigger when fixing the missing timeout bug.

The principle is that now, if there's no activity on the peers session, the
timeout strikes after 5s. This shows up on machines with low traffic, but on a
regular production setup it goes unnoticed. However, the fact that it can
happen once in a while triggers an issue with the reconnection: both peers
attempt to connect to the other one, both receive a connection from the other
one, both close, and both start again. There's no random delay to prevent this,
so the situation can last several seconds and cause a lot of traffic. The
farther apart the peers are, the more likely it is to happen. I noticed that by
simply starting strace on one of the processes, I unbalanced their latencies
and immediately put an end to the loop.

We've checked with Emeric and we now see how to introduce the delay into the
existing code (we don't want to cause regressions there, as you can expect, so
we're being very careful). I hope to be able to issue the fix tomorrow.

In the end, it was a bug hidden by another one, something which happens once
in a while.

Thanks for being patient,
Willy
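PS: for illustration only, here is a minimal sketch in plain C (not HAProxy
code; the names and values are made up) of the kind of randomized reconnect
delay that breaks this symmetry: each node waits a base delay plus a random
jitter before retrying, so the two peers stop colliding on every attempt.

    /* Sketch: jittered reconnect delay to break a symmetric connect race.
     * Each side waits a fixed base delay plus a random extra amount, so the
     * two peers almost never retry at exactly the same moment. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <unistd.h>

    #define RECONNECT_BASE_MS   500   /* minimum wait before retrying */
    #define RECONNECT_JITTER_MS 2000  /* random extra, differs per node */

    /* Delay to apply before the next connect attempt. */
    static int next_reconnect_delay_ms(void)
    {
        return RECONNECT_BASE_MS + rand() % RECONNECT_JITTER_MS;
    }

    int main(void)
    {
        /* Seed differently on each node/process so the jitter diverges. */
        srand((unsigned)time(NULL) ^ (unsigned)getpid());

        for (int i = 0; i < 3; i++)
            printf("retry #%d after %d ms\n", i, next_reconnect_delay_ms());
        return 0;
    }

With such a jitter, even if both peers drop the session at the same instant,
one of them retries noticeably earlier than the other, accepts the incoming
connection instead of racing it, and the loop stops after one round.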

