Hi Emerson,

On Tue, Jan 15, 2019 at 12:21:07PM +0100, Emerson Gomes wrote:
> Hi Willy, Tim,
> 
> I am providing some more details about my setup if you wish to try to
> reproduce the issue.
> As I mentioned before, I have 5 HAProxy nodes, all of them listening to
> public IPs.
> My DNS is setup with round-robin mode on AWS R53, resolving to one of the
> HAProxy nodes individual IPs for each request.
> It means that very commonly one client will have multiple connections with
> many (or even all) nodes in the cluster - Also they do tend to
> connect/disconnect fast (little keep-alive usage), making this race
> condition quite likely to happen.
> 
> I suppose the scenario Tim described earlier is accurate:
> 
> - Connect to peer A     (A=1, B=0)
> - Peer A sends 1 to B   (A=1, B=1)
> - Kill connection to A  (A=0, B=1)
> - Connect to peer B     (A=0, B=2)
> - Peer A sends 0 to B   (A=0, B=0)
> - Peer B sends 0/2 to A (A=?, B=0)
> - Kill connection to B  (A=?, B=-1)
> - Peer B sends -1 to A  (A=-1, B=-1)

Got it! I thought the problem was local to a process and that we
replicated bad data, but in fact it's not: it's a distributed race.
In that case there is no other short-term solution, and the drift has
no reason to accumulate significantly over time. The only long-term
solution I can see to work around this specific pattern would be to
keep such values as differential pairs:
  - count and synchronize the number of ++
  - count and synchronize the number of --
In that case the real value is the difference between the two. But
it's a bit overkill and still prone to other races when connections
appear in parallel on the two peers; at that point it's better to use
an external aggregator.

OK I'm merging Tim's patch now.

Thanks!
Willy
