On Fri, 11 Dec 2020 at 15:21, Christopher Faulet <[email protected]> wrote:
>
> On 11/12/2020 at 11:45, Christopher Faulet wrote:
> > On 10/12/2020 at 19:38, Peter Statham wrote:
> >>   > Sorry for the delay in getting back to you.  It is the same crash,
> >>   > we've been trying to narrow down the exact combination of compiler,
> >>   > libraries, kernel, hypervisor, etc. that causes the issue now that we
> >>   > know it isn't universal but that's turning out to be trickier than
> >>   > identifying the issue.
> >>   >
> >>   > I only backported the changes to the src/lb_fwlc.c file, but
> >>   > backporting 1b87748ff5 seems to work just as well.  So far we haven't
> >>   > been able to provoke the issue with the changes in 1b87748ff5 applied
> >>   > to the 1.8 tree so that does look like a solution.
> >>   >
> >>   > We will keep testing and trying to narrow the issue down.
> >>
> >> Since I wrote the above I have managed to replicate the issue on 1.8 with
> >> that backport applied, so it looks as if that was not the solution after all.
> >>
> >> I include a binary built from 1.8.27 with 1b87748ff5 backported and a core dump.
> >>
> >> haproxy-1.8.27+1b87748ff5
> >> <https://drive.google.com/file/d/1KPs3rBpkeqE9GEOfjF8Ocycd1wa4RjqW/view?usp=drive_web>
> >> haproxy-1.8.27+1b87748ff5.core
> >> <https://drive.google.com/file/d/1chBPoogHBuGlnV1o5sO9YP6BldpRH4d3/view?usp=drive_web>
> >>
> >
> >
> > Thanks Peter, I'll try to take a look today. Is the reproducer the same?
> >
>
> OK, in fact it is pretty easy to reproduce. Because I found a similar bug on
> newer versions, I had not tested on 1.8. Unfortunately, there is a second
> bug, specific to 1.8.
>
> I attached a patch that should fix it. In fact, the bug exists because of the
> rendez-vous point, which was removed in newer versions. On 1.8, there may be
> a short delay before server state changes are committed, because we must wait
> for all threads. Thus, we must take care not to use info from the next state
> too early, and that is the bug here. In the leastconn algorithm, the next
> server weight is used, instead of the current one, to reposition the server in
> the tree. The next server weight must only be used once the server state
> changes are committed.
>
> Peter, could you confirm it fixes your bug?
> --
> Christopher Faulet
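
For anyone following the thread, the current-versus-next distinction
described above can be illustrated with a small standalone sketch. This is
not the haproxy code or the attached patch: the struct, the field names
(cur_weight, next_weight), the helpers and DEMO_WEIGHT_SCALE are all
hypothetical, and the key formula is only an assumed stand-in for a
leastconn-style ordering key.

```c
#include <assert.h>

/* Hypothetical scaling factor, chosen only for this illustration. */
#define DEMO_WEIGHT_SCALE 256u

/* Hypothetical server record: one committed weight and one pending weight. */
struct demo_srv {
    unsigned int served;      /* connections currently served */
    unsigned int cur_weight;  /* weight in effect right now (committed) */
    unsigned int next_weight; /* pending weight, not yet committed */
};

/*
 * Ordering key used to position the server in the tree (lower sorts first).
 * It deliberately reads cur_weight only: next_weight describes a state that
 * other threads may not have agreed on yet.
 */
static unsigned int demo_key(const struct demo_srv *s)
{
    assert(s);
    if (s->cur_weight == 0)
        return ~0u; /* a zero-weight server sorts last in this sketch */
    return s->served * DEMO_WEIGHT_SCALE / s->cur_weight;
}

/* Commit step: only from here on does the pending weight become current. */
static void demo_commit(struct demo_srv *s)
{
    assert(s);
    s->cur_weight = s->next_weight;
}
```

With served = 10, cur_weight = 16 and next_weight = 4, demo_key() gives 160
before demo_commit() and 640 after it. Computing the key from next_weight
before the commit would place the server as if the change had already taken
effect.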

The patch seems to fix the issue.

I've built a new version of haproxy 1.8.27 with the patch applied on
both Debian and CentOS under VMware.  I then ran these builds
concurrently with my previous builds on both platforms using
configuration files that are identical save for the bind address.

I can reproduce the bug with the existing build but not with the one
with your patch applied.

I'll ask some of my colleagues to double check my tests.

--

Peter Statham
Loadbalancer.org Ltd.
