On Mon, 26 Oct 2020 at 13:05, Peter Statham
<[email protected]> wrote:
>
> On Mon, 19 Oct 2020 at 10:00, Christopher Faulet <[email protected]> wrote:
> >
> > > On 16/10/2020 at 10:04, Christopher Faulet wrote:
> > > > On 13/10/2020 at 14:53, Peter Statham wrote:
> > >> Hello,
> > >>
> > > >> We've found an issue when using agent checks in conjunction with the
> > > >> weighted least connections algorithm in multithreaded mode.  It seems to
> > > >> me as if it is possible for next_eweight in struct server to be modified
> > > >> in another thread during the execution of fwlc_srv_reposition.  If
> > > >> next_eweight is set to zero then a division by zero occurs on line 59 in
> > > >> src/lb_fwlc.c in fwlc_queue_srv.
> > >>
> > > >> I notice that in haproxy-2.0.18 this section of code is protected by
> > > >> HA_SPINLOCKs and I've been unable to replicate this issue in that version.
> > >>
> > > >> I've written an agent (attached), bad_agent.py, which provokes this
> > > >> condition by switching randomly between 1 and 0 percent.  I also include a
> > > >> minimal configuration, cfg (also attached), which seems sufficient to
> > > >> cause the issue.  With these two running,
> > > >> “ab -n 5000000 -c 500 http://192.168.92.1:8080/” will quickly crash the
> > > >> haproxy process.
> > >>
> > > >> I include links to a coredump and the binary that generated it
> > > >> (unstripped).  The backtrace of thread 1 follows.
> > >>
> > >
> > > Hi,
> > >
> > > > Thanks for the reproducer. I'm able to crash HAProxy too using your config
> > > > and your agent. It seems to only crash on 1.8. I'll investigate.
> > >
> >
> > Hi,
> >
> > > In fact, it fails in all branches that support threads. The leastconn and
> > > first load-balancing algorithms are affected by this bug. In leastconn, it
> > > may crash because of the division by 0 when the server weight is set to 0.
> > > But for both algos, the server tree may also be corrupted, leading to
> > > stranger, undefined bugs.
> >
> > > I pushed a fix (commit 26a52a) and backported it as far as 1.8. So, it
> > > should be fixed in all branches now.
> >
> > > Thanks!
> > --
> > Christopher Faulet
>
> Thank you for making a patch for this bug, Christopher.  I've checked
> out the 1.8 master (I would have done so sooner, but I'm afraid I
> didn't have access to my email last week) and I'm happy to say I can't
> replicate the crash. :)
>
> --
> Peter Statham

Hi,

I might have spoken too soon.

The latest release of 1.8 works flawlessly on my Debian desktop but
still crashes when I attempt the same configuration on a CentOS
virtual machine on our VMware cluster.

I'm not sure if this is down to differences in the way memory fencing
or thread scheduling work on these platforms or if it is a
library/compiler issue.  Backporting the LBPRM spinlocks from 1.9's
src/lb_fwlc.c seems to help but I will continue investigating and
hopefully rule out some of the other possibilities.
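
To illustrate the kind of race I suspect, here is a minimal sketch in C.
It is not HAProxy's code: struct demo_server, DEMO_WEIGHT_SCALE and both
functions are hypothetical names, and it takes one atomic snapshot of the
weight instead of using the LBPRM lock.  It only covers the division by
zero; it does nothing to protect the server tree, which is what the lock
is for.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration, not HAProxy's definitions. */
#define DEMO_WEIGHT_SCALE 256u

struct demo_server {
    atomic_uint weight;   /* another thread (e.g. an agent check) may set this to 0 */
    unsigned int served;  /* connections currently being served */
};

/* Racy pattern: the weight is read twice, so another thread can store 0
 * between the check and the division.  Safe only when single-threaded. */
int demo_key_racy(struct demo_server *s, unsigned int *key)
{
    assert(key != NULL);
    if (atomic_load(&s->weight) == 0)
        return 0;
    /* window: a concurrent store of 0 here leads to a division by zero */
    *key = s->served * DEMO_WEIGHT_SCALE / atomic_load(&s->weight);
    return 1;
}

/* Snapshot pattern: the weight is read once into a local, so the value
 * that was checked is the value used as the divisor. */
int demo_key_snapshot(struct demo_server *s, unsigned int *key)
{
    unsigned int w;

    assert(key != NULL);
    w = atomic_load(&s->weight);
    if (w == 0)
        return 0;         /* weight 0: leave *key untouched, do not queue */
    *key = s->served * DEMO_WEIGHT_SCALE / w;
    return 1;
}
```

Both functions return 1 and fill in *key when the weight is non-zero, and
return 0 without touching *key otherwise.  Only the snapshot variant keeps
that promise if the weight changes underneath it.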

-- 
Peter Statham
Loadbalancer.org Ltd.
