Hi Patrick,
On Thu, May 18, 2017 at 05:44:30PM -0400, Patrick Hemmer wrote:
>
> On 2017/1/17 17:02, Willy Tarreau wrote:
> > Hi Patrick,
> >
> > On Tue, Jan 17, 2017 at 02:33:44AM +0000, Patrick Hemmer wrote:
> >> So on one of my local development machines haproxy started pegging the
> >> CPU at 100%
> >> `strace -T` on the process just shows:
> >>
> >> ...
> >> epoll_wait(0, {}, 200, 0) = 0 <0.000003>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.000003>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.000003>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.000003>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.000003>
> >> epoll_wait(0, {}, 200, 0) = 0 <0.000003>
> >> ...
> > Hmm not good.
> >
> >> Opening it up with gdb, the backtrace shows:
> >>
> >> (gdb) bt
> >> #0 0x00007f4d18ba82a3 in __epoll_wait_nocancel () from /lib64/libc.so.6
> >> #1 0x00007f4d1a570ebc in _do_poll (p=<optimized out>, exp=-1440976915)
> >> at src/ev_epoll.c:125
> >> #2 0x00007f4d1a4d3098 in run_poll_loop () at src/haproxy.c:1737
> >> #3 0x00007f4d1a4cf2c0 in main (argc=<optimized out>, argv=<optimized
> >> out>) at src/haproxy.c:2097
> > Ok so an event is not being processed correctly.
> >
> >> This is haproxy 1.7.0 on CentOS/7
> > Ah, that could be a clue. We've had 2 or 3 very ugly bugs in 1.7.0
> > and 1.7.1. One of them is responsible for the few outages on haproxy.org
> > (last one happened today, I left it running to get the core to confirm).
> > One of them is an issue with the condition to wake up an applet when it
> > failed to get a buffer first and it could be what you're seeing. The
> > other ones could possibly cause some memory corruption resulting in
> > anything.
> >
> > Thus I'd strongly urge you to update this one to 1.7.2 (which I'm going
> > to do on haproxy.org now that I could get a core). Continue to monitor
> > it but I'd feel much safer after this update.
> >
> > Thanks for your report!
> > Willy
> >
> So I just had this issue recur, this time on version 1.7.2.
OK. If it's still doing it, capturing the output of "show sess all" on
the CLI could help a lot.
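If the process is spinning right now, something like this will dump all
sessions over the admin socket (the socket path below is only an example;
use whatever your "stats socket" directive points to):

```shell
# Dump all sessions from the running process. The socket path is an
# assumption; match it to the "stats socket" line in your haproxy.cfg.
echo "show sess all" | socat stdio /var/run/haproxy.sock
```

The full dump shows each stream's flags, analysers and buffer states, which
is exactly what's needed to see what the looping task is waiting for.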
I've looked at the changelog since 1.7.2, and several bugs possibly
responsible for this have since been fixed:
- c691781 ("BUG/MEDIUM: stream: fix client-fin/server-fin handling")
=> fixes a cause of 100% CPU when timeout client-fin/server-fin are
used
- 57393fb ("BUG/MEDIUM: buffers: Fix how input/output data are injected into
  buffers")
  => fixes the computation of free buffer space; it's unknown whether we've
     ever hit this bug. It could cause some applets like stats to fail to
     write and to be called again immediately, for example. It could also
     cause some filters to be woken up to do nothing; compression might
     trigger this.
- there are a bunch of polling-related fixes which might have got rid of
such a bad situation on certain connect() cases, possibly over unix
sockets.
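For reference, the first fix above only matters when the FIN timeouts are
actually configured; they look like this in a config (values are purely
illustrative):

```
defaults
    mode    http
    timeout connect     5s
    timeout client      30s
    timeout server      30s
    # The two directives involved in the c691781 fix: how long to wait
    # for the other side once a shutdown has been seen on one side.
    timeout client-fin  5s
    timeout server-fin  5s
```

If neither client-fin nor server-fin appears in your config, that particular
fix cannot be the one you're hitting.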
There are a few other fixes in the queue that I need to backport, but none
of them is related to this. So if this is urgent, 1.7.5 could help by
bringing the fixes above. If you can wait a few more days, I expect to
issue 1.7.6 early next week.
Cheers,
Willy