Hi Willy,

Thanks for your response and the debugging details.

> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?

You are right! We have this: "timeout client-fin 30000ms"
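
For context, the relevant timeout section of our config looks roughly
like this (only the client-fin value above is confirmed; the other
directives and values here are illustrative):

    defaults
        mode tcp
        timeout client      50000ms
        timeout server      50000ms
        # the timeout you asked about:
        timeout client-fin  30000ms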

> So at least you have 3 times 196 bugs in production :-)

And many times that number, since we have *lots* of servers handling the
Flipkart traffic. Thanks for pointing this out.

So we will upgrade once our internal processes are sorted out. Thanks once
again for the quick diagnosis of the source of the problem.

Regards,
- Krishna


On Fri, May 19, 2017 at 10:34 AM, Willy Tarreau <[email protected]> wrote:

> Hi Krishna,
>
> On Fri, May 19, 2017 at 09:47:52AM +0530, Krishna Kumar (Engineering)
> wrote:
> > I saw many similar issues posted earlier by others, but could not find
> > a thread where this is resolved or fixed in a newer release. We are
> > using Ubuntu 16.04 with distro HAProxy (1.6.3), and see that HAProxy
> > spins at 100% with 1-10 TCP connections, sometimes just 1 - a stale
> > connection that does not seem to belong to any frontend session.
> > Strace with -T shows the following:
>
> In fact a few bugs have caused this situation and all known ones were
> fixed, which doesn't mean there is none left of course. However your
> version is totally outdated and contains tons of known bugs which were
> later fixed (196 total, 22 major, 78 medium, 96 minor):
>
>    http://www.haproxy.org/bugs/bugs-1.6.3.html
>
> > The single connection has this session information:
> > 0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4 source=a.a.a.a:35297
> >   flags=0x1ce, conn_retries=0, srv_conn=0xca4000, pend_pos=(nil)
> >   frontend=fe-fe-fe-fe-fe-fe (id=3 mode=tcp), listener=? (id=1) addr=b.b.b.b:5667
> >   backend=be-be-be-be-be-be (id=4 mode=tcp) addr=c.c.c.c:11870
> >   server=d.d.d.d (id=4) addr=d.d.d.d:5667
> >   task=0xd1d710 (state=0x04 nice=0 calls=1117789229 exp=<PAST>, running age=12d11h)
> >   si[0]=0xd1d988 (state=CLO flags=0x00 endp0=CONN:0xd771c0 exp=<NEVER>, et=0x000)
> >   si[1]=0xd1d9a8 (state=EST flags=0x10 endp1=CONN:0xccadb0 exp=<NEVER>, et=0x000)
> >   co0=0xd771c0 ctrl=NONE xprt=NONE data=STRM target=LISTENER:0xc76ae0
> >       flags=0x002f9000 fd=55 fd.state=00 fd.cache=0 updt=0
> >   co1=0xccadb0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0xca4000
> >       flags=0x0020b310 fd=9 fd_spec_e=22 fd_spec_p=0 updt=0
> >   req=0xd1d7a0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
> >       an_exp=<NEVER> rex=? wex=<NEVER>
> >       buf=0x6e9120 data=0x6e9134 o=0 p=0 req.next=0 i=0 size=0
> >   res=0xd1d7e0 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
> >       an_exp=<NEVER> rex=<NEVER> wex=<NEVER>
> >       buf=0x6e9120 data=0x6e9134 o=0 p=0 rsp.next=0 i=0 size=0
>
>
> That's quite useful, thanks!
>
>  - connection with client is closed
>  - connection with server is still established and theoretically stopped
>    from polling
>  - the request channel is closed in both directions
>  - the response channel is closed in both directions
>  - both buffers are empty
>
> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?
>
> I'm also noticing that the session is aged 12.5 days. So either it has
> been looping for this long (after all the function has been called 1
> billion times), or it was a long session which recently timed out.
>
> > We have 3 systems running the identical configuration and haproxy binary,
>
> So at least you have 3 times 196 bugs in production :-)
>
> > and the 100% CPU is ongoing for the last 17 days on one system. The
> > client connection is no longer present. I am assuming that a haproxy
> > reload would solve this as the frontend connection is not present, but
> > have not tested it out yet. Since this box is in production, I am
> > unable to do invasive debugging (e.g. gdb).
>
> For sure. At least an upgrade to 1.6.12 would get rid of most of these
> known bugs. You could perform a rolling upgrade, starting with the machine
> having been in that situation for the longest time.
>
> > Please let me know if this is fixed in a later release, or any more
> > information that can help find the root cause.
>
> For me everything here looks like the client-fin/server-fin bug that was
> fixed two months ago, so if you're using this it's very likely fixed. If
> not, there's still a small probability that the fixes made to better
> deal with wakeup events in the case of the server-fin bug could have
> addressed a wider class of bugs: often we find one way to enter a
> certain bogus condition and hardly imagine all other possibilities.
>
> Regards,
> Willy
>
