Ok - I will happily test more patches to pin those issues down :)

There is one thing where I'm not sure whether it's related. We currently have the system set up with a rather small MSL setting:

net.inet.tcp.msl=5000

We also have net.inet.tcp.fast_finwait2_recycle enabled, so we don't have that many connections hanging around in TIME_WAIT and FIN_WAIT_2.
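For reference, this is how the two FreeBSD sysctls mentioned above are applied (the values shown are the ones from our setup; persisting them in /etc/sysctl.conf is the usual approach):

```shell
# Shorten the maximum segment lifetime (in ms), which also shortens TIME_WAIT
sysctl net.inet.tcp.msl=5000

# Recycle FIN_WAIT_2 connections quickly instead of keeping them around
sysctl net.inet.tcp.fast_finwait2_recycle=1
```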

Now, a few days ago I read that reducing the TIME_WAIT interval could actually be harmful:

http://vincent.bernat.im/en/blog/2014-tcp-time-wait-state-linux.html

And I wonder whether it's somehow related, or at least provoking this kind of bug/edge case. I have to admit I'm reaching the limits of my competence here, so this is really just a wild guess, but maybe you can confirm or deny that assumption.

Kind regards,

John

Willy Tarreau wrote:
On Thu, May 08, 2014 at 01:46:30PM +0200, Willy Tarreau wrote:
On Thu, May 08, 2014 at 12:47:06PM +0200, John-Paul Bader wrote:
OK, after roughly another 2 hours the next core dump happened,
this time with the nokqueue option set.

The coredump looks very similar and now crashes with ev_poll instead of
ev_kqueue:
Indeed it's exactly the same.

(gdb) bt full
#0  kill_mini_session (s=0x80337f800) at src/session.c:299
        level = 6
        conn = (struct connection *) 0x0
        err_msg = <value optimized out>
#1  0x0000000000463928 in conn_session_complete (conn=0x802e29480) at
src/session.c:355
        s = (struct session *) 0x80337f800
#2  0x0000000000432769 in conn_fd_handler (fd=<value optimized out>) at
src/connection.c:88
        conn = <value optimized out>
        flags = 41997063
That's 0x0280d307:

   - {SOCK,DATA,CURR}_RD_ENA=1       =>  it's a handshake, waiting for reading
   - {SOCK,DATA,CURR}_WR_ENA=0       =>  no need for writing
   - CTRL_READY=1                    =>  FD is still allocated
   - XPRT_READY=1                    =>  transport layer is initialized
   - ADDR_FROM_SET=1, ADDR_TO_SET=0  =>  clearly it's a frontend connection
   - INIT_DATA=1, WAKE_DATA=1        =>  processing a handshake (ssl I guess)
   - {DATA,SOCK}_{RD,WR}_SH=0        =>  no shutdown
   - ERROR=0, CONNECTED=0            =>  handshake not completed yet
   - WAIT_L4_CONN=0                  =>  normal
   - WAIT_L6_CONN=1                  =>  waiting for an L6 handshake to complete
   - SSL_WAIT_HS=1                   =>  the pending handshake is an SSL handshake

So at least what we know is that it receives an SSL handshake, and that
it dies while processing it and before any error is reported on the
connection. I'll recheck the error path there in case I find anything.

In fact it's a bit more precise: flags is a copy of connection->flags on entry
to the function. It indicates a handshake is in progress. And the only way to
reach line 88 is for the handshake to complete without error. So we know for
sure that ssl_sock_handshake() was called, completed the handshake, then
removed the CO_FL_SSL_WAIT_HS flag from the connection.

And that becomes interesting, because ssl_sock_handshake() is split into
several parts. The first one handles renegotiations. We could not have entered
it because CONNECTED=0. Then SSL_do_handshake() is called. If it does not
return exactly 1, we enter the second block, which always returns without
clearing the handshake flag. So we also know we did not enter that block. Thus
the handshake succeeded and went to the reneg_ok label. The third block there
is dedicated to the server side, which is not the case here. And we leave...

So whatever causes the trouble happens within SSL_do_handshake(), which means
it must be in one of our callbacks. Thus it could still be related to the
session cache after all. Maybe a corner case remains. I'll check that.

Regards,
Willy


--
John-Paul Bader | Software Development

www.wooga.com
wooga GmbH | Saarbruecker Str. 38 | D-10405 Berlin
Sitz der Gesellschaft: Berlin; HRB 117846 B
Registergericht Berlin-Charlottenburg
Geschaeftsfuehrung: Jens Begemann, Philipp Moeser
