Hi John-Paul,

On Wed, May 07, 2014 at 09:22:32AM +0200, John-Paul Bader wrote:
> Hey Willy,
> 
> 
> this morning I was running another test without kqueue but sadly with 
> the same result.

OK so let's rule out any possible kqueue issue there for now.

> Here is my test protocol:
> 
> Running fine with nokqueue for about an hour at about 20% CPU per 
> process, then sudden CPU spike on all processes up to 90%, I started 
> ktrace but meanwhile the CPU went back to around 33% on each process 
> [2]. Then after 10 more minutes 3 of the 8 haproxy processes died with a 
> segfault.
> 
> kernel: pid 3963 (haproxy), uid 0: exited on signal 11 (core dumped)
> 
> Unfortunately the coredump [1] is not that expressive even with compiled 
> debug symbols.

It's very interesting: it contains a call to ssl_update_cache(). I didn't
know you were using SSL, but in multi-process mode we use a shared-context
model to share the SSL sessions between processes. On Linux we almost
exclusively use futexes; on other systems we use mutexes. So that's a
difference, and it's possible that a bug in the mutex implementation is
causing various effects.

You could try to rebuild with the private cache mode, but it will be a bit
more complicated: if you have a high load, I guess you want to keep your
users' sessions, so you'll probably need one front shared process running
in TCP mode and distributing the load to the SSL processes according to
the SSL ID, in order to maintain stickiness between users and processes.
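For reference, SSL-ID stickiness is usually done in TCP mode with a
stick-table keyed on the session ID extracted from the ClientHello. An
untested sketch (names and addresses are illustrative, and the "..." of
course means the rest of your servers):

```
listen ssl-dispatch
   bind-process 1
   mode tcp
   bind pub_ip:443
   # wait until the TLS ClientHello has arrived before routing
   tcp-request inspect-delay 5s
   tcp-request content accept if { req_ssl_hello_type 1 }
   # stick on the SSL session ID found in the ClientHello
   stick-table type binary len 32 size 30k expire 30m
   stick on payload_lv(43,1) if { req_ssl_hello_type 1 }
   balance roundrobin
   server proc2 127.0.0.2:443
   ...
```

As you can see it's heavier than a simple source hash, which is why I
suggest the hash first.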

The fact that you have no symbols in your gdb output indicates that the
crash very likely happens inside libssl, maybe it retrieves some crap from
the session cache that it cannot reliably deal with.

> The remaining 5 processes survived another 10 minutes before they ramped 
> up cpu again - this time up to 100%. I have created a ktrace in this 
> state before reaching the 100%. [3]
> 
> When they were at 100% and not accepting any requests anymore I took 
> another ktrace sample but _nothing_ was written to the output anymore! 

That could indicate an attempt to acquire a lock in a loop, or simply
that the code is looping in userspace due to a side effect of some memory
corruption resulting from the bug, for example.

> That means in this state no syscalls were happening anymore? I also 
> took a full ktrace sample with IO and everything - it was empty as well.

Oh and BTW, I can confirm that ktrace is really poor compared to strace :-)

> So it seems unrelated to kqueue as well. Later I will try to run the 
> test with a fraction of the traffic without nbproc (all the traffic is 
> too much for one process)

That would be great! You can try to build with "USE_PRIVATE_CACHE=1" in
order to disable session sharing.
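For example, on your platform that would be something like this (assuming
a FreeBSD target and an OpenSSL-enabled build; adjust TARGET and your
other usual flags):

```shell
$ make TARGET=freebsd USE_OPENSSL=1 USE_PRIVATE_CACHE=1
```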

If you have plenty of clients, you can first try to spread the load between
processes using a simple source hash:

listen front
   bind-process 1
   bind pub_ip:443
   mode tcp
   balance source
   server proc2 127.0.0.2:443 send-proxy
   server proc3 127.0.0.3:443 send-proxy
   server proc4 127.0.0.4:443 send-proxy
   server proc5 127.0.0.5:443 send-proxy
   server proc6 127.0.0.6:443 send-proxy
   server proc7 127.0.0.7:443 send-proxy
   server proc8 127.0.0.8:443 send-proxy

frontend proc2
   bind-process 2
   bind 127.0.0.2:443 ssl crt ... accept-proxy
   ... usual stuff

frontend proc3
   bind-process 3
   bind 127.0.0.3:443 ssl crt ... accept-proxy
   ... usual stuff

etc., up to process 8.

It's much easier than dealing with the SSL ID and can probably be done with
fewer adjustments to your existing configuration. And that way you don't
need to share any SSL context between your processes.

Please tell me if you need help setting up something like this.

Willy
