Hello Dariusz,

On Fri, Mar 19, 2010 at 08:32:07PM +0100, Dariusz Suchojad wrote:
> 
> Hello,
> 
> first off - thank you very much for such a wonderful piece of software! 
> I'm using HAProxy on Linux/x86 and it's been a wonderful experience so far.
> 
> Things are a bit different on z/Linux, using s390 architecture. HAProxy 
> was running very nicely for several hours and then suddenly started to 
> segfault and hang after I had to restart it.

Hmmm really bad :-(

> Here's some background information:
> 
> $ uname -a
> Linux vlbt12 2.6.18-92.el5 #1 SMP Tue Apr 29 13:16:58 EDT 2008 s390x 
> s390x s390x GNU/Linux
> $
> 
> $ ./src/haproxy-1.4.2/haproxy -vv
> HA-Proxy version 1.4.2 2010/03/17
> Copyright 2000-2010 Willy Tarreau <[email protected]>
> 
> Build options :
>   TARGET  = linux26
>   CPU     = generic
>   CC      = gcc
>   CFLAGS  = -O2 -g
>   OPTIONS =
> 
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200
> 
> Encrypted password support via crypt(3): yes
> 
> Available polling systems :
>      sepoll : pref=400,  test result OK
>       epoll : pref=300,  test result OK
>        poll : pref=200,  test result OK
>      select : pref=150,  test result OK
> Total: 4 (4 usable), will use sepoll.
> 
> $

nothing unusual here.

> Configuration file:
> 
> global
>     log 127.0.0.1:514 local0 debug
> 
> defaults
>     log global
>     option httpclose
>     timeout connect 5000
>     timeout client 5000
>     timeout server 5000
> 
> backend http-plain-bck
>     mode http
>     balance roundrobin
>     server httpplain01 127.0.0.1:7081 check inter 2s rise 2 fall 2
>     server httpplain02 127.0.0.1:7082 check inter 2s rise 2 fall 2
>     option httpchk GET /http-check
> 
> 
> frontend plain
>     bind 0.0.0.0:37080
>     mode http
>     maxconn 300
>     default_backend http-plain-bck
>     option httplog

nor here.

> The client application is SoapUI http://soapui.com - they're using an 
> embedded Jakarta Commons HTTP client 3.0.1 underneath. The servers are 
> IBM's proprietary Message Brokers. I'm sending HTTP POST data which is 
> then proxied over two backends.
> 
> I've read http://www.formilux.org/archives/haproxy/0910/2562.html but I 
> think my case is a bit different as I have no problems with starting 
> HAProxy. And unfortunately I'm not able to use Valgrind as it seems not 
> to be supported on that architecture.

I believe it was ElectricFence which used to add unmapped memory areas
around allocated arrays to catch overflows. Maybe that would work on
your platform ?

> When the 'hang' happens it's just that - the client waits for several 
> dozens of seconds and I'm able to send the SIGQUIT signal, here's the 
> output from HAProxy upon receiving it:
> 
> $ ./src/haproxy-1.4.2/haproxy -V -f ./problematic.conf
> Available polling systems :
>      sepoll : pref=400,  test result OK
>       epoll : pref=300,  test result OK
>        poll : pref=200,  test result OK
>      select : pref=150,  test result OK
> Total: 4 (4 usable), will use sepoll.
> Using sepoll() as the polling mechanism.
> 
> Dumping pools usage.
>   - Pool pipe (32 bytes) : 0 allocated (0 bytes), 0 used, 2 users [SHARED]
>   - Pool capture (64 bytes) : 0 allocated (0 bytes), 0 used, 1 users 
> [SHARED]
>   - Pool task (144 bytes) : 6 allocated (864 bytes), -19793676 used, 1 
> users [SHARED]
>   - Pool hdr_idx (832 bytes) : 4 allocated (3328 bytes), -19793678 
> used, 2 users [SHARED]
>   - Pool requri (1024 bytes) : 3 allocated (3072 bytes), 2 used, 1 
> users [SHARED]
>   - Pool session (1344 bytes) : 3 allocated (4032 bytes), -19793678 
> used, 1 users [SHARED]
>   - Pool buffer (16512 bytes) : 8 allocated (132096 bytes), -39587356 
> used, 1 users [SHARED]
> Total: 7 pools, 143392 bytes allocated, 9081850944 used.
> 
> I'm not sure how to interpret it but the numbers look a bit odd, don't 
> they? 143392 bytes have been allocated yet 9081850944 have been used?

It means that your memory got corrupted. After that, anything can
happen. That explains why you see segfaults and freezes (e.g. infinite
timeouts caused by wrong session flags).

> And here's what syslog shows just before I send SIGQUIT

(...)
> [19/Mar/2010:19:29:22.489] plain http-plain-bck/httpplain01 0/0/0/31/32 
> 404 428 - - ---- -18798/-18798/2/0/0 0/0 "POST /customer/profile HTTP/1.1"

This one shows that the proxy struct was corrupted too.

(...)
> I have attached some basic data from a gdb session.

I'm afraid it will not help in case of memory corruption, because once
the corruption is done, anything can happen and the issue will only
trigger much later, in an innocent piece of code.

> I'd love to give 
> more information but I'm not sure what I should do next, can you please 
> guide me a bit here? What should I take a look at now?

One thing that could happen is that haproxy detects an invalid
request or response, captures it, but overflows for whatever reason and
corrupts some frontend/backend data. You could check that by enabling
the stats socket in your global config, and connecting to it this way:

   echo show errors | socat stdio unix-connect:/tmp/socket

(assuming you set it to /tmp/socket)


We had one similar corruption issue during the development phase, which
was fixed. The issue was that we could incorrectly compute the max size
for a recv() call, then pass a negative value, which was understood by
the kernel as a large positive value. When the socket buffers had enough
data pending, the data could overflow the recv buffer and corrupt
anything.
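
To illustrate the failure mode described above: recv() takes its length
as a size_t, so a negative int is converted to a huge unsigned value
before the kernel ever sees it. The following is only a minimal sketch;
the function names are hypothetical and are not taken from the haproxy
sources.

```c
#include <stddef.h>
#include <stdint.h>

/* What recv() effectively receives when an int length is passed as its
 * size_t "len" argument: a negative int wraps to a huge unsigned value. */
size_t len_as_seen_by_recv(int max)
{
    return (size_t)max;
}

/* Hypothetical helper: returns 1 when "max" is a sane length for a
 * buffer of "buf_size" bytes, 0 when it is negative or too large. */
int recv_len_is_safe(int max, int buf_size)
{
    return max >= 0 && max <= buf_size;
}
```

With a 16384-byte buffer, recv_len_is_safe(-1, 16384) returns 0, while
len_as_seen_by_recv(-1) is SIZE_MAX, far beyond any real buffer size.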

There's something easy you can do to check if it's that : in
src/stream_sock.c, there's only one recv() call. Simply check
that the max value is within bounds :

+               if (max < 0 || max > b->size)
+                       abort();
                ret = recv(fd, b->r, max, 0);

If you believe you can reproduce it, running it under strace would
help immensely : "strace -tt -s 200 -o trace.log haproxy -[args]".

Also, do you see any build warnings ? It's possible that we have
a wrong type somewhere that behaves differently on your platform. I
once got caught by unsigned chars on PPC for instance.
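
As a sketch of the kind of type difference meant here: the signedness of
plain "char" is chosen by the platform ABI, so the same code can promote
a byte to a different int value on different architectures. It is
probably worth checking which one your s390x toolchain uses (the
assumption being that it may differ from x86). The function names below
are hypothetical.

```c
#include <limits.h>

/* Returns 1 if plain "char" is signed on this platform, 0 if unsigned. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Stores a byte in a plain char and returns it after the implicit
 * promotion to int. Bytes >= 0x80 come back negative where char is
 * signed and positive where it is unsigned. */
int promote_plain_char(unsigned char byte)
{
    char c = (char)byte;
    return c;
}
```

Code that compares such a value against a negative sentinel, or uses it
as an array index, behaves differently depending on the result of
plain_char_is_signed().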

Last, are you aware of any version that has worked reliably on
your platform ?

Thanks for your very detailed report,
Willy

