Hi Joe,

On Thu, Sep 16, 2010 at 04:49:00PM +0200, R.Nagy József wrote:
Some more details: I let the production server suffer two more times to
test a narrowed-down config. The new config ran only as a rate-limiting
haproxy 1.5-dev instance, with a 1.3 instance running in the background
handling the real backend work.

I really appreciate your involvement in trying to get this issue solved.
You are most welcome


So the config for the 1.5 rate limiter (still dying) was narrowed down to:

global
        log     127.0.0.1       daemon  debug
        maxconn 1024
        chroot /var/chroot/haproxy2
        uid 99
        gid 99
        daemon
        quiet
        pidfile /var/run/haproxy-private2.pid

One thing that could be very useful would be to add the stats socket here
in the global section :

        stats socket /tmp/haproxy.sock level admin mode 666
        stats timeout 1d

Then using the "socat" tool, you can connect to it and launch some
commands to inspect the internal state :

 $  socat readline unix-connect:/tmp/haproxy.sock
 prompt
 > show info
 > show stat
 > show sess
 > show table
 > show table mySite-webfarm

I'm particularly interested in those outputs, they will make it easier
to find if we're facing a memory corruption, a resource shortage or any
such trouble. If it's easier for you, you can also chain all the commands
at once and avoid long copy-pastes :
Okay, just did this (output below), but please confirm whether you want me to run this again after haproxy has died.

Name: HAProxy
Version: 1.5-dev2
Release_date: 2010/08/28
Nbproc: 1
Process_num: 1
Pid: 6478
Uptime: 0d 0h51m11s
Uptime_sec: 3071
Memmax_MB: 0
Ulimit-n: 2061
Maxsock: 2061
Maxconn: 1024
Maxpipes: 0
CurrConns: 1
PipesUsed: 0
PipesFree: 0
Tasks: 3
Run_queue: 1
node: kim.mysite.com
description:

# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,
mySite-webfarm,FRONTEND,,,0,1,3000,2,1478,876,0,0,0,,,,,OPEN,,,,,,,,,1,1,0,,,,0,0,0,1,,,,0,0,2,0,0,0,,0,1,2,,,
mySite-webfarm,realhost,0,0,0,1,,2,1478,876,,0,,0,0,0,0,UP,1,1,0,0,0,3071,0,,1,1,1,,2,,2,0,,1,L4OK,,0,0,0,2,0,0,0,0,,,,0,0,
mySite-webfarm,BACKEND,0,0,0,1,3000,2,1478,876,0,0,,0,0,0,0,UP,1,1,0,,0,3071,0,,1,1,0,,2,,1,0,,1,,,,0,0,2,0,0,0,,,,,0,0,
ease-up,BACKEND,0,0,0,0,0,0,0,0,0,0,,0,0,0,0,UP,0,0,0,,0,3071,0,,1,2,0,,0,,1,0,,0,,,,0,0,0,0,0,0,,,,,0,0,

0x2821f800: proto=unix_stream ts=09 age=0s calls=2 rq[f=c08200h,l=41,an=00h,rx=1d,wx=,ax=] rp[f=008002h,l=1146,an=00h,rx=,wx=,ax=] s0=[7,8h,fd=7,ex=] s1=[7,0h,fd=-1,ex=] exp=1d

# table: mySite-webfarm, type: 0, size:1048576, used:1

# table: mySite-webfarm, type: 0, size:1048576, used:1
0x2823c110: key=93.86.32.57 use=0 exp=587042 gpc0=0 conn_rate(10000)=1


$ echo "show info;show stat;show sess;show table;show table mySite-webfarm" | socat stdio unix-connect:/tmp/haproxy.sock > haproxy-debug.log

I'm just thinking about something else : there are basically two things
that change with the OS :

1) polling system

you may try to disable kqueue by adding "nokqueue" in the global section.
I don't think it's the issue because kqueue has not changed between 1.4
and 1.5 and there are some happy users of 1.4 on FreeBSD/OpenBSD.
Will try later today.
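For reference, the "nokqueue" suggestion above amounts to a single extra line in the global section quoted earlier. A sketch only (keep your other settings as they are):

```
global
        log     127.0.0.1       daemon  debug
        maxconn 1024
        nokqueue
        chroot /var/chroot/haproxy2
```

With this keyword set, haproxy falls back to another poller (e.g. poll/select) instead of kqueue on BSD systems.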


2) struct sizes

the pool allocator merges structs of similar sizes in the same pools. In
the past it has already happened that an uninitialized member that was
always zero caused no trouble on most platforms but caused crashes on
other ones due to it containing data from another use. You can check
pool sizes by starting haproxy in debug mode then issuing a kill -QUIT
on it :

  terminal1$ haproxy -db -f $file.cfg
  terminal2$ killall -QUIT haproxy

Haproxy will then dump all of its pool statistics to stderr. You don't
actually need to do that in production; you can do it on a test machine,
because the output depends only on the binary itself and not on the
environment.
Done, result:
 /usr/local/sbin/haproxy.new -f /usr/local/etc/haproxy.test -db
Dumping pools usage.
  - Pool pipe (16 bytes) : 0 allocated (0 bytes), 0 used, 2 users [SHARED]
  - Pool sig_handler (32 bytes) : 5 allocated (160 bytes), 5 used, 1 users [SHARED]
  - Pool capture (64 bytes) : 0 allocated (0 bytes), 0 used, 1 users [SHARED]
  - Pool task (80 bytes) : 4 allocated (320 bytes), 3 used, 2 users [SHARED]
  - Pool hdr_idx (832 bytes) : 1 allocated (832 bytes), 0 used, 2 users [SHARED]
  - Pool session (960 bytes) : 1 allocated (960 bytes), 0 used, 1 users [SHARED]
  - Pool requri (1024 bytes) : 1 allocated (1024 bytes), 0 used, 1 users [SHARED]
  - Pool buffer (16480 bytes) : 2 allocated (32960 bytes), 0 used, 1 users [SHARED]
Total: 8 pools, 36256 bytes allocated, 400 used.



And yeah, it died with the same socks error message as yesterday.
(The server was hit by 30-40 req/s during this time; it died after ~30 minutes.)

I noticed that the same error message can be found in two places. Could
you please adapt them both so that they also dump the FD value :

In src/session.c around line 215, please change :

  Alert("accept(): cannot set the socket in non blocking mode. Giving up\n");

with :

  perror("fcntl");
  Alert("session_accept(): cannot set the socket %d in non blocking mode. Giving up\n", cfd);

And in src/frontend.c, around line 89, replace the same line with :

  perror("setsockopt");
  Alert("frontend_accept(): cannot set the socket %d in non blocking mode. Giving up\n", cfd);

I'm almost sure it's frontend_accept() that returns the error, and I'm
interested in knowing the reported file descriptor which probably is
buggy, as well as the errno code.
Will modify the code and recompile before testing through socks.


Thanks a lot for what you can do !
Willy



