Hi all,

On Wed, Jun 11, 2014 at 10:13:06AM +0200, Lukas Tribus wrote:
> Hi John,
>
> > Hi, we've been using haproxy 1.5 for quite a while, and haven't really
> > run into any major issues until we upgraded from dev24 to dev25.
> > Starting with dev25 we saw an issue where haproxy doesn't seem to be
> > reliably closing connections after sending a response if the client
> > uses keepalive. The same happens with dev26.
>
> Ok.
>
> > Unfortunately I'm unable to replicate the issue on our test servers,
> > but what we see happen on production is that the number of open
> > connections slowly rises over time
>
> We will have to find some non-intrusive way to debug this in production
> then.
>
> > and the haproxy processes use more and more memory until the OOM killer
> > starts killing them.
>
> Given your config this looks like frontend connections to me, and your
> maxconn values are too high (the OOM killer should never intervene). How
> much RAM does your box have (and is maxconn 100000 the number you use in
> production)?
>
> > If we reload haproxy, the old process will then stick around until it's
> > manually killed.
>
> This gives us a good opportunity to troubleshoot with an old, stuck process
> while production traffic is handled by a new process.
>
> Please:
> - use dev26 (making sure you have all recent bugfixes) and provide
>   "./haproxy -vv" output
> - reproduce the issue (let haproxy accumulate some "broken" sessions)
> - change the stats socket path in the config file
> - reload haproxy
> - wait for ~3 minutes to time out the remaining non-broken sessions
> - connect to the stats socket of the old (!) process and provide
>   the outputs [1]:
>   echo "show info;show stat;show pools;show sess" | socat stdio unix-connect:/var/run/haproxy.sock
> - attach GDB to the old process (triple check that it's not the new
>   process!), post the output (gdb </path/to/haproxy> <pid>)
>
> You should probably set timeout http-keep-alive [2] and timeout
> http-request [3], but let's find the real culprit here first.

Marcus (CCed) reported exactly the same issue to me a few days ago, and it
went away when he added the timeout http-keep-alive.

I suspect it's another nasty side effect of the "improvement" we made to
make the CD vs SD flags more accurate, but I could be wrong :-/

I'll try here. I suspect it's easy to reproduce with simply the 3 basic
timeouts and a single server in HTTP mode.

Willy
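
For reference, a minimal sketch of the kind of setup discussed above, not the reporter's actual configuration. It assumes "the 3 basic timeouts" means connect, client and server; the section names, addresses and timeout values are all hypothetical. Removing the two http-* timeout lines would give the bare configuration suspected of triggering the issue.

```
# Hypothetical test configuration: names, ports and values are assumptions.
defaults
    mode http
    # Assumed meaning of "the 3 basic timeouts"
    timeout connect 5s
    timeout client  30s
    timeout server  30s
    # Workaround reported in this thread: bound how long an idle
    # keep-alive connection waits for the next request
    timeout http-keep-alive 10s
    # Bound how long a client may take to send a complete request
    timeout http-request 10s

frontend fe_test
    bind :8080
    default_backend be_test

backend be_test
    # Single server, as in the suspected reproduction case
    server srv1 127.0.0.1:8000
```

The file can be syntax-checked without starting the proxy by running haproxy -c -f on it.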

