Hi all,

On Wed, Jun 11, 2014 at 10:13:06AM +0200, Lukas Tribus wrote:
> Hi John,
> 
> 
> 
> > Hi, we've been using haproxy 1.5 for quite a while, and haven't really
> > run into any major issues until we upgraded from dev24 to dev25.
> > Starting with dev25 we saw an issue where haproxy doesn't seem to be
> > reliably closing connections after sending a response if the client
> > uses keepalive. The same happens with dev26.
> 
> Ok.
> 
> 
> 
> > Unfortunately I'm unable to replicate the issue on our test servers,
> > but what we see happen on production is that the number of open
> > connections slowly rises over time
> 
> We will have to find some non-intrusive way to debug this in production
> then.
> 
> 
> 
> > and the haproxy processes use more and more memory until the OOM killer
> > starts killing them.
> 
> Given your config this looks like frontend connections to me, and your maxconn
> values are too high (the OOM killer should never intervene). How much RAM does
> your box have (and is maxconn 100000 the number you use in production)?
> 
> 
> 
> > If we reload haproxy, the old process will then stick around until it's
> > manually killed.
> 
> This gives us a good possibility to troubleshoot with an old, stuck process
> while production traffic is handled by a new process.
> 
> 
> Please:
> - use dev26 (making sure you have all recent bugfixes) and provide
>   "./haproxy -vv" output
> - reproduce the issue (let haproxy accumulate some "broken" sessions)
> - change the stats socket path in the config file
> - reload haproxy
> - wait ~3 minutes for the remaining non-broken sessions to time out
> - connect to the stats socket of the old (!) process and provide
>   the output of [1]:
>   echo "show info;show stat;show pools;show sess" | socat stdio unix-connect:/var/run/haproxy.sock
> - attach GDB to the old process (triple check that it's not the new
>   process!) and post the output (gdb </path/to/haproxy> <pid>)
> 
> 
> You should probably set timeout http-keep-alive [2] and timeout
> http-request [3], but let's find the real culprit here first.

Marcus (CCed) reported exactly the same issue to me a few days ago; the issue
went away when he added timeout http-keep-alive. I suspect it's another
nasty side effect of the "improvement" we made to make the CD vs SD flags
more accurate, but I could be wrong :-/
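
For reference, the workaround amounts to something like the following in the
defaults section. The 10s values are illustrative placeholders, not tuned
recommendations:

    defaults
        mode http
        # illustrative values only, adjust to your traffic
        timeout http-request    10s
        timeout http-keep-alive 10s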

I'll try here; I suspect it's easy to reproduce with just the 3 basic
timeouts and a single server in HTTP mode.
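
A minimal sketch of such a test setup. The section names (fe_test, be_test),
the addresses/ports and the timeout values are hypothetical placeholders, and
neither timeout http-keep-alive nor timeout http-request is set:

    defaults
        mode http
        # only the 3 basic timeouts (placeholder values)
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend fe_test
        bind :8080
        default_backend be_test

    backend be_test
        # single server, hypothetical address
        server srv1 127.0.0.1:8000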

Willy

