On Tue, Jul 19, 2011 at 1:23 AM, Willy Tarreau <[email protected]> wrote:
> On Mon, Jul 18, 2011 at 06:33:33PM -0400, Jonathan Simms wrote:
>> Willy,
>>
>> I looked at the previous bug report here
>> http://comments.gmane.org/gmane.comp.web.haproxy/5439
>> based on 2.6.38 and checked the ubuntu 2.6.32 kernel for the offending patch
>> <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=c191a836a908d1dd6b40c503741f91b914de3348>
>> and I didn't see it applied to the kernel I'm using.
>
> OK then that's already a good thing, but we have to find out what
> could cause a similar issue on a specific distro !
>
>> Is there any other explanation, or some information I can find for you?
>
> Do all the listeners have the same issue or only a few ? And did the
> config change between the working one and the reloaded one ? What could
> cause the same issue to happen is a copy-paste of a "bind" line in the
> same file, which would cause a conflict when trying to bind the second
> one.
>
> Also, please check that you don't have more than one process running
> when the issue appears. It could be that another old process still holds
> the ports open and does not get the signal to release them. But this would
> be surprising considering that your config only allows one process.
>
> In your trace below, you only have the expected part :
>
> 17492 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 5
> 17492 fcntl(5, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
> 17492 setsockopt(5, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 17492 setsockopt(5, SOL_SOCKET, 0xf /* SO_??? */, [1], 4) = -1 ENOPROTOOPT (Protocol not available)
> 17492 bind(5, {sa_family=AF_INET, sin_port=htons(6379), sin_addr=inet_addr("0.0.0.0")}, 16) = -1 EADDRINUSE (Address already in use)
> 17492 close(5) = 0
> 17492 kill(13512, SIGTTOU) = 0
>
> The first bind() fails, then the new process sends a SIGTTOU signal to the
> old one asking it to release the ports, then haproxy tries to bind again for
> a certain time, and only complains if it fails for too long. Ideally, a full
> strace of the issue could help, but please take it with "strace -tt" so that
> we get the timers.
>
> Regards,
> Willy

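For reference, the sequence described in the quoted text (bind fails, ask the old process to release the port, retry for a while, then give up) can be sketched roughly as below. This is only an illustration in Python, not haproxy's actual C code: the function name `bind_with_retry`, the `ask_old_to_release` callback (standing in for `kill(old_pid, SIGTTOU)`), and the retry count and delay are all assumptions of mine.

```python
import errno
import socket
import time


def bind_with_retry(port, ask_old_to_release, attempts=5, delay=0.05):
    """Bind a TCP listener; on EADDRINUSE, ask the old owner to let go and retry.

    `ask_old_to_release` is a hypothetical callback standing in for
    kill(old_pid, SIGTTOU) in the trace. Retry count and delay are made up.
    """
    for attempt in range(attempts):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind(("127.0.0.1", port))
            sock.listen(16)
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE:
                raise  # some other failure: do not mask it
            if attempt == 0:
                # First conflict: ask the previous owner to release the port.
                ask_old_to_release()
            time.sleep(delay)
    # Only complain once every retry has failed.
    raise OSError(errno.EADDRINUSE, "port still in use after retries")
```

If the old process never receives or acts on the request, every retry fails and the error is only reported at the end, which would match a reload that complains after a delay rather than immediately.
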
Willy,

OK, I think the issue is not related to Ubuntu or haproxy, but rather to the execution environment under which I was attempting to restart haproxy. I'm working on a cluster control program that can reconfigure and restart haproxy so that we can do rolling builds in an automated fashion. We're using Ruby's EventMachine library, and I was executing the init.d script via its popen method. When I rewrote my restarter to instead fork(), clean the environment, reset all signal handlers, and exec the -sf reload, things started working properly.

I'm really at a loss to explain *why* using EventMachine's popen caused the issue; I don't have enough knowledge in that realm to diagnose what's happening. I can only guess that it's "something in the environment" causing the issue (vague enough for you? :) ). EventMachine is a framework for writing non-blocking, asynchronous (single-threaded), event-driven services, so I could imagine that there might be some kind of conflict (a signal handler, maybe?) caused by that. I know, it's a weak explanation.

The only changes between the configurations being reloaded were 'disabled' keywords added to the appropriate server lines for the appropriate backends, so it wasn't that I was changing bind clauses.

Since this is such a special case, it probably doesn't have implications for most other haproxy users. However, if it would be useful for you to see (or if you're curious), I'd be happy to run an strace of the process and send the results to you.

Anyway, thank you very much for taking the time to reply, and I'm sorry to have led us on something of a wild goose chase :)

Regards,
Jonathan
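
P.S. In case it helps anyone who lands on this thread, the fork / clean environment / reset signal handlers / exec approach looks roughly like the sketch below. It is in Python rather than the Ruby I actually used. The helper name `spawn_clean` and the minimal PATH are assumptions made for illustration, so treat it as a sketch and not as the real restarter.

```python
import os
import signal


def spawn_clean(argv, env=None):
    """Fork, reset signal state in the child, then exec argv with a minimal env.

    Returns the child's exit code, or the negated signal number if it was
    killed by a signal.
    """
    if env is None:
        # Assumed minimal environment; nothing is inherited from the parent.
        env = {"PATH": "/usr/sbin:/usr/bin:/sbin:/bin"}
    pid = os.fork()
    if pid == 0:
        try:
            # Restore the default disposition of every signal we may change.
            for sig in signal.valid_signals():
                if sig in (signal.SIGKILL, signal.SIGSTOP):
                    continue
                try:
                    signal.signal(sig, signal.SIG_DFL)
                except (OSError, ValueError):
                    pass
            # Clear the signal mask in case the parent had signals blocked.
            signal.pthread_sigmask(signal.SIG_SETMASK, set())
            os.execvpe(argv[0], argv, env)
        finally:
            # Reached only if exec failed; never return into the parent's code.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

The reload command itself would simply be passed as `argv`, for example the haproxy binary with `-f <config>` and `-sf <old pid>`; the exact paths depend on the installation. Note that the parent waits for the child here, so this form suits a command that exits or daemonizes on its own.
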

