Hi Chris,
On Fri, Oct 30, 2015 at 11:18:30AM -0400, Chris Riley wrote:
> Hi Willy,
>
> The permissions were one of the first things I checked. consul-template
> runs as root in order to be able to reload/restart daemon and it's using
> the same init script that the system uses on startup. Not all of the
> reloads fail, the first few initial ones are successful. What's odd is that
> the behavior goes away when I failover all IPs to one server and
> set net.ipv4.ip_nonlocal_bind=0.
That's really strange.
> After that all reloads are successful, no
> matter how many times in a row reload is called. The issue remains at bay
> even after failing half of the IPs back over to the secondary server and
> setting net.ipv4.ip_nonlocal_bind=1 again. That is until the servers
> reboot, then the behavior returns. Vincent got me thinking about the 2.6.32
> kernel that is part of CentOS 6.4. I'm wondering if
> net.ipv4.ip_nonlocal_bind behaves oddly in 2.6.x with respect to the status
> of existing socket file descriptors.
No, nonlocal_bind hasn't changed for a while. What was brought later (3.9)
was SO_REUSEPORT, which haproxy always uses, so that makes it easier to
rebind regardless of the presence of an old process.
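
To illustrate what SO_REUSEPORT allows, here is a minimal Python sketch (not
haproxy's code; the helper name bind_reuseport is made up for this example).
It assumes Linux 3.9 or later, and that both sockets set the option before
bind() and belong to the same effective user:

```python
import socket

def bind_reuseport(host, port):
    """Return a listening TCP socket with SO_REUSEPORT set before bind()."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket sharing the port must set the option before binding.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind((host, port))
    s.listen(16)
    return s

def demo():
    """Bind an 'old' listener, then a 'new' one on the very same port."""
    old = bind_reuseport("127.0.0.1", 0)      # port 0: let the kernel pick
    port = old.getsockname()[1]
    new = bind_reuseport("127.0.0.1", port)   # succeeds while 'old' is open
    same_port = new.getsockname()[1] == port
    old.close()
    new.close()
    return same_port
```

Without the option, the second bind() would normally fail with "Address
already in use" while the first listener is still open, which is the case a
2.6.32 kernel leaves you with.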
> I'm going to try kernel 3.10 from
> CentOS 7 to see if I can reproduce it in 3.10 in order to rule out or
> confirm an issue with the kernel.
It should work better but will very likely hide the root cause. I suspect
you'll find two processes running after a reload because the old one doesn't
stop.
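
A quick way to check for a leftover process is to list the pids whose command
name is haproxy. The following is only a sketch: it assumes a Linux /proc
layout, and find_pids_by_name is a hypothetical helper, not an existing tool:

```python
import os

def find_pids_by_name(name, proc_root="/proc"):
    """Return the sorted pids whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the process exited in the meantime, or is unreadable
        if comm == name:
            pids.append(int(entry))
    return sorted(pids)
```

Calling find_pids_by_name("haproxy") right after a reload and again a bit
later shows whether the old pid goes away. Note that an old process finishing
its existing connections may legitimately stay around for a while, and that a
multi-process setup shows several pids anyway.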
> However, I'm not sure that's the issue. When a reload fails there is
> nothing in the log file that indicates that haproxy saw SIGTTOU or SIGUSR1
> ("Pausing %s %s." and "Stopping %s %s in %d ms."). I can reproduce this
> behavior if I don't provide a PID to -sf. When looking at the code
> in proxy.c it looks like pause_proxy() is either not being called
> by pause_proxies in haproxy.c (due to the missed SIGTTOU) or in
> pause_proxy() the proxy state check is returning 1 at the top of the
> pause_proxy() function. I'm going to add some additional logging statements
> to see if I can isolate what's happening.
That would confirm the possibility that the signal is not sent at all,
or at least not to the right process. Could you check the exact command
that is started, to ensure the pids are correct (or present at all)?
Can you also try by hand to first send SIGUSR1 to the old process,
then perform the reload, then send SIGTTOU by hand to the old one?
If it works, it would confirm an issue with the ability to send a
signal to the old process from the new one.
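
If it helps, the pid check and the manual signalling can be scripted. This is
a rough Python sketch under a few assumptions: the pidfile holds one pid per
line, and the helper names (read_pids, is_alive, send_signal) are invented
for this example:

```python
import os
import signal

def read_pids(pidfile):
    """Parse a pidfile containing whitespace-separated pids."""
    with open(pidfile) as f:
        return [int(tok) for tok in f.read().split()]

def is_alive(pid):
    """Probe a pid with signal 0, which checks existence without signalling."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False      # no such process
    except PermissionError:
        return True       # it exists, but we are not allowed to signal it
    return True

def send_signal(pids, sig):
    """Send `sig` to each pid that still exists; return the pids signalled."""
    delivered = []
    for pid in pids:
        if is_alive(pid):
            os.kill(pid, sig)
            delivered.append(pid)
    return delivered
```

For example, send_signal(read_pids("/var/run/haproxy.pid"), signal.SIGTTOU)
returns the list of pids that actually received the signal (the pidfile path
here is an assumption, use whatever your init script passes to -p). An empty
list means the pids given to -sf were stale or missing.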
Regards,
Willy