Re: ip_nonlocal_bind=1 set but sometimes get "cannot bind socket" on reload (-sf)

Chris Riley Fri, 30 Oct 2015 09:30:01 -0700

Hi Willy,

Thanks for your quick reply.


> It should work better but will very likely hide the root cause. I suspect
> you'll find two processes running after a reload because the old one
> doesn't stop then.

Yep, that's exactly what I'm seeing with 3.10. I've got a bunch of haproxy
processes stacked up, all with an -sf flag and each being passed the PID of
the previous process. The only one not in the process list is the initial
haproxy instance created when 'service haproxy start' was first run.

lb-01 ~ [qa] # ps ax | grep hap
 2834 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2822
 2871 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2859
 2883 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2871
 2910 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2896
 2922 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2910
 2947 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2934
 2959 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2947
 2962 pts/1    S+     0:00 grep --colour=auto hap
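
A rough way to spot this pattern automatically is to list every haproxy PID that is still running even though a newer process was started with -sf pointing at it. The sketch below is only an illustration: the function name stale_sf_targets is made up, and it assumes unwrapped "pid args" lines on stdin such as those printed by `ps -e -o pid= -o args=`.

```shell
# Hypothetical helper (name is an assumption, not part of haproxy or the
# init script): print PIDs of haproxy processes that are still alive
# although another haproxy was started with "-sf <pid>" targeting them.
# Expects "pid args..." lines on stdin, one process per line.
stale_sf_targets() {
    awk '
        $2 ~ "(^|/)haproxy$" {
            running[$1] = 1
            for (i = 3; i < NF; i++)
                if ($i == "-sf")
                    # -sf may be followed by several PIDs
                    for (j = i + 1; j <= NF && $j ~ /^[0-9]+$/; j++)
                        targeted[$j] = 1
        }
        END {
            for (pid in targeted)
                if (pid in running)
                    print pid
        }
    ' | sort -n
}

# Usage sketch:
#   ps -e -o pid= -o args= | stale_sf_targets
```

Fed the process list above (PID and command line only), it would report 2871, 2910 and 2947: three processes that were asked to finish but never went away.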


> That would confirm the possibility that the signal is not sent at all,
> or at least not to the right process. Could you check the exact command
> that is started, to ensure the pids are correct (or present at all) ?
> Can you also try by hand to first send SIGUSR1 to the old process,
> then perform the reload, then send SIGTTOU by hand to the old one ?
> If it works, it would confirm an issue with the ability to send a
> signal to the old process from the new one.

'service haproxy start' invokes this:

daemon $exec -D -f /etc/$prog/$prog.cfg -p /var/run/$prog.pid

which produces:

/usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid

ps ax shows:

2822 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid

cat /var/run/haproxy.pid shows:
2822

'service haproxy reload' invokes this:

$exec -D -f /etc/$prog/$prog.cfg -p /var/run/$prog.pid -sf $(cat /var/run/$prog.pid)

which produces:

2834 ?        Ss     0:00 /usr/sbin/haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf 2822

cat /var/run/haproxy.pid then shows:
2834
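
Since the reload line trusts whatever is in the pidfile, one defensive variation would be to check that the file really names a live process before handing it to -sf. This is only a sketch under assumptions: live_pids is a hypothetical helper, not something in the stock init script, and it relies on `kill -0`, which tests for existence without delivering a signal.

```shell
# Hypothetical helper (not part of the stock init script): print the PIDs
# listed in a pidfile that still refer to a live process. Returns non-zero
# if the file is unreadable or none of the listed PIDs is alive.
live_pids() {
    pidfile=$1
    [ -r "$pidfile" ] || return 1
    found=1
    for pid in $(cat "$pidfile"); do
        case $pid in
            ''|*[!0-9]*) continue ;;    # skip anything that is not a number
        esac
        if kill -0 "$pid" 2>/dev/null; then
            printf '%s\n' "$pid"
            found=0
        fi
    done
    return $found
}

# Usage sketch inside a reload function:
#   old=$(live_pids /var/run/$prog.pid) || echo "no live $prog process found" >&2
#   $exec -D -f /etc/$prog/$prog.cfg -p /var/run/$prog.pid -sf $old
```

This would not explain why an old process ignores the signal, but it would at least rule out an empty or stale pidfile as the reason -sf has nothing to act on.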

I'll try manually sending SIGUSR1 and SIGTTOU as you suggested and see if I
can determine what's happening.
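
For the manual test, a small wrapper that sends one signal and then watches whether the target goes away could make the result easier to read. The following is a minimal sketch; signal_and_wait and its arguments are made up for illustration, and it deliberately says nothing about which signal to send or in what order.

```shell
# Hypothetical helper: send SIGNAL to PID, then poll for up to TIMEOUT
# seconds (default 5) and report whether the process has exited.
# Returns 0 if it exited, 1 if it is still running, 2 if it could not
# be signalled at all (wrong PID or insufficient permissions).
signal_and_wait() {
    pid=$1 sig=$2 timeout=${3:-5}
    kill -s "$sig" "$pid" 2>/dev/null || { echo "cannot signal $pid"; return 2; }
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || { echo "$pid exited"; return 0; }
        sleep 1
        timeout=$((timeout - 1))
    done
    echo "$pid still running"
    return 1
}

# Usage sketch, with the PID from the example above:
#   signal_and_wait 2822 USR1 10
```

A return value of 2 would point at the signal never reaching the process, which is the case Willy wants to rule out.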

Any chance that this issue is related to, or the same as, this one?

https://github.com/haproxy/haproxy/issues/48

Regards,
Chris

On Fri, Oct 30, 2015 at 11:50 AM, Willy Tarreau <[email protected]> wrote:

> Hi Chris,
>
> On Fri, Oct 30, 2015 at 11:18:30AM -0400, Chris Riley wrote:
> > Hi Willy,
> >
> > The permissions were one of the first things I checked. consul-template
> > runs as root in order to be able to reload/restart the daemon and it's using
> > the same init script that the system uses on startup. Not all of the
> > reloads fail, the first few initial ones are successful. What's odd is that
> > the behavior goes away when I fail over all IPs to one server and
> > set net.ipv4.ip_nonlocal_bind=0.
>
> That's really strange.
>
> > After that all reloads are successful, no
> > matter how many times in a row reload is called. The issue remains at bay
> > even after failing half of the IPs back over to the secondary server and
> > setting net.ipv4.ip_nonlocal_bind=1 again. That is until the servers
> > reboot, then the behavior returns. Vincent got me thinking about the 2.6.32
> > kernel that is part of CentOS 6.4. I'm wondering if
> > net.ipv4.ip_nonlocal_bind behaves oddly in 2.6.x with respect to the status
> > of existing socket file descriptors.
>
> No, nonlocal_bind hasn't changed for a while, what was brought later (3.9)
> was SO_REUSEPORT which haproxy always uses, so that makes it easier to
> rebind regardless of the presence of an old process.
>
> > I'm going to try kernel 3.10 from
> > CentOS 7 to see if I can reproduce it in 3.10 in order to rule out or
> > confirm an issue with the kernel.
>
> It should work better but will very likely hide the root cause. I suspect
> you'll find two processes running after a reload because the old one
> doesn't stop then.
>
> > However, I'm not sure that's the issue. When a reload fails there is
> > nothing in the log file that indicates that haproxy saw SIGTTOU or SIGUSR1
> > ("Pausing %s %s." and "Stopping %s %s in %d ms."). I can reproduce this
> > behavior if I don't provide a PID to -sf. When looking at the code
> > in proxy.c it looks like pause_proxy() is either not being called
> > by pause_proxies in haproxy.c (due to the missed SIGTTOU) or in
> > pause_proxy() the proxy state check is returning 1 at the top of the
> > pause_proxy() function. I'm going to add some additional logging statements
> > to see if I can isolate what's happening.
>
> That would confirm the possibility that the signal is not sent at all,
> or at least not to the right process. Could you check the exact command
> that is started, to ensure the pids are correct (or present at all) ?
> Can you also try by hand to first send SIGUSR1 to the old process,
> then perform the reload, then send SIGTTOU by hand to the old one ?
> If it works, it would confirm an issue with the ability to send a
> signal to the old process from the new one.
>
> Regards,
> Willy
>
>
