This just hit us again on a different set of load balancers: if the listen
backlog on a domain socket overflows during a graceful reload, haproxy
deletes the domain socket entirely and becomes unreachable through it.
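
One way to hit the overflow on purpose, without waiting for real load, is to fill a domain socket's backlog while nothing is accepting on it (as during the window where the path already points at the new process but it isn't serving yet). The helper below is a hypothetical repro aid, not haproxy code; the name fill_unix_backlog is made up, and it assumes Linux, where a non-blocking connect() to a unix stream listener with a full backlog fails with EAGAIN.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical repro helper (Linux semantics assumed): open non-blocking
 * connections to `path` until connect() fails with EAGAIN, meaning the
 * listener's backlog is full, or until `max` connections are open.
 * Connected fds go into fds[0..n-1] and are left open; the caller owns
 * and closes them. *full is set to 1 only if EAGAIN was observed.
 * Returns n, the number of connections opened. */
static int fill_unix_backlog(const char *path, int *fds, int max, int *full)
{
	struct sockaddr_un addr;
	int n = 0;

	*full = 0;
	if (strlen(path) >= sizeof(addr.sun_path))
		return 0;
	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);

	while (n < max) {
		int fd = socket(AF_UNIX, SOCK_STREAM, 0);
		if (fd < 0)
			break;
		if (fcntl(fd, F_SETFL, O_NONBLOCK) < 0 ||
		    connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
			*full = (errno == EAGAIN); /* read errno before close() */
			close(fd);
			break;
		}
		fds[n++] = fd;
	}
	return n;
}
```

Run against a listener that never calls accept(), this returns once the kernel starts refusing new connections, after which any further client sees the same failure that the reload logic would.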

On Tue, Feb 21, 2017 at 6:47 PM, James Brown wrote:

> Under load, we sometimes see HAProxy delete a bound unix domain socket
> entirely after a reload.
>
> The "bad flow" looks something like the following:
>
>    - haproxy is running as pid A, bound to /var/run/domain.sock (via a
>    bind line in a frontend)
>    - we run `haproxy -sf A`, which starts a new haproxy as pid B
>    - pid B binds to /var/run/domain.sock.B
>    - pid B renames /var/run/domain.sock.B to /var/run/domain.sock (in
>    uxst_bind_listener), so the path now refers to pid B's socket
>    - in the meantime, there are a zillion connections to
>    /var/run/domain.sock, and pid B hasn't finished starting up, so nothing
>    is accepting them and the backlog is exhausted
>    - pid B signals pid A to shut down
>    - pid A runs the destroy_uxst_socket function and tries to connect to
>    /var/run/domain.sock (by now pid B's socket) to see if it's still in
>    use. The connection fails because the backlog is full, so pid A
>    unlinks /var/run/domain.sock. Everything is sad forever now.
>
> I'm thinking about just commenting out the call to destroy_uxst_socket,
> since this is all on a tmpfs and we don't really care if stale sockets are
> leaked when/if we change configuration in the future. Arguably, the real
> fix is to never overflow the listen socket in the first place; I'm also
> thinking about binding to a TCP port on localhost and using that for the
> few seconds it takes to reload (we can't use TCP all the time, since we'd
> run out of ephemeral ports to 127.0.0.1). Either way, it still seems wrong
> for haproxy to unlink the socket.
>
> This has proven extremely irritating to reproduce, since it only happens
> when there's enough load to fill the socket's backlog between the time pid
> B starts up and the time pid A shuts down. Still, I'm pretty confident that
> the sequence above is what's happening: every so often the domain socket is
> missing after a reload, and this code path fits.
>
> Our configs are quite large, so I'm not reproducing them here. We bind to a
> domain socket at all because we run two sets of haproxies: one in
> multi-process mode doing TCP-mode SSL termination, which forwards over a
> domain socket to a single-process haproxy that applies all of our actual
> config.
>
> --
> James Brown
> Systems Engineer
>
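
The rename step in the quoted flow (pid B binding to /var/run/domain.sock.B and moving it over /var/run/domain.sock) can be sketched in C roughly as follows. This is not haproxy's uxst_bind_listener; the helper name bind_unix_atomically and the "<path>.<pid>" temporary name are assumptions for illustration. What it shows is that rename(2) atomically repoints the path at the new socket while the old listener keeps running on an inode that no longer has a name, so anything the old process later does to the path, including unlink(), lands on the new process's socket.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: create a listening unix socket at `path` with no
 * window where `path` is missing. It binds to "<path>.<pid>" first, then
 * rename(2)s that file over `path`, atomically replacing any old socket
 * file. Returns the listening fd, or -1 on error. */
static int bind_unix_atomically(const char *path, int backlog)
{
	struct sockaddr_un addr;
	char tmp[sizeof(addr.sun_path)];
	int fd, n;

	n = snprintf(tmp, sizeof(tmp), "%s.%ld", path, (long)getpid());
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1; /* temporary name would not fit in sun_path */

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	memcpy(addr.sun_path, tmp, (size_t)n + 1);

	unlink(tmp); /* remove a leftover temp file from an earlier crash */
	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, backlog) < 0 ||
	    rename(tmp, path) < 0) {
		unlink(tmp);
		close(fd);
		return -1;
	}
	return fd;
}
```

Calling it twice on the same path mimics a reload: a client connecting afterwards is queued on the second listener only, and the first listener never sees it.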

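As a narrower alternative to commenting out destroy_uxst_socket entirely, the cleanup could unlink the path only when a probe proves nothing is listening on it. The sketch below assumes Linux semantics: a non-blocking connect() to a unix stream listener whose backlog is full fails with EAGAIN, while a socket file with no bound listener fails with ECONNREFUSED. Other systems may not make that distinction (some BSD-derived kernels report a full backlog as ECONNREFUSED), so this is not portable. probe_unix_socket and unlink_if_stale are hypothetical names, not haproxy functions.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical classification of a probe result. */
enum probe_result { PROBE_ALIVE, PROBE_DEAD, PROBE_UNKNOWN };

/* Probe `path` with a non-blocking SOCK_STREAM connect() (Linux assumed).
 * Only ECONNREFUSED, a socket file with no listener bound to it, counts as
 * proof the socket is dead. EAGAIN means a listener exists but its backlog
 * is full, so it counts as alive. A successful probe briefly leaves one
 * connection in the listener's queue. */
static enum probe_result probe_unix_socket(const char *path)
{
	struct sockaddr_un addr;
	int fd, ret, err;

	if (strlen(path) >= sizeof(addr.sun_path))
		return PROBE_UNKNOWN;

	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd < 0)
		return PROBE_UNKNOWN;
	if (fcntl(fd, F_SETFL, O_NONBLOCK) < 0) {
		close(fd);
		return PROBE_UNKNOWN;
	}

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);

	ret = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
	err = errno; /* only meaningful when ret < 0 */
	close(fd);

	if (ret == 0 || err == EAGAIN)
		return PROBE_ALIVE;
	if (err == ECONNREFUSED)
		return PROBE_DEAD;
	return PROBE_UNKNOWN; /* ENOENT, EACCES, ...: leave the path alone */
}

/* Hypothetical cleanup: unlink `path` only when the probe proves it dead.
 * Returns 1 if unlinked, 0 if left in place, -1 if unlink() failed. */
static int unlink_if_stale(const char *path)
{
	if (probe_unix_socket(path) != PROBE_DEAD)
		return 0;
	return unlink(path) == 0 ? 1 : -1;
}
```

This still leaves a window between the probe and unlink() in which the path can change hands, so it narrows the race rather than closing it; not unlinking at all avoids it entirely.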


-- 
James Brown
Engineer
