Re: Problem with binding UNIX listeners before checking PID

Eric Wong Sun, 03 Oct 2010 21:26:40 -0700

Jordan Ritter <[email protected]> wrote:
> Howdy.
> 
> I have lately been frustrated by the following use case:
> 
>       1. Run nginx/unicorn in production, listening on a UNIX socket
>       with a defined pid file.  Things run good.
>       2. Someone pushes code, unicorn restarts just fine, workers are
>       all up and running.
>       3. But someone is suspicious, or maybe they forget which
>       box they're logged into, so they invoke unicorn manually.
>       Same directory, same settings.
> 
>       4. It looks like the pid file check kicked in, because unicorn
>       refuses to boot - hey, it's already running, bugger off.  great.
>       5. BUT, this happened *after* the listener processing: the
>       manually-invoked unicorn unlinks the real unicorn master's unix
>       listener, so it's left dead in the water and everybody loses.
> 
> unicorn master doesn't know its listener is actually gone (but lsof shows
> open unix socket fd, netstat shows unix socket still present, so cursory
> investigation is misleading), but nginx keeps spewing ECONNREFUSEDs
> because the unix socket it's hitting belongs to that accidental unicorn
> instance that already decided not to stick around.
> 
> I think this is effectively about a behavioral difference in
> Unicorn::SocketHelper#bind_listen around the handling of UNIX vs. TCP
> sockets (this doesn't happen with TCP sockets because there's no
> unlink/disconnect step), and the fact that HttpServer#start evaluates
> the listener config before the PID path/config.
> 
> Now I see comments in and around HttpServer#initialize talking about races
> wrt binding to the listener and whatnot, and being newish to the codebase
> I admit I haven't yet fully absorbed all the considerations at play.
> 
> But I think it's fair to say that killing the listener(s) (in the UNIX
> socket case) before discovering you shouldn't have run in the first place
> (from the PID file) qualifies as buggy/bad/broken behavior.


Hi Jordan,

Thanks for the detailed bug report.  I knew from experience with other
daemons that lingering UNIX sockets caused trouble for some users, but
I failed to take into account the case where a user mistakenly starts
the process twice.
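
For anyone following along, the failure mode can be reproduced outside
of unicorn with plain Ruby sockets.  This is only a sketch of the
sequence you describe, not unicorn code: simulate_double_start,
try_connect and the socket path are names made up for illustration.

```ruby
require 'socket'

# Hypothetical helper: returns :ok, or the Errno class a client sees
# when it tries to connect to the UNIX socket at +path+.
def try_connect(path)
  UNIXSocket.new(path).close
  :ok
rescue SystemCallError => e
  e.class
end

# Hypothetical helper: replays the double-start sequence on +path+ and
# returns [result before, master still open?, result after].
def simulate_double_start(path)
  master = UNIXServer.new(path)     # the real master's listener
  before = try_connect(path)

  File.unlink(path)                 # second instance unlinks the path ...
  intruder = UNIXServer.new(path)   # ... binds a fresh socket there ...
  intruder.close                    # ... then gives up on the pid check

  # The master's descriptor is untouched, but the path now names the
  # intruder's dead socket, so new clients can no longer reach the master.
  [before, !master.closed?, try_connect(path)]
ensure
  master.close if master && !master.closed?
end
```

On Linux I would expect the first connect to succeed and the last one
to fail with ECONNREFUSED while the master's socket is still open,
which matches what lsof and nginx show in your report.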

Yes, getting pid file writing/ordering "right"[1] is very tricky.

> I might suggest simply swapping their processing order in #start, but
> given the complexity of in-place restarts and other race considerations,
> I have doubts solving this would be that easy.

That wouldn't work if pid files weren't in use at all.

> Any thoughts/ideas?

A simpler check would be to use connect(2) (but not make any HTTP request)
to see if the socket is alive.  Patch coming.
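
Roughly what I have in mind, as an untested sketch and not the actual
patch: unix_socket_alive? and bind_unix_listener are hypothetical names
that do not exist in unicorn, and the real change may look different.

```ruby
require 'socket'

# Hypothetical helper: true if something accepts connections on the
# UNIX socket at +path+.  No HTTP request is made; we only connect
# and immediately close.
def unix_socket_alive?(path)
  UNIXSocket.new(path).close
  true
rescue Errno::ECONNREFUSED, Errno::ENOENT
  # ECONNREFUSED: stale socket file with no listener behind it
  # ENOENT: nothing exists at that path
  false
end

# Hypothetical helper: only unlink an existing socket file when nobody
# is listening on it, then bind a new listener at +path+.
def bind_unix_listener(path)
  if File.socket?(path)
    if unix_socket_alive?(path)
      raise Errno::EADDRINUSE, "#{path} already has a live listener"
    end
    File.unlink(path)               # stale leftover, safe to remove
  end
  UNIXServer.new(path)
end
```

This does not depend on a pid file being configured.  There is still a
small window between the connect(2) check and the unlink, so it only
narrows the race rather than closing it.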

[1] - I don't believe there actually is a way to always be right,
      just less bad/broken than the alternatives.
-- 
Eric Wong
_______________________________________________
Unicorn mailing list - [email protected]
http://rubyforge.org/mailman/listinfo/mongrel-unicorn
Do not quote signatures (like this one) or top post when replying
