Hi guys,

On Tue, May 07, 2019 at 10:40:17PM +0200, William Lallemand wrote:
> Hi Patrick,
> 
> On Tue, May 07, 2019 at 02:23:15PM -0400, Patrick Hemmer wrote:
> > So with the prevalence of the issues lately where haproxy goes
> > unresponsive and consumes 100% CPU, I wanted to see what the thoughts
> > were on implementing systemd watchdog functionality.
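
For reference, the systemd side of such a watchdog is fairly small: with
Type=notify and WatchdogSec= set in the unit file, the master only has to
ping systemd periodically and stop pinging once it considers the service
dead. A rough sketch, assuming libsystemd is linked in; all_workers_healthy()
is only a placeholder, not an existing function:

  #include <stdint.h>
  #include <time.h>
  #include <systemd/sd-daemon.h>

  /* Hypothetical: returns 0 when every worker answered a health probe. */
  extern int all_workers_healthy(void);

  void watchdog_loop(void)
  {
      uint64_t usec = 0;

      /* Only active when WatchdogSec= is set; usec receives the timeout. */
      if (sd_watchdog_enabled(0, &usec) <= 0)
          return;

      for (;;) {
          struct timespec ts = { .tv_sec  = usec / 2 / 1000000,
                                 .tv_nsec = (usec / 2 % 1000000) * 1000 };

          if (all_workers_healthy() == 0)
              sd_notify(0, "WATCHDOG=1");   /* tell systemd we're alive */
          /* otherwise stop pinging: once WatchdogSec= expires, systemd
           * kills the service and, with Restart=on-failure, restarts it.
           */
          nanosleep(&ts, NULL);             /* ping at half the timeout */
      }
  }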

First, let me tell you I'm also all for a watchdog system. For me, an
unresponsive process is the worst thing that can ever happen, because
it's the hardest one to detect and it takes time to fix. This is also
why I've been working on lockup detection for the worker processes,
able to produce some context info and possibly an analysable core dump.
I expect to have it for 2.0-final; it's important to accelerate the
discovery of such painful bugs and to fix them early if any remain.
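
To give an idea of the general shape of such a mechanism (a sketch only,
not the actual code): each worker's event loop bumps a heartbeat counter,
and a periodic timer aborts the process when the counter stops moving,
which leaves a core to analyse:

  #include <signal.h>
  #include <stdlib.h>
  #include <unistd.h>

  static volatile sig_atomic_t heartbeat;  /* bumped by the event loop */
  static sig_atomic_t last_seen = -1;

  static void lockup_check(int sig)
  {
      (void)sig;
      if (heartbeat == last_seen)
          abort();              /* stuck: leave an analysable core dump */
      last_seen = heartbeat;
      alarm(2);                 /* re-arm; 2s is an arbitrary period */
  }

  void lockup_detect_init(void)
  {
      signal(SIGALRM, lockup_check);
      alarm(2);
  }

  /* in the worker's event loop, once per iteration: heartbeat++; */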

> The master uses a special backend, invisible to the user, which contains one
> server per worker, using the worker's socketpair as the address. They are
> always connected and can communicate. This architecture allows the master to
> forward commands to the CLI of each worker.
> 
> One of my ideas was to do the equivalent of adding a "check" keyword on each
> of these server lines. We would have to implement a special check which sends
> a CLI command and waits for its response.
> 
> If one of the servers does not respond, we could execute the exit-on-failure
> procedure.
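
As a rough illustration of such a check (outside of haproxy's real check
infrastructure, and "show info" is only an example of a cheap command):
write a CLI command on the worker's socketpair and consider the worker
dead if nothing comes back within a timeout:

  #include <poll.h>
  #include <string.h>
  #include <unistd.h>

  /* Returns 0 if the worker answered within timeout_ms, -1 otherwise. */
  int worker_cli_check(int sockpair_fd, int timeout_ms)
  {
      static const char cmd[] = "show info\n";   /* any cheap CLI command */
      char buf[256];
      struct pollfd pfd = { .fd = sockpair_fd, .events = POLLIN };

      if (write(sockpair_fd, cmd, strlen(cmd)) < 0)
          return -1;

      if (poll(&pfd, 1, timeout_ms) <= 0)        /* timeout or error */
          return -1;

      return read(sockpair_fd, buf, sizeof(buf)) > 0 ? 0 : -1;
  }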

I'd like us to keep a trace of the failed process: send it SIGXCPU or
SIGABRT, and kill the other ones cleanly.
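
The sequence could look roughly like this (the helper and its arguments
are hypothetical, none of this is existing code):

  #include <signal.h>
  #include <stdlib.h>
  #include <sys/types.h>

  /* Hypothetical helper called by the master when a worker is stuck. */
  void exit_on_stuck_worker(pid_t stuck, const pid_t *others, int nb_others)
  {
      int i;

      kill(stuck, SIGABRT);          /* keep a trace: core dump the culprit */
      for (i = 0; i < nb_others; i++)
          kill(others[i], SIGTERM);  /* shut the healthy ones down cleanly */
      exit(EXIT_FAILURE);            /* non-zero: Restart=on-failure applies */
  }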

> > The last idea would be to have the watchdog watch the master only, and
> > have the master watch the workers in turn. If a worker stops responding,
> > the master would restart just that one worker.
> > 
> 
> It's not a good idea to restart only one worker; that's not possible with
> the current architecture and would be too complicated. In my opinion it's
> better to kill everything so systemd can restart properly with
> Restart=on-failure; this is what is done when one of the workers segfaults,
> for example.

I totally agree. In the past, when nbproc was used a lot, we had many
reports of people getting caught by one process dying once in a while,
to the point where there were not enough processes left to handle the
traffic, making the service barely responsive but still up. That gives
a terrible external image of a hosted service, whereas a dead process
would have been detected, failed over or restarted.

Cheers,
Willy
