On 13.03.2015 23:33, Laurent Bercot wrote:
On 11/03/2015 08:45, Natanael Copa wrote:
With that in mind, wouldn't it be better to have the timer code in
the handler/parser? When no new messages arrive from the pipe
within a given time, the handler/parser just exits.

I've thought about that a bit, to see if there really was value in
making the handler exit after a timeout. And it's a lot more complex
than it appears, because you then get respawner design issues, the
same that appear when you write a supervisor.

Which issues?

What if the handler dies too fast and there are still events in the
queue ?

Should you respawn the handler instantly ?

Spawning the handler is the job of the named pipe supervisor. First
it checks the exit code of the dying handler and spawns a failure script
if the exit was not successful. Then it waits until data arrives in the
pipe (or is still there = poll for reading), and finally spawns a new
handler.

The trick here is to hold the pipe open for reading and writing in
the supervisor. This way you avoid race conditions from recreating new
pipes, and you even catch the situation where an event arrives at the
moment the handler hits its timeout and is dying. Otherwise, the
supervisor does not touch the content transferred through the pipe.
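
A minimal sketch of that supervisor side, not the actual code. It assumes
Linux, where opening a FIFO with O_RDWR succeeds without blocking (POSIX
leaves this undefined). The function names open_fifo_rdwr, wait_for_data
and run_handler are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create the FIFO if needed and open it for reading AND writing.
 * Holding the write end ourselves means readers never see EOF and the
 * FIFO never has to be recreated. (Linux behaviour, undefined in POSIX.) */
int open_fifo_rdwr(const char *path)
{
    if (mkfifo(path, 0600) < 0 && errno != EEXIST)
        return -1;
    return open(path, O_RDWR);
}

/* Poll the FIFO for reading. timeout_ms < 0 waits forever.
 * Returns 1 if data is pending, 0 on timeout, -1 on error. */
int wait_for_data(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int r;

    do {
        r = poll(&p, 1, timeout_ms);
    } while (r < 0 && errno == EINTR);
    if (r < 0)
        return -1;
    return (r > 0 && (p.revents & POLLIN)) ? 1 : 0;
}

/* Spawn one handler with the FIFO as its stdin and wait for it.
 * Returns the waitpid() status, or -1 if fork/wait failed.
 * The supervisor itself never reads the pipe content. */
int run_handler(int fd, char *const argv[])
{
    int status;
    pid_t pid;

    assert(argv && argv[0]);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        dup2(fd, STDIN_FILENO);
        if (fd != STDIN_FILENO)
            close(fd);
        execvp(argv[0], argv);
        _exit(127);             /* exec failed */
    }
    while (waitpid(pid, &status, 0) < 0)
        if (errno != EINTR)
            return -1;
    return status;
}
```

The main loop would then be roughly: wait_for_data(fd, -1), run_handler(),
inspect the status, repeat. An event written while the handler is dying
simply stays in the FIFO and wakes the next poll.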

That's exactly the kind of load you're trying to avoid by having a
(supposedly) long-lived handler. Should you wait for a bit before
respawning the handler ? How long are you willing to delay your
events ?

A bit more checking is planned already. Currently I have a failure
counter and detect when the parser repeatedly dies unsuccessfully, but
maybe we can add a respawn counter which triggers a delay (maybe
increasing) on too many respawns without processing all the pipe data.
When the handler exits and the pipe is empty (poll), the respawn counter
is reset. So if you get two or three fast respawns after the handler
dies (on poll timeout) and there is still more data in the pipe, then
something seems to be wrong, so start adding increasing delays before
respawning. The normal case is that the handler exits due to timeout
and the pipe is empty, so we can reset the counter and have no need to
delay the respawn as soon as new data arrives in the pipe. And when the
respawn counter goes above some limit or the handler dies unsuccessfully,
a failure script is spawned first, with the arguments program name, exit
code or signal, and failure count.
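
The planned respawn policy could be sketched like this. All thresholds
and names (FREE_RESPAWNS, MAX_RESPAWNS, BASE_DELAY_MS, MAX_DELAY_MS,
after_handler_exit) are assumptions for illustration, nothing is fixed
yet.

```c
#include <assert.h>

/* Hypothetical tuning values. */
static const unsigned FREE_RESPAWNS = 3;    /* fast respawns without delay */
static const unsigned MAX_RESPAWNS  = 10;   /* above this: failure script  */
static const unsigned BASE_DELAY_MS = 100;
static const unsigned MAX_DELAY_MS  = 5000;

struct respawn_state {
    unsigned respawns;   /* exits that left data behind in the pipe */
    unsigned failures;   /* successive unsuccessful exits           */
};

enum action { RESPAWN_NOW, RESPAWN_DELAYED, RUN_FAILURE_SCRIPT };

struct decision {
    enum action action;
    unsigned delay_ms;   /* only meaningful for RESPAWN_DELAYED */
};

/* Decide what to do after a handler exit.
 * exited_ok:     handler returned success (e.g. its poll timed out)
 * pipe_has_data: poll for reading says events are still queued */
struct decision after_handler_exit(struct respawn_state *s,
                                   int exited_ok, int pipe_has_data)
{
    struct decision d = { RESPAWN_NOW, 0 };
    unsigned delay;

    assert(s);
    if (!exited_ok) {
        s->failures++;
        d.action = RUN_FAILURE_SCRIPT;
        return d;
    }
    s->failures = 0;

    if (!pipe_has_data) {
        /* Normal case: timeout with an empty pipe, reset the counter. */
        s->respawns = 0;
        return d;
    }

    s->respawns++;
    if (s->respawns > MAX_RESPAWNS) {
        d.action = RUN_FAILURE_SCRIPT;
        return d;
    }
    if (s->respawns > FREE_RESPAWNS) {
        /* Doubling delay, capped at MAX_DELAY_MS. */
        delay = BASE_DELAY_MS << (s->respawns - FREE_RESPAWNS - 1);
        d.action = RESPAWN_DELAYED;
        d.delay_ms = delay > MAX_DELAY_MS ? MAX_DELAY_MS : delay;
    }
    return d;
}
```

With these values the first three exits that leave data behind respawn
immediately, the fourth waits 100 ms, the fifth 200 ms, and so on until
the cap or the respawn limit is reached.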

It is necessary to ask these questions, and have the mechanisms in
place to handle that case - but the case should actually never
happen: it is an admin error to have the event handler die too fast.

admins don't make errors! ;)

So it's code that should be there but should never be used; and it's
already there in the supervisor program that should monitor your
netlink listener.

OK, you expect the netlink listener to be watched by a supervisor daemon? Fine, then the fifo supervisor should also be watched, as it was forked from the same process as the netlink reader ... that means when we detect handler failures, we can just die and let the outer supervisor do the job :)

When that happens the system is usually on its way to hell ... and even if that happens, what does it mean for the system? ... hotplug events are no longer handled, we lose them and may have to re-trigger the plug events as soon as hotplug events are processed again (however this is achieved) ... and in the worst case you are back at semi-automatic device management, calling "mdev -s" to update the device file system.

... but consider that the conf file got vandalized, or the device file system ... how do you deal with this? ... do you expect to handle those? ... wouldn't it be better to reboot, after counting the failure in some persistent storage?


So my conclusion is that it's just not worth it to allow the event
handler to die. s6-uevent-listener considers that its child should
be long-lived;

That's the problem with spawning the handler in your netlink reader. The netlink reader has to open the pipe for writing in non-blocking mode, then write a complete message as a single chunk and check the write for failure (you always need to handle that), done. If open/write to the pipe is not possible, the device plug system is gone and needs a restart, so let the netlink listener die (unusual condition). One critical condition should be watched and handled: when the pipe is full and the write (poll for write) times out, what then? ... but this is no different than in your solution.

--
Harald
_______________________________________________
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox
