On 18/03/2015 13:34, Harald Becker wrote:
On 18.03.2015 10:42, Didier Kryn wrote:
Long lived daemons should have both startup methods, selectable by a
parameter, so you make nobody's work more difficult than required.

     OK, I think you are right, because it is a little more than a fork:
you want to detach from the controlling terminal and start a new
session. I agree that it is a pain to do it by hand and it is OK if
there is a command-line switch to avoid all of it.
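
A minimal sketch of what "a little more than a fork" means, written in Python for brevity rather than in the C a real daemon would use. The function name maybe_detach and its foreground parameter are made up for illustration; they stand for whatever command-line switch the daemon would offer.

```python
import os


def maybe_detach(foreground: bool) -> bool:
    """Detach from the controlling terminal unless foreground is requested.

    Returns True in the detached process, False if nothing was done.
    """
    if foreground:
        return False      # supervised case: stay attached, do nothing
    if os.fork() > 0:
        os._exit(0)       # the original process returns to its caller
    os.setsid()           # new session, no controlling terminal any more
    if os.fork() > 0:
        os._exit(0)       # session leader exits so the daemon cannot regain a tty
    return True
```

With the switch set, a supervisor keeps the daemon as its direct child; without it, the caller sees the first process exit immediately while the detached one carries on.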

But there must be this switch.

Ack!


No, restart is not required: netlink dies when fifosvd dies (or
later on when the handler dies), and the supervisor watching netlink may
then fire up a new netlink reader (possibly after failure management).
This startup is always done through a central startup command
(e.g. xdev).

The supervisor never starts up the netlink reader directly, but
watches the process it starts up for xdev. xdev does its initial
action (startup code), then chains (exec) to the netlink reader. This
may look ugly and unnecessarily complicated at first glance, but it is
a known practical trick to drop some memory resources not needed by
the long lived daemon, but required by the startup code. For the
supervisor instance this looks like a single process that it has started
and may watch until it exits. So from that view it looks as if
netlink has created the pipe and started the fifosvd, but in fact this
is done by the startup code (difference between flow of operation and
technical placement of the code).
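
The chaining described above could be sketched roughly as follows. This is Python for brevity and only an illustration under assumptions: start_chain, writer_argv and reader_argv are hypothetical names, with writer_argv standing for the netlink reader and reader_argv for fifosvd.

```python
import os


def start_chain(writer_argv, reader_argv):
    """Startup code: create the pipe, start the consumer, then exec the producer.

    Does not return on success: the calling process becomes writer_argv.
    """
    read_end, write_end = os.pipe()
    if os.fork() == 0:
        # Child: becomes the pipe consumer, reading events on stdin.
        os.dup2(read_end, 0)
        os.close(read_end)
        os.close(write_end)   # the consumer must not hold the write end
        os.execvp(reader_argv[0], reader_argv)
    # Parent: keeps its PID, so a supervisor still sees one single process.
    os.dup2(write_end, 1)
    os.close(write_end)
    os.close(read_end)
    os.execvp(writer_argv[0], writer_argv)
```

Because the parent execs instead of spawning, the producer ends up as the parent of the consumer, and everything the startup code allocated is dropped by the exec.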

     I didn't notice this trick in your description. It is making more
and more sense :-).

I left it out to avoid making things unnecessarily complicated, and I wanted to focus on the netlink / pipe operation.


     Now look, since nldev (let's call it by its name) is execed by
xdev, it remains the parent of fifosvd, and therefore it shall receive
the SIGCLD if fifosvd dies. This is the best way for nldev to watch
fifosvd. Otherwise it should wait until it receives an event from the
netlink and tries to write it to the pipe, hence losing the event and
the possible burst following it. nldev must die on SIGCLD (after piping
available events, though); this is the only "supervision" logic it must
implement, but I think it is critical. And it is the same if nldev is
launched with a long-lived mdev-i without a fifosvd.
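
To illustrate the point about SIGCLD (the older name for SIGCHLD on Linux), here is a small Python sketch in which the parent learns that its child died without writing anything to a pipe. The names on_sigchld and watch_child_exit are invented for the example, and the child that exits with code 3 is only a stand-in for a dying fifosvd.

```python
import os
import signal
import time

child_died = False


def on_sigchld(signum, frame):
    """Signal handler: only record that a child has terminated."""
    global child_died
    child_died = True


def watch_child_exit(timeout=5.0):
    """Fork a short-lived child; return (sigchld_seen, child_exit_code)."""
    global child_died
    child_died = False
    signal.signal(signal.SIGCHLD, on_sigchld)
    pid = os.fork()
    if pid == 0:
        os._exit(3)                      # stand-in for fifosvd dying
    deadline = time.monotonic() + timeout
    while not child_died and time.monotonic() < deadline:
        time.sleep(0.01)                 # a real reader would block on netlink here
    _, status = os.waitpid(pid, 0)       # reap the child, as wait() would
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    return child_died, os.WEXITSTATUS(status)
```

The parent never touches the pipe in this sketch; the notification arrives as soon as the child terminates.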

netlink reader (nldev) does not need to explicitly watch the fifosvd by SIGCHLD.

Either that piece of code does its job, or it fails and dies. When fifosvd dies, the read end of the pipe is closed (by the kernel), unless there is still a handler process (which shall process the remaining events from the pipe). As soon as there is neither a fifosvd nor a handler process, the pipe is shut down by the kernel, and nldev gets an error when writing to the pipe, so it knows the other end died.

No, you must write to the pipe to detect it is broken. And you won't try to write before you've got an event from the netlink. This event will be lost.

You won't gain much benefit from watching SIGCHLD and reading the process status. It will tell you either that the fifosvd process is still running, or that it died (failed). You get the same information from the write to the pipe: when the read end has died, you get EPIPE.

You get the information immediately from SIGCLD. You get it too late from the pipe, and you lose at least one event for sure, or a whole burst if there is one.
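
Both sides of this exchange can be seen in a short Python sketch: the write to a pipe whose read end is gone fails with EPIPE, and the event carried by that very write is not delivered. The function names and event strings are invented for the example, and SIGPIPE is assumed to be ignored so that the error is reported instead of killing the writer.

```python
import os
import signal


def write_event(fd, event: bytes) -> bool:
    """Return True if the event went into the pipe, False on EPIPE."""
    try:
        os.write(fd, event)
        return True
    except BrokenPipeError:          # errno EPIPE: no reader is left
        return False


def demo():
    """Write once with a reader present, once after the reader has gone."""
    # Assumption: SIGPIPE is ignored, otherwise the failing write kills us.
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    read_end, write_end = os.pipe()
    first = write_event(write_end, b"add@/devices/example\n")
    os.close(read_end)               # fifosvd and every handler have gone
    second = write_event(write_end, b"remove@/devices/example\n")
    os.close(write_end)
    return first, second
```

Here demo() reports that the first write succeeded and the second failed; the second event exists only in the writer's memory at that point, so it is up to the writer whether it is lost.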


Limiting the time nldev tries to write to the pipe would also allow detecting stuck operation of fifosvd / handler (which SIGCHLD watching won't reveal) ... but (I discussed that with Laurent in parallel) the question is how to react when the write to the pipe is stuck (but has not failed). We can't do much here, and are in trouble either way, but Laurent gave the argument: the netlink socket also contains a buffer, which may hold additional events, so we do not lose them in case processing continues normally. When the kernel buffer fills up to its limit, let the kernel react to the problem.
    Sure, the limit here is pipe size (adjustable) + netlink buffer size.
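
As an illustration of "adjustable": on Linux the capacity of a pipe can be changed with the F_SETPIPE_SZ fcntl and read back with F_GETPIPE_SZ. This is a Linux-specific sketch; grow_pipe is a made-up helper name, and the numeric fallbacks are only there for Python versions that do not export the constants.

```python
import fcntl
import os

# Linux-specific fcntl commands; fall back to their numeric values if needed.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)


def grow_pipe(fd, wanted_bytes):
    """Ask the kernel for a larger pipe buffer; return the size actually set."""
    fcntl.fcntl(fd, F_SETPIPE_SZ, wanted_bytes)
    return fcntl.fcntl(fd, F_GETPIPE_SZ)
```

The kernel may round the request up, and unprivileged processes are capped by /proc/sys/fs/pipe-max-size.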

... otherwise you are right, nldev's job is to detect failure of the rest of the chain (that is, to supervise it), and it has to react to this. The details of the actions taken in this case need to and can be discussed (and may be adapted later), without much impact on other operation.

This clearly means I'm open to suggestions about which kind of failure handling shall be done. Every action that improves the reaction and benefits the major purpose of the netlink reader, without blowing it up needlessly, is of interest (bear in mind: long lived daemon, trying to keep it simple and small).

My suggestion is: let the netlink reader detect relevant errors, and exec (not spawn) a script of a given name when there are failures. This is small, and gives the invoked script full control over the failure management (no fixed functionality in a binary). When done, it can either die, letting a higher instance do the job of restarting, or exec back and restart the hotplug system (maybe with a different mechanism). When the script does not exist, the default action is to exit the netlink reader process unsuccessfully, giving a higher instance a failure indication and the possibility to react to it.
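
In sketch form (Python for brevity, with a hypothetical function name and no particular script path implied), the suggested failure path is an exec with a fallback exit:

```python
import os
import sys


def hand_over_to_failure_script(script_path):
    """Replace the current process with the failure script, or exit with failure.

    On success this never returns: the script inherits the PID and decides
    what to do next (die, or exec back into the hotplug system).
    """
    try:
        os.execv(script_path, [script_path])
    except OSError:
        # Script missing or not executable: default action is an
        # unsuccessful exit, leaving the reaction to a higher instance.
        sys.exit(1)
```

Since the script is exec'ed rather than spawned, no extra process stays around, and the supervisor keeps watching the same PID.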

This is fine as long as the netlink reader keeps control over its exit, but not if it is killed.

This netlink reader you describe is not the general tool we were considering up to now, the simple data funnel. If the idea is to integrate such peculiarities as execing a script, then it is not the general tool and why not integrate as well the supervision of mdev-i instead of needing fifosvd. The reason for fifosvd was AFAIU to associate general tools, nldev and mdev-i.

On the other hand, exiting on SIGCLD (after wait()ing the child) is neither a major change to nldev, nor one which would preclude its use in any other case.



Not detect? Are you sure you closed all open file descriptors for the
write end (a common caveat)? I have never been hit by such a case, except
when someone forgot to close all file descriptors of the write end.

     You notice that something happened on input (AFAIR) but I'm sure
you don't know what. It may be data as well. You must read() to know.

The information is all you need. Either the writer process is still there (good), or has gone (bad).

OK, let's assume fifosvd polls the pipe. As long as poll() blocks, it means nldev is alive and is waiting for an event. When poll() returns, it means either nldev has piped an event or it has died, you don't know which; you don't get the information you need because the only way to get it is to read from the pipe.

Now suppose nldev is dead but fifosvd doesn't read. It assumes there is data and launches mdev-i. mdev-i dies immediately and fifosvd polls again; poll returns immediately. This is endless.

However there is an indirect way to get the information that nldev died; it is from the return code of mdev-i.
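
The "you must read() to know" argument can be sketched like this in Python. wait_and_classify is a made-up name; it treats any wake-up from poll() the same way and lets read() decide, where an empty read means end of file, i.e. no writer is left.

```python
import os
import select


def wait_and_classify(fd, timeout_ms=1000):
    """Wait for the pipe to become ready, then let read() say what happened."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    if not poller.poll(timeout_ms):
        return "timeout"
    data = os.read(fd, 4096)
    # Empty read means end of file: every writer has closed the pipe.
    return "event" if data else "writer-gone"
```

Note that the read() consumes the data, which is exactly the problem for a supervisor that wants to leave the events to the handler. On Linux the mask returned by poll() may also carry POLLHUP once the last writer has closed, but this sketch deliberately relies on read() alone.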

    Didier
