On 18.03.2015 18:08, Didier Kryn wrote:
Either that piece of code does its job, or it fails and dies. When
fifosvd dies, the read end of the pipe is closed (by the kernel),
unless there is still a handler process (which shall process the
remaining events from the pipe). As soon as there is neither a
fifosvd nor a handler process, the pipe is shut down by the
kernel, and nldev gets an error when writing to the pipe, so it knows
the other end died.
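
To illustrate the pipe behaviour described above, here is a minimal Python sketch (not code from nldev or fifosvd). The names write_event and demo are hypothetical, and the two read descriptors merely stand in for fifosvd and a handler process. It relies on CPython ignoring SIGPIPE by default, so the failed write shows up as EPIPE (BrokenPipeError) instead of killing the writer.

```python
import os

def write_event(wfd, payload):
    """Try to write one event; return False once no read end is left."""
    try:
        os.write(wfd, payload)
        return True
    except BrokenPipeError:
        # EPIPE: every read end of the pipe has been closed.
        # (CPython ignores SIGPIPE by default, so we get the error code.)
        return False

def demo():
    rfd_supervisor, wfd = os.pipe()        # stand-in for fifosvd's read end
    rfd_handler = os.dup(rfd_supervisor)   # stand-in for a handler's copy
    results = []

    os.close(rfd_supervisor)               # "fifosvd dies", handler remains
    results.append(write_event(wfd, b"add sda\n"))

    os.close(rfd_handler)                  # now the last reader is gone
    results.append(write_event(wfd, b"add sdb\n"))

    os.close(wfd)
    return results
```

The first write still succeeds because one read end is open; only the second one fails, and the writer learns about it only at the moment it tries to write.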

No, you must write to the pipe to detect it is broken. And you won't
try to write before you've got an event from the netlink. This event
will be lost.

Ok, that's true, but even if we catch SIGCHLD, the possibility of a
race condition (event already read from netlink but not yet written to
the pipe) is very high. So you throw in code that catches only part of a
situation which is problematic anyway.

When netlink dies, the socket is closed, so the kernel throws away
further event messages until a new netlink socket is opened (which takes
some time for startup). Do you think it makes much difference if we
lose one more event?

I think when netlink has to be restarted, it's best to re-trigger the
plug events (cold plug). The handler shall test whether the device node
for an operation already exists and matches the current event; then it
may silently ignore those duplicate plug events (and not redo e.g. plug
scripts) ... but it may catch a new event message that does not match an
existing device entry, and take some failure action (not currently in
mdev).
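
A rough Python sketch of that duplicate check, under my own assumptions: the function names, the (path, major, minor, is_block) arguments and the three return strings are all hypothetical and do not describe what mdev currently does.

```python
import os
import stat

def node_matches_event(path, major, minor, is_block):
    """Return True if a device node at `path` already matches the event."""
    try:
        st = os.stat(path)
    except OSError:
        return False                       # no usable node at that path
    if is_block:
        kind_ok = stat.S_ISBLK(st.st_mode)
    else:
        kind_ok = stat.S_ISCHR(st.st_mode)
    same_dev = (os.major(st.st_rdev), os.minor(st.st_rdev)) == (major, minor)
    return kind_ok and same_dev

def handle_add(path, major, minor, is_block):
    """Decide what to do with an 'add' event (possibly a cold-plug repeat)."""
    if node_matches_event(path, major, minor, is_block):
        return "ignored"    # duplicate plug event, do not redo plug scripts
    if os.path.lexists(path):
        return "mismatch"   # entry exists but differs: hook for a failure action
    return "create"         # genuinely new device
```

For example, calling handle_add("/dev/null", ...) with the major and minor numbers that os.stat reports for /dev/null yields "ignored", while a different minor number yields "mismatch".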

You get the information immediately from SIGCLD. You get it too late
from the pipe, and you lose at least one event for sure, a whole
burst if there is one.

In technical principle you are right, we lose an event message
here, but ...

... one event due to the pipe failure, which means our hot plug system
has serious problems, then we die and lose how many events until we
have fixed the problem and restarted the hotplug system?

Does that one event make a big difference? For what? Extra code which
neither fixes the underlying problem nor allows recovery?

... and in addition, most likely fifosvd won't die before netlink has
closed the write end of the pipe. Remember, fifosvd does the failure
management for the handler process, restarting it when required and only
dying when there are serious problems, which most likely need admin
(that means manual) intervention.

Do you think it matters losing one more event?

This is fine as long as the netlink reader keeps control on its exit,
not if it's killed.

And when netlink is killed, it is the responsibility of the higher
instance to bring the required stuff up again.

This netlink reader you describe is not the general tool we were
considering up to now, the simple data funnel.

My pseudo code described the principle of operation and the data flow,
not every gory detail of failure management. So the netlink described
here is what I last called "netlink the Unix way".

If the idea is to integrate such peculiarities as execing a script,
then it is not the general tool and why not integrate as well the
supervision of mdev-i instead of needing fifosvd. The reason for
fifosvd was AFAIU to associate general tools, nldev and mdev-i.

??? I don't know if I fully understand you here. And why should exec'ing a failure script prevent netlink from being a general tool? Consider:

nldev -e /path/to/failure/script

With maybe a default of /sbin/nldev-fail.

... and that single exec with a fixed and small number of arguments is usually very small compared to complete failure management (supervision) for the handler process.

Putting this into the same code would make the netlink reader code more complex than otherwise required, and in addition you lose possible parallelism due to multi threading on modern processors.


On the other hand, exiting on SIGCLD (after wait()ing the child) is
neither a major change to nldev, nor one which would preclude its use
in any other case.

The problem is the complexity which arises from this. nldev does not wait for any process, but then it would need to do so and grab the child status, then see if this is the process id of fifosvd ... oops, wait, where do we get that? ... Ok, an extra parameter to pass, just so we know a bit earlier that our hotplug system is dying due to serious problems? Problems which usually won't vanish by simply restarting, without some kind of intervention to fix whatever let the system die.
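
For reference, the bookkeeping being discussed (catch SIGCHLD, wait for the child, compare against a known pid) would look roughly like the following Python sketch. It is only an illustration of the mechanism, not a proposal for nldev; the names state, on_sigchld, reap and demo are hypothetical, and the forked child that exits with status 7 merely stands in for fifosvd.

```python
import os
import signal
import time

state = {"sigchld_seen": False}

def on_sigchld(signum, frame):
    # Only set a flag here; the real work happens in the main loop.
    state["sigchld_seen"] = True

def reap(watched_pid):
    """Reap all exited children; return the watched child's exit code or None."""
    watched_code = None
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break                      # no children left at all
        if pid == 0:
            break                      # children exist, but none has exited
        if pid == watched_pid and os.WIFEXITED(status):
            watched_code = os.WEXITSTATUS(status)
    return watched_code

def demo():
    old = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        pid = os.fork()                # this pid is the "extra parameter"
        if pid == 0:
            os._exit(7)                # stand-in child dies at once
        deadline = time.monotonic() + 5.0
        while not state["sigchld_seen"] and time.monotonic() < deadline:
            time.sleep(0.01)
        code = reap(pid)
    finally:
        signal.signal(signal.SIGCHLD, old)
    return state["sigchld_seen"], code
```

Note that the parent has to know the pid it is interested in, which is exactly the extra state mentioned above.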

The only other situation is killing the netlink system, but this runs slightly differently. The kill signal usually goes to the topmost process in the chain. That is nldev, which will die and close the socket and the pipe, letting the handler finish the messages still pending in the pipe and then exit gracefully. This shall be detected by fifosvd, which lets this helper vanish as well.

And if someone really kills fifosvd just for fun (needs root privileges), we are in trouble anyway. Does one more lost hotplug event matter in that case?

OK, let's assume fifosvd polls the pipe. As long as poll() blocks, it
means nldev is alive and is waiting for an event. When poll() returns,
it means either nldev has piped an event or it has died, you don't
know which; you don't get the information you need because the only
way to get it is to read from the pipe.

And here you are wrong: when nldev writes something to the pipe, poll will report the pipe as readable (good, an event needs handling). But if nldev shuts down the pipe, poll has to (and this needs to be checked again) report some kind of indication, because further reads from the pipe won't give any result, so it would be nonsense to keep waiting for read. I know this detection was no problem with select, but just assume for now that poll signals "read possible" when nldev has closed the pipe. This would result in fifosvd starting a handler process, which tries to read from the pipe and detects what? End of file. So the handler dies and signals end-of-file. This brings fifosvd down as well. I told you, whichever way fifosvd takes, at the end of that way it shall detect that the pipe is gone, and take its hat too.
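
The poll question can be checked with a small experiment. The following Python sketch assumes Linux pipe semantics (POLLHUP is reported once the last writer has closed, POLLIN as long as data is pending); pipe_state and demo are hypothetical names, and the supervisor side never reads from the pipe itself.

```python
import os
import select

def pipe_state(rfd, timeout_ms=0):
    """Classify the read end of a pipe without reading from it."""
    poller = select.poll()
    poller.register(rfd, select.POLLIN)
    for _fd, events in poller.poll(timeout_ms):
        if events & select.POLLIN:
            return "data"        # at least one byte is waiting
        if events & (select.POLLHUP | select.POLLERR):
            return "hangup"      # nothing pending and no writer left
    return "idle"                # writer alive, nothing to read

def demo():
    rfd, wfd = os.pipe()
    seen = [pipe_state(rfd)]           # writer open, pipe empty
    os.write(wfd, b"add sda\n")
    seen.append(pipe_state(rfd))       # event pending
    os.close(wfd)                      # stand-in for nldev going away
    seen.append(pipe_state(rfd))       # pending event is still readable
    os.read(rfd, 4096)                 # a handler would drain it
    seen.append(pipe_state(rfd))       # now only the hangup remains
    seen.append(os.read(rfd, 4096))    # a read at this point returns EOF
    os.close(rfd)
    return seen
```

Under these assumptions the hangup is distinguishable from pending data, and even where it is not, the read that follows returns end-of-file immediately.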


Now suppose nldev is dead but fifosvd doesn't read.

fifosvd will never read, and (in case poll does not give us a signal of the EOF) it will start a handler, which will do the read, which detects that the pipe is gone.

It assumes there is data and launches mdev-i. mdev-i dies immediately and
fifosvd polls again; poll returns immediately. This is endless.

... just to note: I don't know who brought that mdev -i in; it does not match my intention for -i, which was chosen for "init", the initial device file system creation and setup. I assume you mean the device conf file parser / device operation handler part of the system.

No. Failure management means detecting such ping-pong games and throwing us out of this loop.
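
One simple way to detect such a ping-pong is to count handler deaths within a time window. The class below is a hypothetical sketch (the name RespawnGuard and the default limits are my assumptions, not part of any existing fifosvd code); the supervisor would call record_failure each time a handler dies right after being started.

```python
from collections import deque

class RespawnGuard:
    """Give up when too many handler failures occur within a time window."""

    def __init__(self, max_failures=5, window=10.0):
        self.max_failures = max_failures   # failures tolerated per window
        self.window = window               # window length in seconds
        self.failures = deque()            # timestamps of recent failures

    def record_failure(self, now):
        """Record one failure at time `now`; return True if we should stop."""
        self.failures.append(now)
        # Forget failures that are older than the window.
        while self.failures and now - self.failures[0] > self.window:
            self.failures.popleft()
        return len(self.failures) >= self.max_failures
```

Passing the timestamp in (for example from time.monotonic()) keeps the logic independent of the clock, so an occasional handler crash is tolerated while a tight respawn loop ends after max_failures rounds.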

I know that pseudo code was simplified, and failure handling is always the harder part of the job. So did you expect that pseudo code to be the complete program structure? Then sorry, the intention was just to show the principle of operation and the data flow, not the details of supervision you are asking for.


However there is an indirect way to get the information that nldev
died; it is from the return code of mdev-i.

Ack, that is one way fifosvd may take to detect that the system is dying, one of several ways. I haven't listed them all yet, but we haven't stepped into the code hacking phase yet. Currently I'm splitting off the initial part from the device file system part, as I have been asked to, to make that even more general. It may then be used as a general table-driven tool for device operations, not only for the device file system, without any need to fear logical mixing.
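
The return-code route could be sketched like this in Python. Everything here is an assumption for illustration: the exit status 3 for "pipe is gone" is a made-up convention, and the tiny child script only stands in for the real handler by reading its standard input until end-of-file.

```python
import os
import subprocess
import sys

EXIT_PIPE_GONE = 3   # hypothetical convention: EOF seen before any event

# Stand-in handler: read stdin until EOF, report whether anything arrived.
HANDLER_SRC = (
    "import sys\n"
    "data = sys.stdin.buffer.read()\n"
    f"sys.exit(0 if data else {EXIT_PIPE_GONE})\n"
)

def run_handler(rfd):
    """Run one stand-in handler on the pipe and return its exit status."""
    proc = subprocess.run([sys.executable, "-c", HANDLER_SRC], stdin=rfd)
    return proc.returncode

def demo():
    rfd, wfd = os.pipe()
    os.write(wfd, b"add sda\n")
    os.close(wfd)                  # the writer (stand-in for nldev) is gone
    first = run_handler(rfd)       # drains the pending event
    second = run_handler(rfd)      # sees EOF immediately
    os.close(rfd)
    return first, second
```

The first handler still processes the pending event and exits normally; the second one finds nothing but end-of-file, and its status tells the supervisor that the pipe is gone.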

--
Harald
_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox