Wayne Marshall:
Under a supervision framework, failure of a service starting is absolutely ok.  
(Many novices fail to grasp the elegance of this essential feature.)

... and novices and non-novices alike fail to grasp its unscalability. It may be fine on a hobbyist PC, but on a server in a datacentre one gets situations like a program that needs two database servers and a message queue broker to be up and ready before it can run, and of which one is running 10 instances for scalability. 10 client programs crashing and restarting over and over whilst rabbitmq-server and mysqld are trying to come up do not make for a happy startup. "I want", says the system administrator, "my machine to spend its precious processor and disc on bringing up the things that everything is waiting for, not on repeatedly starting and crashing the things that are doing the waiting." Let us not forget the logfile and monitoring system noise that the thundering herd approach engenders, too.

Two things make this world more tolerable: early server socket opening and readiness protocols. Unfortunately, much "enterprise" software has yet to even embrace the former, let alone the latter. But there are some promising tiny green shoots. Early server socket opening makes clients _block_ rather than _abend_. Readiness protocols fill in corner cases that aren't necessarily strictly client-server, and also deal with the fact that "up for over N seconds" may or may not mean "ready" according to what day of the week it is (i.e. what the system activity pattern happens to be at the time).
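To make the "block rather than abend" point concrete, here is a minimal Python sketch of early server socket opening. It is only an illustration under stated assumptions: the function names, the loopback address, and the sleep standing in for slow initialisation are all hypothetical, and in a real deployment the listening socket might instead be opened by the service manager and passed to the server.

```python
import socket
import threading
import time


def serve_with_early_listen(init_seconds=0.5):
    """Open the listening socket first, then do the slow initialisation.

    Hypothetical helper: returns the port chosen by the kernel and the
    thread that plays the part of the server.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 0))   # port 0: let the kernel pick a free port
    srv.listen(16)               # from here on, connections queue in the backlog
    port = srv.getsockname()[1]

    def worker():
        time.sleep(init_seconds)  # stand-in for slow startup (assumption)
        conn, _ = srv.accept()    # pick up the client that has been waiting
        with conn:
            conn.sendall(b"ready\n")
        srv.close()

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return port, thread


def client(port):
    """Connect straight away; the read blocks until the server is ready."""
    start = time.monotonic()
    with socket.create_connection(("127.0.0.1", port), timeout=5) as conn:
        data = conn.recv(16)
    return data, time.monotonic() - start
```

The client's connection is accepted by the kernel into the listen backlog while the server is still initialising, so the client simply waits in its read instead of getting "connection refused", exiting, and being restarted by its supervisor.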

Wayne Marshall:
Note also that in no case is it necessary for a service runscript to try 
starting dependencies itself -- this is all left to the supervisor.

It need not even be the purview of the service manager. nosh doesn't do dependency processing in either the "run" programs or the service manager. It does it in the "system-control" program. Dependency processing is "policy": the decision of what to start and what to stop, in what order, and when. Service management is "mechanism": the raw mechanics of service state. With this split, one can even have two "policies", system-control and service-dt-scanner, running at the same time. Or someone could come along and write a third, indeed.
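The policy/mechanism split can be sketched in a few lines of Python. This is emphatically not how nosh is implemented; the class, the function, and the "wants" table are hypothetical names invented for illustration, and the ordering uses the standard library's graphlib module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


class Mechanism:
    """Raw service state only; it knows nothing about dependencies."""

    def __init__(self):
        self.running = []  # services in the order they were started

    def start(self, name):
        if name not in self.running:
            self.running.append(name)

    def stop(self, name):
        if name in self.running:
            self.running.remove(name)


def start_with_dependencies(mechanism, wants, target):
    """Policy: decide what to start and in what order, then ask the mechanism.

    `wants` is a hypothetical table mapping a service to the services
    that must be started before it.
    """
    # Collect the target and everything it transitively depends upon.
    needed = set()
    stack = [target]
    while stack:
        service = stack.pop()
        if service not in needed:
            needed.add(service)
            stack.extend(wants.get(service, ()))

    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    graph = {service: set(wants.get(service, ())) for service in needed}
    for name in TopologicalSorter(graph).static_order():
        mechanism.start(name)
```

A second, differently behaved policy could drive the same Mechanism object through the same start and stop calls, which is the point of keeping the two apart.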
