> That would not work if we want LuxiD to be able to be restarted while jobs
> are running (might be useful for easier upgrades). We would like to persist
> information about jobs and the job queue to disk, and then obviously the
> parent/child relationship is gone. But maybe we could implement the proper
> way for normal operation and check only on startup using the PID/creation
> time/cmdline in /proc approach.

Why not reverse the direction of the parent/child pings? For example, the
master could own a single UNIX socket, and the job processes would have to
ping on that socket every now and then. This way we have just one socket
instead of one per process.

If the master dies, the job processes notice because the UNIX socket gets
closed, but they just keep retrying until the socket comes back. Moreover,
perhaps we don't have to persist any job queue information, because when
the master comes back up, it will simply collect the pings from the job
processes that are still running.

If you like this idea, we can even remove the UNIX socket from the picture
and simply add a LUXI ping request, used only by the job processes, to
communicate with the master.

What do you think?

Jose
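
P.S. To make the reversed ping direction concrete, here is a rough Python
sketch. It is only an illustration under assumptions of mine, not a proposed
implementation: the names PingRegistry, serve_once and ping_master, the
30-second timeout, and the one-line "job id" message are all made up, and it
opens one short connection per ping instead of keeping a connection open.

```python
import socket
import time


class PingRegistry:
    """Master-side view of live jobs, rebuilt purely from incoming pings."""

    def __init__(self, timeout=30.0):
        # Hypothetical timeout: a job is presumed dead after this many
        # seconds without a ping.
        self.timeout = timeout
        self.last_seen = {}  # job id -> timestamp of the latest ping

    def record_ping(self, job_id, now=None):
        self.last_seen[job_id] = time.time() if now is None else now

    def live_jobs(self, now=None):
        now = time.time() if now is None else now
        return sorted(job for job, seen in self.last_seen.items()
                      if now - seen <= self.timeout)


def serve_once(server_sock, registry):
    """Master side: accept one connection and record the job id it sends."""
    conn, _ = server_sock.accept()
    with conn:
        data = conn.recv(64)  # assumes the job id fits in one short message
    if data:
        registry.record_ping(data.decode("ascii").strip())


def ping_master(socket_path, job_id, retries=5, delay=1.0):
    """Job side: ping the master, retrying while its socket is unavailable."""
    for _ in range(retries):
        try:
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
                sock.connect(socket_path)
                sock.sendall(f"{job_id}\n".encode("ascii"))
            return True
        except OSError:
            # Master is down or restarting: wait and try again.
            time.sleep(delay)
    return False
```

After a restart the master would start with an empty PingRegistry and fill
it again as the surviving job processes ping in, so nothing about running
jobs would need to be read back from disk for liveness purposes.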
