On Wed, Aug 7, 2013 at 9:34 PM, Iustin Pop <[email protected]> wrote:
> On Wed, Aug 07, 2013 at 04:03:38PM +0200, Michele Tartara wrote:
>> On Wed, Aug 7, 2013 at 1:56 PM, Guido Trotter <[email protected]> wrote:
>>
>> > On Wed, Aug 7, 2013 at 9:36 AM, Thomas Thrainer <[email protected]> wrote:
>> > > On Tue, Aug 6, 2013 at 5:56 PM, Michele Tartara <[email protected]> wrote:
>> > >> +``Configuration management daemon (ConfDW)``
>> > >> + It will run on the master node and it will be responsible for the management
>> > >> + of the authoritative copy of the cluster configuration (that is, it will be
>> > >> + the daemon actually modifying the ``config.data`` file). All the requests of
>> > >> + configuration changes will have to pass through this daemon. Having a single
>> > >> + point of configuration management will also allow Ganeti to get rid of
>> > >> + possible race conditions due to concurrent modifications of the configuration.
>> > >> + When the configuration is updated, it will have to push the received changes
>> > >> + to the ConfDR daemons, to keep them up to date.
>> > >> + This daemon will also be the one responsible for managing the locks, granting
>> > >> + them to the jobs requesting them, and taking care of freeing them up if the
>> > >> + jobs holding them crash or are terminated before releasing them.
>> > >
>> > > How?
>> >
>> > To be detailed. (in this or a separate design, to keep just the split
>> > simpler).
>> > (I believe it should be detailed, but as long as we don't think it's
>> > impossible we can defer the detailing and point from here to a second
>> > design: of course we should have that design too, before
>> > implementing).
>>
>> I guess checking for the existence of a process with the PID of the lock
>> holder should be enough.
>> I know PIDs are not guaranteed to be unique, but I think they are unique
>> enough for this not to be a problem.
>> And if we really think this is going to be a problem, we can also check the
>> actual program command line via /proc.
>
> This is still not the best way (I think).
>
> The way this is usually done in Unix is that the forking process "knows"
> its children and receives termination signals (SIGCHLD) when they exit;
> that way, it knows precisely which children are still running and which
> have died.
>
> So if you keep a simple mapping between child PID and job ID, it should
> be fine.
>
> Note that I don't know how well Haskell deals with SIGCHLD and whether
> it's still easily usable or if it's completely hidden by some RunProcess
> abstraction…
>
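For reference, the PID-to-job-ID mapping described above could look roughly like the following. This is only a minimal Python sketch, not a proposal for the actual daemon: `start_job`, `reap_children`, `running_jobs` and `finished_jobs` are names made up for illustration, and the place where locks would be freed is just a comment.

```python
import os

# Hypothetical bookkeeping, for illustration only: child PID -> job ID.
running_jobs = {}
finished_jobs = []


def start_job(job_id, work):
    """Fork a child that runs work() and remember which job it belongs to."""
    pid = os.fork()
    if pid == 0:
        # Child: run the job, then exit without returning into parent code.
        status = 1
        try:
            work()
            status = 0
        finally:
            os._exit(status)
    # Parent: record the mapping so a later exit can be attributed to the job.
    running_jobs[pid] = job_id
    return pid


def reap_children():
    """Collect every child that has exited so far, without blocking."""
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # this process has no children left
        if pid == 0:
            return  # children exist, but none has exited yet
        job_id = running_jobs.pop(pid, None)
        if job_id is not None:
            # This is where the locks held by job_id would be released.
            finished_jobs.append(job_id)
```

A daemon would call `reap_children()` when SIGCHLD arrives (or from its main loop). If it is called directly from a signal handler, SIGCHLD should be blocked around the fork, for example with `signal.pthread_sigmask`, so that a child cannot exit before its PID has been recorded.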
Note that if jobs run in separate processes, we need to make sure the
way we handle them survives a restart of the job daemon, since they can
be pretty long-running themselves. As such, the SIGCHLD option won't work
without further changes. But I also don't believe in tracking PIDs and
"checking": I think a system of communication (via filesystem sockets,
probably) should be in place for this to be resilient. What do you
think?

>> > >> +leaving the codebase in a consistent and usable state.
>> > >> +
>> > >> +#. Rename QueryD to LuxiD.
>> > >
>> > > Already done. QueryD existed only for a day or so and is probably not worth
>> > > mentioning.
>>
>> If I recall correctly, the review of the patch introducing the renaming was
>> LGTMed (I think by Iustin) after the promise of a design doc explaining the
>> reason for that. This is such a design doc, so I think it should stay here.
>
> I hope my LGTM didn't create problems :/
>
> And thanks for this design, indeed it's what I was curious about :)
>

No problem at all, we wanted the rename anyway.

Thanks,

Guido
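
P.S. To make the socket idea above a bit more concrete, here is a minimal Python sketch of one possible reading of it: each job keeps a stream socket open to the daemon, and the daemon treats end-of-file on that socket as "the job is gone, free its locks". The kernel closes a process's sockets when it exits or crashes, so no PID checking is involved. `LockOwnerWatcher` and its methods are hypothetical names, and `socket.socketpair()` stands in for a real filesystem socket just to keep the example self-contained.

```python
import socket


class LockOwnerWatcher:
    """Hypothetical helper: ties the locks of a job to an open connection."""

    def __init__(self):
        self.locks = {}  # job_id -> set of lock names held by that job
        self.conns = {}  # job_id -> daemon-side socket for that job

    def register(self, job_id, conn, lock_names):
        conn.setblocking(False)
        self.conns[job_id] = conn
        self.locks[job_id] = set(lock_names)

    def poll(self):
        """Release the locks of every job whose connection has been closed."""
        released = []
        for job_id, conn in list(self.conns.items()):
            try:
                data = conn.recv(1)
            except BlockingIOError:
                continue  # peer still connected, nothing to read
            except ConnectionResetError:
                data = b""
            if data == b"":
                # End-of-file: the job exited, crashed, or closed the socket.
                conn.close()
                del self.conns[job_id]
                released.append((job_id, sorted(self.locks.pop(job_id))))
            # Any byte actually received is ignored in this sketch.
        return released


# Usage: the job side holds one end, the daemon watches the other.
daemon_side, job_side = socket.socketpair()
watcher = LockOwnerWatcher()
watcher.register(7, daemon_side, ["node1"])
still_held = watcher.poll()   # job is alive, nothing released
job_side.close()              # simulates the job terminating
freed = watcher.poll()        # the job's locks are reported as released
```

With a named socket on the filesystem, a job could also reconnect after the daemon restarts, which is the part the SIGCHLD approach cannot cover. How the daemon would rebuild its lock table after a restart is left open here.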
