On Wed, Aug 7, 2013 at 9:34 PM, Iustin Pop <[email protected]> wrote:

> On Wed, Aug 07, 2013 at 04:03:38PM +0200, Michele Tartara wrote:
> > On Wed, Aug 7, 2013 at 1:56 PM, Guido Trotter <[email protected]> wrote:
> >
> > > On Wed, Aug 7, 2013 at 9:36 AM, Thomas Thrainer <[email protected]> wrote:
> > > > On Tue, Aug 6, 2013 at 5:56 PM, Michele Tartara <[email protected]> wrote:
> > > >> +``Configuration management daemon (ConfDW)``
> > > >> + It will run on the master node and it will be responsible for the management
> > > >> + of the authoritative copy of the cluster configuration (that is, it will be
> > > >> + the daemon actually modifying the ``config.data`` file). All the requests of
> > > >> + configuration changes will have to pass through this daemon. Having a single
> > > >> + point of configuration management will also allow Ganeti to get rid of
> > > >> + possible race conditions due to concurrent modifications of the configuration.
> > > >> + When the configuration is updated, it will have to push the received changes
> > > >> + to the ConfDR daemons, to keep them up to date.
> > > >> + This daemon will also be the one responsible for managing the locks, granting
> > > >> + them to the jobs requesting them, and taking care of freeing them up if the
> > > >> + jobs holding them crash or are terminated before releasing them.
> > > >
> > > > How?
> > > >
> > >
> > > To be detailed. (in this or a separate design, to keep just the split
> > > simpler).
> > > (I believe it should be detailed, but as long as we don't think it's
> > > impossible we can defer the detailing and point from here to a second
> > > design: of course we should have that design too, before
> > > implementing).
> > >
> >
> > I guess checking for the existence of a process with the PID of the lock
> > holder should be enough.
> > I know PIDs are not guaranteed to be unique, but I think they are unique
> > enough for this not to be a problem.
> > And if we really think this is going to be a problem, we can also check the
> > actual program command line via /proc.
>
> This is still not the best way (I think).
>
> The way this is usually done in Unix is that the forking process "knows"
> its children and receives termination signals (SIGCHLD) when they exit;
> that way, it knows precisely which children are still running and which
> have died.
>
> So if you keep a simple mapping between child PID and job ID, it should
> be fine.
>
> Note that I don't know how well Haskell deals with SIGCHLD and whether
> it's still easily usable or if it's completely hidden by some RunProcess
> abstraction…
>
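To make the SIGCHLD suggestion above concrete, here is a minimal sketch of the
"mapping between child PID and job ID" idea. It is written in Python purely
for illustration and is not Ganeti code: the names running_jobs,
finished_jobs, install_handler, reap_children and start_job are all
hypothetical, and the job IDs are plain strings.

```python
import os
import signal

# Hypothetical bookkeeping: child PID -> job ID, and job ID -> raw wait status.
running_jobs = {}
finished_jobs = {}


def reap_children(signum=None, frame=None):
    """SIGCHLD handler: collect every child that has already exited."""
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # this process has no children at all
        if pid == 0:
            return  # children exist, but none has exited yet
        job_id = running_jobs.pop(pid, None)
        if job_id is not None:
            # A real daemon would release the job's locks here.
            finished_jobs[job_id] = status


def install_handler():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGCHLD, reap_children)


def start_job(job_id, argv):
    """Fork and exec argv, remembering which job the child PID belongs to."""
    # Block SIGCHLD so a very short-lived child cannot be reaped before
    # its PID has been recorded in running_jobs.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pid = os.fork()
        if pid == 0:
            try:
                # The signal mask survives exec, so restore it in the child.
                signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
                os.execvp(argv[0], argv)
            finally:
                os._exit(127)  # only reached if exec failed
        running_jobs[pid] = job_id
        return pid
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)
```

Note that both dictionaries live only in the parent's memory, so everything
the parent knows about its children is lost if the parent itself is restarted.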
That would not work if we want LuxiD to be restartable while jobs are
running (which might be useful for easier upgrades). We would like to persist
information about jobs and the job queue to disk, and after a restart the
parent/child relationship is obviously gone. But maybe we could implement the
proper way for normal operation, and check only on startup using the
PID/creation time/cmdline in /proc approach.

> > > >> +leaving the codebase in a consistent and usable state.
> > > >> +
> > > >> +#. Rename QueryD to LuxiD.
> > > >
> > > > Already done. QueryD existed only for a day or so and is probably not
> > > > worth mentioning.
> > >
> >
> > If I recall correctly, the review of the patch introducing the renaming was
> > LGTMed (I think by Iustin) after the promise of a design doc explaining the
> > reason for that. This is such a design doc, so I think it should stay here.
>
> I hope my LGTM didn't create problems :/
>
> And thanks for this design, indeed it's what I was curious about :)
>
> iustin

-- 
Thomas Thrainer | Software Engineer | [email protected] | Google Germany GmbH
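
For illustration, the startup check based on PID, creation time and command
line could look roughly like the following Python sketch. It assumes a Linux
/proc layout (start time is field 22 of /proc/<pid>/stat, in clock ticks since
boot), and the helper names read_process_identity and job_still_running are
hypothetical. The idea is that the daemon would persist the identity tuple
next to the job when it starts it, and compare it again after a restart.

```python
def read_process_identity(pid):
    """Return (start_time_ticks, cmdline_bytes) for pid, or None if it is gone."""
    try:
        with open("/proc/%d/stat" % pid, "rb") as stat_file:
            stat = stat_file.read()
        with open("/proc/%d/cmdline" % pid, "rb") as cmdline_file:
            cmdline = cmdline_file.read()
    except (FileNotFoundError, ProcessLookupError):
        return None  # no such process (or it exited while we were reading)
    # The second field (comm) may itself contain spaces or parentheses, so
    # split only what follows the last ')'. That remainder starts at field 3
    # (state), which puts field 22 (starttime) at index 19.
    fields = stat[stat.rindex(b")") + 2:].split()
    return int(fields[19]), cmdline


def job_still_running(pid, recorded_identity):
    """True only if pid exists and is the same process that was recorded."""
    current = read_process_identity(pid)
    return current is not None and current == recorded_identity
```

Comparing the start time is what guards against PID reuse: a new process that
happens to get the same PID will have a different start time. The command line
is only an extra sanity check, since a process can change it after it starts.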
