On Wed, Aug 7, 2013 at 9:34 PM, Iustin Pop <[email protected]> wrote:
> On Wed, Aug 07, 2013 at 04:03:38PM +0200, Michele Tartara wrote:
>> On Wed, Aug 7, 2013 at 1:56 PM, Guido Trotter <[email protected]> wrote:
>>
>> > On Wed, Aug 7, 2013 at 9:36 AM, Thomas Thrainer <[email protected]> wrote:
>> > > On Tue, Aug 6, 2013 at 5:56 PM, Michele Tartara <[email protected]> wrote:
>> > >> +``Configuration management daemon (ConfDW)``
>> > >> +  It will run on the master node and it will be responsible for the management
>> > >> +  of the authoritative copy of the cluster configuration (that is, it will be
>> > >> +  the daemon actually modifying the ``config.data`` file). All the requests of
>> > >> +  configuration changes will have to pass through this daemon. Having a single
>> > >> +  point of configuration management will also allow Ganeti to get rid of
>> > >> +  possible race conditions due to concurrent modifications of the configuration.
>> > >> +  When the configuration is updated, it will have to push the received changes
>> > >> +  to the ConfDR daemons, to keep them up to date.
>> > >> +  This daemon will also be the one responsible for managing the locks, granting
>> > >> +  them to the jobs requesting them, and taking care of freeing them up if the
>> > >> +  jobs holding them crash or are terminated before releasing them.
>> > >
>> > >
>> > > How?
>> > >
>> >
>> > To be detailed. (in this or a separate design, to keep just the split
>> > simpler).
>> > (I believe it should be detailed, but as long as we don't think it's
>> > impossible we can defer the detailing and point from here to a second
>> > design: of course we should have that design too, before
>> > implementing).
>> >
>>
>> I guess checking for the existence of a process with the PID of the lock
>> holder should be enough.
>> I know PIDs are not guaranteed to be unique, but I think they are unique
>> enough for this not to be a problem.
>> And if we really think this is going to be a problem, we can also check the
>> actual program command line via /proc.
>
> This is still not the best way (I think).
>
> The way this is usually done in Unix is that the forking process "knows"
> its children and receives termination signals (SIGCHLD) when they exit;
> that way, it knows precisely which children are still running and which
> have died.
>
> So if you keep a simple mapping between child PID and job ID, it should
> be fine.
>
> Note that I don't know how well Haskell deals with SIGCHLD and whether
> it's still easily usable or if it's completely hidden by some RunProcess
> abstraction…
>
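
For reference, the PID-to-job-ID mapping described above could look roughly
like the sketch below. It is only an illustration in Python, not Ganeti code:
the class name, the job IDs and the "finished" bookkeeping are all made up for
the example. reap() is meant to be called from the daemon's main loop whenever
it is told that a SIGCHLD arrived.

```python
import os
import sys
import time


class ChildJobTracker:
    """Hypothetical helper: remembers which child PID runs which job."""

    def __init__(self):
        self.running = {}    # pid -> job_id
        self.finished = {}   # job_id -> raw wait status from waitpid()

    def spawn(self, job_id, argv):
        # P_NOWAIT returns the child's PID without waiting for it to exit.
        pid = os.spawnv(os.P_NOWAIT, argv[0], argv)
        self.running[pid] = job_id
        return pid

    def reap(self):
        """Collect every tracked child that has exited, without blocking."""
        while self.running:
            try:
                pid, status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break        # no children left at all
            if pid == 0:
                break        # children exist, but none has exited yet
            job_id = self.running.pop(pid, None)
            if job_id is not None:
                # This is the point where locks held by job_id would be freed.
                self.finished[job_id] = status


if __name__ == "__main__":
    tracker = ChildJobTracker()
    tracker.spawn("job-1", [sys.executable, "-c", "pass"])
    while "job-1" not in tracker.finished:
        tracker.reap()
        time.sleep(0.05)
    print("job-1 exited, raw status:", tracker.finished["job-1"])
```

Keep in mind that only the process that forked a child can reap it this way.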

Note that if jobs run in separate processes, we need to make sure the
way we handle them survives a restart of the job daemon, since the jobs
themselves can be pretty long-running.
As such, the SIGCHLD option won't work without further changes. But I
also don't believe in tracking PIDs and "checking" them: I think a system
of communication (probably via filesystem sockets) should be in place
for this to be resilient.
What do you think?
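
To make the socket idea a bit more concrete, here is a very rough sketch in
Python. Everything in it is an assumption for the sake of discussion (the
class name, the "send your job ID as the first message" protocol, the socket
path), not a proposal for the actual design. Each job keeps a connection open
to a Unix socket owned by the lock manager. When the job exits or crashes, the
kernel closes its end, the manager sees EOF and knows it can free that job's
locks.

```python
import os
import selectors
import socket
import tempfile


class LockOwnerMonitor:
    """Hypothetical lock-manager side: one open connection per live job."""

    def __init__(self, path):
        self.sel = selectors.DefaultSelector()
        self.server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.server.bind(path)
        self.server.listen()
        self.server.setblocking(False)
        self.sel.register(self.server, selectors.EVENT_READ)
        self.owners = {}     # connection -> job_id (None until announced)
        self.released = []   # job IDs whose connection has gone away

    def poll(self, timeout=1.0):
        """Handle whatever socket events are pending within the timeout."""
        for key, _ in self.sel.select(timeout):
            sock = key.fileobj
            if sock is self.server:
                conn, _ = self.server.accept()
                self.sel.register(conn, selectors.EVENT_READ)
                self.owners[conn] = None
                continue
            try:
                data = sock.recv(256)
            except ConnectionResetError:
                data = b""
            if data:
                # Assumed protocol: the first message is the job ID.
                self.owners[sock] = data.decode().strip()
            else:
                # EOF: the job closed the socket or its process died.
                job_id = self.owners.pop(sock)
                self.sel.unregister(sock)
                sock.close()
                if job_id is not None:
                    self.released.append(job_id)  # free its locks here


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "locks.sock")
    monitor = LockOwnerMonitor(path)
    job = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    job.connect(path)
    job.sendall(b"job-42\n")
    while "job-42" not in monitor.owners.values():
        monitor.poll(0.2)
    job.close()              # stands in for the job process going away
    while not monitor.released:
        monitor.poll(0.2)
    print("locks can be freed for:", monitor.released)
```

Since the socket lives on the filesystem, a job could reconnect to the same
path after the daemon restarts; that part is not shown here.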

>> > >> +leaving the codebase in a consistent and usable state.
>> > >> +
>> > >> +#. Rename QueryD to LuxiD.
>> > >
>> > >
>> > > Already done. QueryD existed only for a day or so and is probably not worth
>> > > mentioning.
>> >
>>
>> If I recall correctly, the review of the patch introducing the renaming was
>> LGTMed (I think by Iustin) after the promise of a design doc explaining the
>> reason for that. This is such a design doc, so I think it should stay here.
>
> I hope my LGTM didn't create problems :/
>
> And thanks for this design, indeed it's what I was curious about :)
>

No problem at all, we wanted the rename anyway.

Thanks,

Guido
