On Thu, Feb 10, 2011 at 11:29:27AM +0100, Stephen Shirley wrote:
> On Thu, Feb 10, 2011 at 11:10, Iustin Pop <ius...@google.com> wrote:
> > On Thu, Feb 10, 2011 at 11:04:31AM +0100, Stephen Shirley wrote:
> >> On Wed, Feb 9, 2011 at 09:58, Iustin Pop <ius...@google.com> wrote:
> >> > Any errors in _MergeConfig will leave the cluster in a "broken" state,
> >> > as nothing will restart the master daemon. You should add some error
> >> > checking here.
> >>
> >> That's not by accident. Once _MergeConfig starts the local config
> >> should be considered broken, as I see it. If it fails, do we really
> >> want to restart masterd?
> >
> > Not saying we want to do that, but we should have some kind of error
> > handling, even if just to say:
> >  raise errors.OpExecError("Cluster merging failed during configuration
> >  merge. The master daemon was left stopped, please investigate and fix
> >  manually").
> 
> Here's what currently happens:
> 
> lrfn1:~# /usr/lib/ganeti/tools/cluster-merge  lrfgnt2
> 2011-02-10 11:22:36,012: ERROR ("Desired group name 'default' already
> exists as a node group (UUID: 9864afa0-5b0b-4695-98b8-5f49d82a53a2)",
> 'already_exists')
> Traceback (most recent call last):
>   File "/usr/lib/ganeti/tools/cluster-merge", line 411, in Merge
>     self._MergeConfig()
>   File "/usr/lib/ganeti/tools/cluster-merge", line 272, in _MergeConfig
>     self._MergeNodeGroups(my_config, other_config)
>   File "/usr/lib/ganeti/tools/cluster-merge", line 309, in _MergeNodeGroups
>     my_config.AddNodeGroup(grp, _CLUSTERMERGE_ECID)
>   File "/usr/lib/python2.6/dist-packages/ganeti/locking.py", line 71,
> in sync_function
>     return fn(*args, **kwargs)
>   File "/usr/lib/python2.6/dist-packages/ganeti/config.py", line 930,
> in AddNodeGroup
>     self._UnlockedAddNodeGroup(group, ec_id, check_uuid)
>   File "/usr/lib/python2.6/dist-packages/ganeti/config.py", line 953,
> in _UnlockedAddNodeGroup
>     errors.ECODE_EXISTS)
> OpPrereqError: ("Desired group name 'default' already exists as a node
> group (UUID: 9864afa0-5b0b-4695-98b8-5f49d82a53a2)", 'already_exists')
> 2011-02-10 11:22:36,016: CRITICAL In order to rollback do the following:
> 2011-02-10 11:22:36,016: CRITICAL   * Remove our key from
> authorized_keys on nodes: ['lrfn4', 'lrfn5', 'lrfn6']
> 2011-02-10 11:22:36,016: CRITICAL   * Start all instances again on the
> merging clusters: ['lrfgnt2']
> 2011-02-10 11:22:36,016: CRITICAL   * Restore
> /var/lib/ganeti/config.data from another master candidate
> lrfn1:~#
> 
> That last step should probably mention restarting the master daemon
> too. Does that cover what you want?

Yep, thanks. Let me review the patch then…
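
For reference, a minimal sketch of the error handling discussed above: wrap
the config merge step, log the failure, and raise an error saying that the
master daemon was left stopped, without trying to restart it. This is only
an illustration, not the actual patch. The OpExecError class is a local
stand-in for errors.OpExecError so the snippet runs on its own, and
merge_with_error_handling and its merge_config argument are hypothetical
names (merge_config stands in for self._MergeConfig).

```python
import logging


class OpExecError(Exception):
    """Local stand-in for ganeti's errors.OpExecError (assumption)."""


def merge_with_error_handling(merge_config):
    """Run the config merge step and report clearly if it fails.

    merge_config is a hypothetical callable standing in for
    self._MergeConfig in the cluster-merge tool.
    """
    try:
        merge_config()
    except Exception as err:
        # Log the full traceback of the original failure.
        logging.exception("Configuration merge failed")
        # Deliberately do NOT restart the master daemon here: once the
        # merge has started, the local config is considered broken.
        raise OpExecError("Cluster merging failed during configuration"
                          " merge. The master daemon was left stopped,"
                          " please investigate and fix manually (%s)" % err)
```

With something like this, a failure such as the OpPrereqError in the
traceback above would surface as a single OpExecError that names the state
the cluster was left in.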

iustin
