On Thu, Feb 10, 2011 at 11:29:27AM +0100, Stephen Shirley wrote:
> On Thu, Feb 10, 2011 at 11:10, Iustin Pop <ius...@google.com> wrote:
> > On Thu, Feb 10, 2011 at 11:04:31AM +0100, Stephen Shirley wrote:
> >> On Wed, Feb 9, 2011 at 09:58, Iustin Pop <ius...@google.com> wrote:
> >> > Any errors in _MergeConfig will leave the cluster in a "broken" state,
> >> > as nothing will restart the master daemon. You should add some error
> >> > checking here.
> >>
> >> That's not by accident. Once _MergeConfig starts the local config
> >> should be considered broken, as i see it. If it fails, do we really
> >> want to restart masterd?
> >
> > Not saying we want to do that, but we should have some kind of error
> > handling, even if just to say:
> > raise errors.OpExecError("Cluster merging failed during configuration
> > merge. The master daemon was left stopped, please investigate and fix
> > manually").
>
> Here's what currently happens:
>
> lrfn1:~# /usr/lib/ganeti/tools/cluster-merge lrfgnt2
> 2011-02-10 11:22:36,012: ERROR ("Desired group name 'default' already
> exists as a node group (UUID: 9864afa0-5b0b-4695-98b8-5f49d82a53a2)",
> 'already_exists')
> Traceback (most recent call last):
>   File "/usr/lib/ganeti/tools/cluster-merge", line 411, in Merge
>     self._MergeConfig()
>   File "/usr/lib/ganeti/tools/cluster-merge", line 272, in _MergeConfig
>     self._MergeNodeGroups(my_config, other_config)
>   File "/usr/lib/ganeti/tools/cluster-merge", line 309, in _MergeNodeGroups
>     my_config.AddNodeGroup(grp, _CLUSTERMERGE_ECID)
>   File "/usr/lib/python2.6/dist-packages/ganeti/locking.py", line 71,
> in sync_function
>     return fn(*args, **kwargs)
>   File "/usr/lib/python2.6/dist-packages/ganeti/config.py", line 930,
> in AddNodeGroup
>     self._UnlockedAddNodeGroup(group, ec_id, check_uuid)
>   File "/usr/lib/python2.6/dist-packages/ganeti/config.py", line 953,
> in _UnlockedAddNodeGroup
>     errors.ECODE_EXISTS)
> OpPrereqError: ("Desired group name 'default' already exists as a node
> group (UUID: 9864afa0-5b0b-4695-98b8-5f49d82a53a2)", 'already_exists')
> 2011-02-10 11:22:36,016: CRITICAL In order to rollback do the following:
> 2011-02-10 11:22:36,016: CRITICAL  * Remove our key from
> authorized_keys on nodes: ['lrfn4', 'lrfn5', 'lrfn6']
> 2011-02-10 11:22:36,016: CRITICAL  * Start all instances again on the
> merging clusters: ['lrfgnt2']
> 2011-02-10 11:22:36,016: CRITICAL  * Restore
> /var/lib/ganeti/config.data from another master candidate
> lrfn1:~#
>
> That last step should probably mention restarting the master daemon
> too. Does that cover what you're wanting?

Yep, thanks. Let me review then the patch…

iustin