On 04/25/2018 10:28 AM, James Slagle wrote:
On Wed, Apr 25, 2018 at 10:55 AM, Dmitry Tantsur <dtant...@redhat.com> wrote:
On 04/25/2018 04:26 PM, James Slagle wrote:

On Wed, Apr 25, 2018 at 9:14 AM, Dmitry Tantsur <dtant...@redhat.com>
wrote:

Hi all,

I'd like to restart the conversation on enabling node automated cleaning by
default for the undercloud. This process wipes partition tables (optionally,
all the data) from overcloud nodes each time they move to the "available"
state (i.e. on initial enrollment and after each tear down).

We have had it disabled for a few reasons:
- it was not possible to skip the time-consuming wiping of data from disks
(see the note after this list)
- the way our workflows used to work required going between the manageable and
available states several times
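
Note on the first point: it is less of a problem these days, since cleaning
can be limited to a fast metadata-only wipe instead of a full disk erase.
Roughly, in ironic.conf (the priority values here are illustrative):

    [deploy]
    # disable the slow full-disk erase
    erase_devices_priority = 0
    # keep the quick wipe of partition tables and other metadata
    erase_devices_metadata_priority = 10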

However, having cleaning disabled has several issues:
- a configdrive left from a previous deployment may confuse cloud-init
- a bootable partition left from a previous deployment may take precedence in
some BIOSes
- a UEFI boot partition left from a previous deployment is likely to confuse
the UEFI firmware
- apparently Ceph does not work correctly without cleaning (I'll defer to the
storage team to comment)

For these reasons we don't recommend having cleaning disabled, and I propose
to re-enable it.

It has the following drawbacks:
- The default workflow will require another node boot, thus becoming several
minutes longer (incl. the CI)
- It will no longer be possible to easily restore a deleted overcloud node.


I'm trending towards -1, for these exact reasons you list as
drawbacks. There has been no shortage of occurrences of users who have
ended up with accidentally deleted overclouds. These are usually
caused by user error or unintended/unpredictable Heat operations.
Until we have a way to guarantee that Heat will never delete a node, or
Heat is entirely out of the picture for Ironic provisioning, I'd prefer
that we didn't enable automated cleaning by default.

I believe we had done something with policy.json at one time to prevent
node delete, but I don't recall if that protected against both
user-initiated actions and Heat actions. And even that was not enabled
by default.
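
For illustration, the kind of override in question would look roughly like
the following, assuming it targeted nova's server delete API (which is what
Heat calls when it removes an overcloud node); the exact rule we used may
have differed:

    # nova policy.json fragment -- deny deleting servers (overcloud nodes)
    {
        "os_compute_api:servers:delete": "!"
    }

The obvious downside is that legitimate scale-down is blocked as well until
the rule is relaxed.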

IMO, we need to keep "safe" defaults, even if it means documenting that you
should manually clean to prevent the issues you point out above. The
alternative is to have no way to recover deleted nodes by default.
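
For completeness, the opt-in manual path would be roughly the following (the
node has to be in the manageable state for manual cleaning, and the clean
step shown is just one plausible choice):

    openstack baremetal node manage <node>
    openstack baremetal node clean --clean-steps \
        '[{"interface": "deploy", "step": "erase_devices_metadata"}]' <node>
    openstack baremetal node provide <node>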


Well, it's not clear what "safe" means here: protecting people who explicitly
delete their stacks, or protecting people who don't realize that a previous
deployment may screw up their new one in a subtle way.

The latter you can recover from, the former you can't if automated
cleaning is true.

It's not just about people who explicitly delete their stacks (whether
intentional or not). There could be user error (non-explicit) or
side-effects triggered by Heat that could cause nodes to get deleted.

You couldn't recover from those scenarios if automated cleaning were
enabled. Whereas you could always fix a deployment error by opting in to an
automated clean. Does Ironic keep track of whether a node has been
previously cleaned? Could we add a validation to check whether any nodes
that were not previously cleaned might be used in the deployment?
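
As far as I know Ironic doesn't expose a persistent "was this node cleaned"
flag, so a validation would have to track that out of band. A hypothetical
sketch with openstacksdk, where the extra['cleaned'] marker is something the
deployment tooling would have to maintain itself:

    # flag "available" nodes that are not known to have been cleaned
    import openstack

    def nodes_possibly_dirty(cloud='undercloud'):
        conn = openstack.connect(cloud=cloud)
        suspect = []
        for node in conn.baremetal.nodes(details=True):
            # extra['cleaned'] is assumed to be set by our own tooling;
            # Ironic does not set it on its own
            if node.provision_state == 'available' and not (node.extra or {}).get('cleaned'):
                suspect.append(node.name or node.id)
        return suspect

    for name in nodes_possibly_dirty():
        print('WARNING: %s may still hold data from a previous deployment' % name)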

Is there a way to only do cleaning right before a node is deployed? If you're about to write a new image to the disk then any data there is forfeit anyway. Since the concern is old data on the disk messing up subsequent deploys, it doesn't really matter whether you clean it right after it's deleted or right before it's deployed, but the latter leaves the data intact for longer in case a mistake was made.

If that's not possible then consider this an RFE. :-)

-Ben

