On 10/5/2018 6:59 PM, melanie witt wrote:
5) when live migration fails due to an internal error, rollback is not handled correctly https://bugs.launchpad.net/nova/+bug/1788014

- Bug was reported on 2018-08-20
- The change that caused the regression landed on 2018-07-26, FF day https://review.openstack.org/434870
- Unrelated to a blueprint, the regression was part of a bug fix
- Found by sean-k-mooney while doing live migrations: when a live migration failed because of a QEMU internal error, the VM remained ACTIVE but no longer had network connectivity.
- Question: why wasn't this caught earlier?
- Answer: We would need a live migration job scenario that intentionally initiates and fails a live migration, then verifies network connectivity after the rollback occurs.
- Question: can we add something like that?

Not in Tempest, no, but we could run something in the nova-live-migration job since that executes via its own script. We could hack something in like what we have proposed for testing evacuate:


The trick is figuring out how to introduce a fault in the destination host without taking down the service, because if the compute service is down we won't schedule to it.
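
Leaving the fault injection aside, the post-rollback check itself could look roughly like the sketch below. It is only an illustration, not anything that exists in the nova-live-migration job today: rollback_problems and its arguments are hypothetical names, it assumes the openstack CLI is available with admin credentials so that the OS-EXT-SRV-ATTR:host column is shown, and it assumes the guest answers ping from wherever the script runs.

```python
import subprocess


def run_cmd(cmd):
    """Run a command and return (exit code, stripped stdout)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()


def show_field(server, column, run):
    """Read one column of `openstack server show` as plain text."""
    _, value = run(["openstack", "server", "show", server,
                    "-f", "value", "-c", column])
    return value


def rollback_problems(server, source_host, ip, run=run_cmd):
    """Return a list of problems seen after a failed live migration.

    An empty list means the rollback looks healthy: the server is
    ACTIVE, still on the source host, and reachable over the network.
    """
    problems = []

    status = show_field(server, "status", run)
    if status != "ACTIVE":
        problems.append(f"status is {status}, expected ACTIVE")

    # Assumption: admin credentials, so the host attribute is visible.
    host = show_field(server, "OS-EXT-SRV-ATTR:host", run)
    if host != source_host:
        problems.append(f"host is {host}, expected {source_host}")

    # The bug above left the VM ACTIVE but unreachable, so status alone
    # is not enough; connectivity has to be checked as well.
    rc, _ = run(["ping", "-c", "3", "-W", "2", ip])
    if rc != 0:
        problems.append(f"{ip} does not answer ping")

    return problems
```

The command runner is passed in so the check can be exercised without a cloud; in the job it would be called after the forced failure, and any non-empty result would fail the run.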

6) nova-manage db online_data_migrations hangs on instances with no host set https://bugs.launchpad.net/nova/+bug/1788115

- Bug was reported on 2018-08-21
- The patch that introduced the bug landed on 2018-05-30 https://review.openstack.org/567878
- Unrelated to a blueprint, the regression was part of a bug fix
- Question: why wasn't this caught earlier?
- Answer: To hit the bug, you had to have had instances with no host set (that failed to schedule) in your database during an upgrade. This does not happen during the grenade job
- Question: could we add anything to the grenade job that would leave some instances with no host set to cover cases like this?

Probably - I'd think creating a server on the old side with some parameters that we know won't schedule would do it, maybe requesting an AZ that doesn't exist, or some other kind of scheduler hint that we know won't work so we get a NoValidHost. However, online_data_migrations in grenade probably don't run on the cell0 database, so I'm not sure we would have caught that case.
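
As a rough sketch of the "AZ that doesn't exist" idea, something like the following could run on the old side before the upgrade. Everything here is an assumption for illustration: the helper names and the "no-such-az" zone name are made up, the server is assumed to end up in ERROR with no host after a NoValidHost, and the OS-EXT-SRV-ATTR:host key in the JSON output may differ between client versions.

```python
import json
import subprocess
import time


def create_cmd(name, image, flavor, network, az="no-such-az"):
    """Build a server create command that should fail scheduling.

    Assumption: requesting an availability zone that does not exist
    leads to NoValidHost rather than being rejected by the API.
    """
    return ["openstack", "server", "create",
            "--image", image, "--flavor", flavor, "--network", network,
            "--availability-zone", az, name]


def show_server(name):
    """Return `openstack server show` output as a dict."""
    out = subprocess.run(
        ["openstack", "server", "show", name, "-f", "json"],
        check=True, capture_output=True, text=True).stdout
    return json.loads(out)


def is_unscheduled(server):
    """True if the server failed to schedule and has no host set."""
    return (server.get("status") == "ERROR"
            and not server.get("OS-EXT-SRV-ATTR:host"))


def wait_until_unscheduled(name, show=show_server, timeout=60, interval=2):
    """Poll until the server is in ERROR with no host, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        if is_unscheduled(show(name)):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A grenade hook would run the command from create_cmd with subprocess, call wait_until_unscheduled, and leave the server in place through the upgrade. This alone would not cover the cell0 concern above unless online_data_migrations is also run against that database.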




OpenStack Development Mailing List (not for usage questions)