As can be seen from logstash [1], this bug is hurting us pretty badly in the check queue.

I originally thought I had this fixed with [2], but that turned out to be only part of the issue.

I think I've identified the problem, but I have failed to write a regression test that recreates it [3]. I think that's because the failure depends on the random ordering of which request spec we select to send to the scheduler during a multi-create request. I tried making that predictable by sorting the instances by uuid in both conductor and the scheduler, but that didn't make a difference in my test.
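
To illustrate what I mean by making the selection predictable, here is a rough sketch. This is not the actual nova code: FakeRequestSpec and pick_spec_for_scheduler are made-up names for illustration only, and the real objects carry far more state. It only shows the idea of sorting by instance uuid so the same spec is picked regardless of the order the specs arrive in.

```python
import uuid
from dataclasses import dataclass


@dataclass
class FakeRequestSpec:
    # Hypothetical stand-in for illustration; not nova's real RequestSpec.
    instance_uuid: str
    num_instances: int


def pick_spec_for_scheduler(specs):
    # Hypothetical helper: sort by instance uuid so the "first" spec
    # is the same no matter what order the specs were passed in.
    return sorted(specs, key=lambda spec: spec.instance_uuid)[0]


# Simulate a multi-create request for three instances.
specs = [FakeRequestSpec(str(uuid.uuid4()), num_instances=3) for _ in range(3)]

chosen = pick_spec_for_scheduler(specs)
chosen_reversed = pick_spec_for_scheduler(list(reversed(specs)))
```

With the sort in place, chosen and chosen_reversed refer to the same spec, which is the kind of determinism I was hoping would let the test hit the failure reliably.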

I started with one fix yesterday [4], but that would regress an earlier fix for resizing servers that are in an anti-affinity group to the same host. If we go that route, it will involve changes to how we handle RequestSpec.num_instances (either not persisting it, or resetting it during move operations).

After talking with Sean Mooney, we have another fix which is self-contained to the scheduler [5], so we wouldn't need to make any changes to the RequestSpec handling in conductor. It's admittedly a bit hairy, so I'm asking for some eyes on it. Whichever way we go, we should get going soon, before we hit the FF and RC1 rush which *always* kills the gate.

[1] http://status.openstack.org/elastic-recheck/index.html#1781710
[2] https://review.openstack.org/#/c/582976/
[3] https://review.openstack.org/#/c/583339
[4] https://review.openstack.org/#/c/583351
[5] https://review.openstack.org/#/c/583347

--

Thanks,

Matt

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev