For the anti-affinity use case: it's really useful for small and medium-sized 
operators who want to offer users some form of failure domain but don't have 
the resources to create AZs at DC scale, or even at rack or row scale. Don't 
forget that as soon as you introduce AZs, you need to grow those AZs at the 
same rate and keep the same flavor offerings across all of them.

For the retry thing, I think enough people have chimed in to echo the general 
sentiment.

-m


On Mon, 2017-05-22 at 16:30 -0600, David Medberry wrote:
I have to agree with James....

My affinity and anti-affinity rules have nothing to do with NFV. Anti-affinity 
is almost always a failure-domain solution. I'm not sure we have users actually 
choosing affinity (though if they did, it would likely be for network speed 
reasons and/or some sort of badly architected, real or perceived, need for 
coupling).

On Mon, May 22, 2017 at 12:45 PM, James Penick <jpen...@gmail.com> wrote:


On Mon, May 22, 2017 at 10:54 AM, Jay Pipes <jaypi...@gmail.com> wrote:
Hi Ops,

Hi!


For class b) causes, we should be able to solve this issue when the placement 
service understands affinity/anti-affinity (maybe Queens/Rocky). Until then, we 
propose that, instead of raising a Reschedule when an affinity constraint is 
violated at the last minute due to a racing scheduler decision, we simply set 
the instance to an ERROR state.

Personally, I have only ever seen anti-affinity/affinity use cases in relation 
to NFV deployments, and in every NFV deployment of OpenStack there is a VNFM or 
MANO solution that is responsible for the orchestration of instances belonging 
to various service function chains. I think it is reasonable to expect the MANO 
system to be responsible for attempting a re-launch of an instance that was set 
to ERROR due to a last-minute affinity violation.

**Operators, do you agree with the above?**

I do not. My affinity and anti-affinity use cases reflect the need to build 
large applications across failure domains in a datacenter.

Anti-affinity: Most anti-affinity use cases relate to the ability to guarantee 
that instances are scheduled across failure domains; others relate to security 
compliance.

Affinity: Hadoop/big-data deployments have affinity use cases, where the nodes 
processing data need to be in the same rack as the nodes that house the data. 
This is a common setup for large Hadoop deployers.
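
For anyone who hasn't wired this up, the anti-affinity case above is expressed 
through a Nova server group. A rough sketch of the user side, using the 
openstacksdk cloud layer with made-up cloud/image/flavor names and the older 
"policies" list form of the server-group API, might look like:

    import openstack

    conn = openstack.connect(cloud='mycloud')   # assumed clouds.yaml entry

    # Server group whose members the scheduler must place on different hosts.
    group = conn.create_server_group('web-tier', ['anti-affinity'])

    # Boot three members of the application into that group; each lands on a
    # different host (failure domain) or the request fails.
    for i in range(3):
        conn.create_server(
            name='web-%d' % i,
            image='ubuntu-20.04',
            flavor='m1.small',
            group=group.id,     # passed to the scheduler as the group hint
            wait=True,
        )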

I recognize that large Ironic users expressed their concerns about IPMI/BMC 
communication being unreliable and not wanting to have users manually retry a 
baremetal instance launch. But, on this particular point, I'm of the opinion 
that Nova should just do one thing and do it well. Nova isn't an orchestrator, 
nor does it intend to be a "just continually try to get me to this eventual 
state" system like Kubernetes.

Kubernetes is a larger orchestration platform that provides autoscaling, among 
other things. I don't expect Nova to provide autoscaling.

I agree that Nova should do one thing and do it really well, and in my mind 
that thing is reliable provisioning of compute resources. I'm not asking for 
Nova to provide autoscaling; I -AM- asking OpenStack's compute platform to 
provision a discrete compute resource reliably. This means overcoming common 
and simple error cases. As a deployer of OpenStack I'm trying to build a cloud 
that wraps the chaos of infrastructure and presents a reliable facade. When my 
users issue a boot request, I want to see it fulfilled. I don't expect a 100% 
guarantee against any possible failure, but I expect (and my users demand) that 
my "Infrastructure as a Service" API make reasonable accommodations to overcome 
common failures.


If we removed Reschedule for class c) failures entirely, large Ironic deployers 
would have to train users to manually retry a failed launch, or would need to 
write a simple retry mechanism into whatever client/UI they expose to their 
users.
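
That "simple retry mechanism" could be little more than a wrapper around the 
boot call that deletes and re-creates an instance which lands in ERROR. A rough 
sketch using the openstacksdk cloud layer, with made-up names, flavors, and 
timeouts:

    import time
    import openstack

    def boot_with_retry(conn, retries=3, timeout=600, **server_kwargs):
        """Boot a server, retrying if it lands in ERROR (e.g. a last-minute
        affinity violation or a flaky BMC on an Ironic node)."""
        for attempt in range(1, retries + 1):
            server = conn.create_server(**server_kwargs)  # returns immediately
            deadline = time.time() + timeout
            while time.time() < deadline:
                server = conn.get_server(server.id)
                if server.status == 'ACTIVE':
                    return server
                if server.status == 'ERROR':
                    break                                 # give up on this attempt
                time.sleep(10)
            conn.delete_server(server.id, wait=True)      # clean up, then try again
        raise RuntimeError('no ACTIVE instance after %d attempts' % retries)

    conn = openstack.connect(cloud='mycloud')             # assumed clouds.yaml entry
    boot_with_retry(conn, name='bm-1', image='centos-7', flavor='baremetal.general')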

**Ironic operators, would the above decision force you to abandon Nova as the 
multi-tenant BMaaS facility?**


I just glanced at one of my production clusters and found around 7K users 
defined, many of whom use OpenStack on a daily basis. When they issue a boot 
call, they expect that request to be honored. From their perspective, if they 
call AWS, they get what they ask for. If you remove reschedules, you're not 
just breaking the expectation of a single deployer; you're breaking it for my 
thousands of engineers who rely on OpenStack every day to manage their stack.

I don't have an "I'll take my football and go home" mentality. But if you 
remove the ability of the compute provisioning API to present a reliable facade 
over infrastructure, I have to go write something else, or patch it back in. 
Now it's even harder for me to get and stay current with OpenStack.

During the summit the agreement was, if I recall correctly, that reschedules 
would happen within a cell, not between the parent and the cell. That was 
completely acceptable to me.

-James


_______________________________________________
OpenStack-operators mailing list
OpenStack-operators@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators