On 01/19/2017 01:18 PM, Alex Schultz wrote:
On Thu, Jan 19, 2017 at 10:34 AM, Jay Pipes <jaypi...@gmail.com> wrote:
On 01/19/2017 11:25 AM, Alex Schultz wrote:

On Thu, Jan 19, 2017 at 8:27 AM, Matt Riedemann
<mrie...@linux.vnet.ibm.com> wrote:

Sylvain and I were talking about how he's going to work placement
microversion requests into his filter scheduler patch [1]. He needs to make
requests to the placement API with microversion 1.4 [2] or later for
resource provider filtering on specific resource classes like VCPU and
MEMORY_MB.
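
To be concrete, the kind of request we're talking about looks roughly like
the sketch below. This is only an illustration of the API call, not Nova's
actual report client code; the endpoint URL and token handling are
placeholders.

    # Rough sketch: ask placement (at microversion 1.4 or later) for
    # resource providers filtered on specific resource class amounts.
    import requests

    PLACEMENT = 'http://placement.example.com/placement'   # placeholder
    headers = {
        'X-Auth-Token': '<token>',                 # placeholder auth
        'OpenStack-API-Version': 'placement 1.4',  # needs 1.4 or later
    }
    # The "resources" query parameter was added in microversion 1.4; it
    # filters resource providers on available amounts of each resource
    # class.
    resp = requests.get(
        '%s/resource_providers' % PLACEMENT,
        params={'resources': 'VCPU:1,MEMORY_MB:512'},
        headers=headers)
    # A Newton placement service doesn't know about 1.4 or the "resources"
    # parameter and will reject the request instead of filtering on it.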

The question was what happens if microversion 1.4 isn't available in the
placement API, i.e. the nova-scheduler is running Ocata code now but the
placement service is running Newton still.

Our rolling upgrades doc [3] says:

"It is safest to start nova-conductor first and nova-api last."

But since placement is bundled with n-api, that would cause issues because
n-sch now depends on the n-api code.

If you package the placement service separately from the nova-api service
then this is probably not an issue. You can still roll out n-api last and
restart it last (for control services), and just make sure that placement
is upgraded before nova-scheduler (we need to be clear about that in [3]).

But do we have any other issues if they are not packaged separately? Is it
possible to install the new code, but still only restart the placement
service before nova-api? I believe it is, but I want to ask this out loud.


Forgive me, as I haven't really looked at this in depth, but if the api and
placement api are both colocated in the same apache instance, this is not
necessarily the simplest thing to achieve. While, yes, it could be achieved,
it will require more manual intervention in the form of custom upgrade
scripts. To me this is not a good idea. My personal preference (now having
dealt with multiple N->O nova-related acrobatics) is that these types of
requirements not be made. We've already run into these assumptions for new
installs as well, specifically in this newer code. Why can't we turn all the
services on and have them properly enter a wait state until such conditions
are satisfied?
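
By "wait state" I mean something as simple as the sketch below: poll the
dependency at startup instead of failing hard when it isn't up yet. Purely
illustrative; this isn't code that exists in Nova or in any deployment
tool, just the shape of the idea.

    # Illustrative only: block at startup until a dependency answers,
    # rather than dying because it isn't up yet.
    import time

    import requests

    def wait_for_endpoint(url, interval=5, timeout=2):
        while True:
            try:
                # Any HTTP answer at all means the dependency is up; a
                # real check would hit a version document or health
                # endpoint instead.
                requests.get(url, timeout=timeout)
                return
            except requests.exceptions.RequestException:
                time.sleep(interval)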


Simply put, because it adds a bunch of conditional, temporary code to the
Nova codebase as a replacement for well-documented upgrade steps.

Can we do it? Yes. Is it kind of a pain in the ass? Yeah, mostly because of
the testing requirements.


<rant>
You mean understanding how people actually consume your software and
handling those cases? To me this is the fundamental problem: if you want
software adoption, understand your user.

The fact that we have these conversations should indicate that we are concerned about users. Nova's developers, more than those of any other OpenStack project, have gone out of their way to make smooth upgrade processes the project's highest priority.

However, deployment/packaging concerns aren't necessarily cloud *user* concerns. And I don't mean to sound like I'm brushing off the concerns of deployers, but deployers don't necessarily *use* the software we produce either. They install, package, and deploy it. It's application developer teams that *use* the software.

What we're really talking about here is catering to a request that simply doesn't have much real-world impact -- to cloud users *or* to deployers, even those using continuous delivery mechanisms.

If there are a few seconds of log lines outputting error messages and some 400 responses returned to the scheduler while a placement API service is upgraded and restarted (again, ONLY if the placement API service is upgraded after the scheduler), I'm cool with that. It's really not a huge deal to me.

What *would* be a big deal is if any of the following occur:

a) The scheduler dies a horrible death and goes offline
b) Any of the compute nodes fail and go offline
c) Anything regarding the tenant data plane is disrupted

Those are the real concerns for us, and if we have introduced code that results in any of the above, we absolutely will prioritize bug fixes ASAP.

But, as far as I know, we have *not* introduced code that would result in any of the above.

> Know what you're doing and the impact on them.

Yeah, sorry, but we absolutely *are* concerned about users. What we're not as concerned about is a few seconds of temporary disruption to the control plane.

> I was just raising awareness around how some people are deploying this
> stuff because it feels that sometimes folks just don't know or don't care.

We *do* care, thus this email and the ongoing conversations on IRC.

> So IMHO adding service startup/restart ordering requirements is not ideal
> for the person who has to run your software because it makes the entire
> process hard and more complex.

Unless I'm mistaken, this is not *required ordering*. It's recommended ordering of service upgrade/restarts in order to minimize/eliminate downtime of the control plane, but the scheduler service shouldn't die due to these issues. The scheduler should just keep logging an error but continuing to operate (even if just continually returning a 400). When the placement API service is upgraded and restarted, the log errors will stop and the scheduler will start returning successfully.

> Why use this when I can just buy a product that does this for me and
> handles these types of cases?

Err... what product does this for you and handles these types of cases that you can buy off the shelf?

> We're not all containers yet which might alleviate some of this

Containers have nothing to do with this really. Containerizing the control plane would merely allow a set of related service endpoints to be built with new/updated versions of the Nova scheduler, API, and placement software, and then let the existing service endpoints be cut over to those new endpoints in a fashion that minimizes downtime. But some of these services save state, and transitioning state changes isn't something that k8s rolling upgrade functionality or AppController functionality will necessarily help with.

> but as there was a push for the placement service specifically to be in a
> shared vhost, this recommended deployment method introduces these kinds of
> complexities. It's not something that just affects me. Squeaky wheel gets
> the hose, I mean grease.
</rant>

But meh, I can whip up an amendment to Sylvain's patch that would add the
self-healing/fallback to legacy behaviour if this is what the operator
community insists on.
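
For illustration only (this is not the actual amendment, just the general
shape of it): try the microversion 1.4 request, and if placement rejects it
because it's still running older code, log a warning and fall back to the
legacy behaviour. The function and endpoint names here are hypothetical.

    # Hypothetical sketch of a fallback-to-legacy path, not real Nova code.
    import logging

    import requests

    LOG = logging.getLogger(__name__)

    def get_filtered_providers(placement_url, base_headers, resources):
        headers = dict(base_headers,
                       **{'OpenStack-API-Version': 'placement 1.4'})
        resp = requests.get(
            '%s/resource_providers' % placement_url,
            params={'resources': resources},   # e.g. 'VCPU:1,MEMORY_MB:512'
            headers=headers)
        if resp.status_code in (400, 406):
            # Placement is still running older code that doesn't understand
            # the request (the thread mentions 400s; a microversion mismatch
            # can also surface as a 406). Let the caller use the legacy path
            # of fetching all providers and filtering scheduler-side.
            LOG.warning('Placement rejected the microversion 1.4 request; '
                        'falling back to legacy filtering.')
            return None
        resp.raise_for_status()
        return resp.json()['resource_providers']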

I think Matt generally has been in the "push forward" camp because we're
tired of delaying improvements to Nova out of some terror that we may cause
some deployer somewhere to have to restart their controller services in a
particular order to minimize any downtime of the control plane.

For the distributed compute nodes, I totally understand the need to tolerate
long rolling upgrade windows. For controller nodes/services, what we're
talking about here is adding code to the Nova scheduler to deal with
something that in 99% of cases won't even be noticed, because the upgrade
tooling will be restarting all these nodes at almost the same time and the
momentary failures that might be logged on the scheduler (400s returned from
the placement API due to using an unknown parameter in a GET request) will
only exist for a second or two as the upgrade completes.

So in our case they will get (re)started at the same time. If that's
not a problem, great. I've seen cases in the past where it's been a
problem because a service actually won't start if a dependent service
is not up yet. That's what I wanted to make sure is not the case here.

It shouldn't be, no. And if it is, that's a bug.

> So if we have documented assurance that restarting both at the same time
> won't cause any problems, or that the interaction is that the api service
> won't be 'up' until the placement service is available, then I'm good.
> I'm not necessarily looking for the 99.9999% uptime. Just that it doesn't
> fall on its face and force us to write extra deployment code for this. :)

Nothing should fall on its face. If it does, that's a bug.

Best,
-jay

So, yeah, a lot of work and testing for very little real-world benefit,
which is why a number of us just want to move forward...


So for me it's an end-user experience thing. I appreciate moving
forward and I think it should be done. But the code you don't want to
write has to be written elsewhere; this isn't something that can simply
go unhandled. So all the code you don't want to handle gets handled in
tens or hundreds or thousands of different implementations of the
upgrade process. So rather than just bulldozing through because you
don't want to do it, perhaps it would be more appropriate to consider
where the best place to handle this is. In terms of OpenStack services,
we really don't have that many expectations about specific ordering of
things, because the services should play nicely together, and it's
better to keep it that way.

Thanks,
-Alex

Best,
-jay


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
