Re: [openstack-dev] [tripleo] tripleo upstream gate outtage, was: -> gate jobs impacted RAX yum mirror

Wesley Hayutin Mon, 14 May 2018 08:59:00 -0700

On Mon, May 14, 2018 at 10:36 AM Jeremy Stanley <[email protected]> wrote:

> On 2018-05-14 07:07:03 -0600 (-0600), Wesley Hayutin wrote:
> [...]
> > I think you may be conflating the notion that ubuntu or rhel/cent
> > can be updated w/o any issues to applications that run atop of the
> > distributions with what it means to introduce a minor update into
> > the upstream openstack ci workflow.
> >
> > If jobs could execute w/o a timeout the tripleo jobs would have
> > not gone red.  Since we do have constraints in the upstream, such
> > as timeouts, we have to prepare containers, images, etc. to
> > work efficiently in the upstream.  For example, if our jobs had
> > the time to yum update the roughly 120 containers in play in each
> > job the tripleo jobs would have just worked.  I am not advocating
> > for not having timeouts or constraints on jobs, however I am
> > saying this is an infra issue, not a distribution or distribution
> > support issue.
> >
> > I think this is an important point to consider and I view it as
> > mostly unrelated to the support claims by the distribution.  Does
> > that make sense?
> [...]
>
> Thanks, the thread jumped straight to suggesting costly fixes
> (separate images for each CentOS point release, adding an evaluation
> period or acceptance testing for new point releases, et cetera)
> without coming anywhere close to exploring the problem space. Is
> your only concern that when your jobs started using CentOS 7.5
> instead of 7.4 they took longer to run?

Yes. If they had unlimited time to run, our workflow would have everything
updated to CentOS 7.5 in the job itself and I would expect everything to
just work.

> What was the root cause? Are
> you saying your jobs consume externally-produced artifacts which lag
> behind CentOS package updates?

Yes, TripleO has externally produced overcloud images and containers, both
of which can be yum updated, but we try to ensure they are frequently
recreated so the yum transaction is small.
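
To illustrate why the size of that yum transaction matters against a job
timeout, here is a rough back-of-the-envelope sketch. Only the figure of
roughly 120 containers comes from this thread; the per-container update
time, the parallelism, and the time budget are made-up placeholders, not
measurements from any TripleO job.

```python
import math


def estimate_update_seconds(containers: int,
                            seconds_per_update: float,
                            parallelism: int) -> float:
    """Estimate wall-clock time to yum update every container.

    Assumes containers are updated in fixed-size batches and that each
    batch takes `seconds_per_update` (a hypothetical average).
    """
    batches = math.ceil(containers / parallelism)
    return batches * seconds_per_update


def fits_in_budget(containers: int,
                   seconds_per_update: float,
                   parallelism: int,
                   budget_seconds: float) -> bool:
    """Return True if the estimated update time fits in the time budget."""
    estimate = estimate_update_seconds(containers, seconds_per_update,
                                       parallelism)
    return estimate <= budget_seconds


if __name__ == "__main__":
    # ~120 containers is from the thread; every other number is a placeholder.
    small = estimate_update_seconds(120, seconds_per_update=10, parallelism=4)
    large = estimate_update_seconds(120, seconds_per_update=60, parallelism=4)
    print("small transaction:", small, "seconds")
    print("large transaction:", large, "seconds")
    print("large fits in a 20 minute budget:",
          fits_in_budget(120, 60, 4, budget_seconds=20 * 60))
```

The point is only that the cost scales with the number of containers, so
a burst of new packages that lengthens each update can push an otherwise
healthy job past its limit.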

> Couldn't a significant burst of new
> packages cause the same symptoms even without it being tied to a
> minor version increase?
>

Yes, certainly this could happen outside of a minor update of the baseos.

>
> This _doesn't_ sound to me like a problem with how we've designed
> our infrastructure, unless there are additional details you're
> omitting.

So the only thing out of our control is the package set on the base
nodepool image.
If that suddenly gets updated with too many packages, then we have to
scramble to ensure the images and containers are also updated.
If there is a breaking change in the nodepool image, for example [a], we
have to react to and fix that as well.

> It sounds like a problem with how the jobs are designed
> and expectations around distros slowly trickling package updates
> into the series without occasional larger bursts of package deltas.
> I'd like to understand more about why you upgrade packages inside
> your externally-produced container images at job runtime at all,
> rather than relying on the package versions baked into them.

We do that to ensure the gerrit review itself and its dependencies are
built via rpm and injected into the build.
If we did not do this, the job would not be testing the change at all.
This is a result of being a package-based deployment, for better or worse.

> It
> seems like you're arguing that the existence of lots of new package
> versions which aren't already in your container images is the
> problem, in which case I have trouble with the rationalization of it
> being "an infra issue" insofar as it requires changes to the
> services as provided by the OpenStack Infra team.
>
> Just to be clear, we didn't "introduce a minor update into the
> upstream openstack ci workflow." We continuously pull CentOS 7
> packages into our package mirrors, and continuously rebuild our
> centos-7 images from whatever packages the distro says are current.
>

Understood, which I think is fine and probably works for most projects.
An enhancement could be to stage the new images for, say, one week or so.
Do we need the CentOS updates immediately? Is there a possible path that
does not create a lot of work for infra, but also provides some space for
projects to prep for the consumption of the updates?
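
To make the staging idea concrete, here is a minimal sketch of the
selection rule I have in mind: serve the newest image that has already
"soaked" for a given period. This is only an illustration. The `Image`
type, the `pick_staged_image` function, and the image names are
hypothetical and do not reflect how nodepool selects images today.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional, Sequence


class Image(NamedTuple):
    """Hypothetical record for a built base image."""
    name: str
    built_at: datetime


def pick_staged_image(images: Sequence[Image],
                      now: datetime,
                      soak: timedelta = timedelta(days=7)) -> Optional[Image]:
    """Return the newest image that is at least `soak` old, or None."""
    eligible = [img for img in images if now - img.built_at >= soak]
    return max(eligible, key=lambda img: img.built_at, default=None)


if __name__ == "__main__":
    now = datetime(2018, 5, 14)
    images = [
        Image("centos-7-a", datetime(2018, 5, 1)),
        Image("centos-7-b", datetime(2018, 5, 6)),
        Image("centos-7-c", datetime(2018, 5, 13)),  # too new, still staging
    ]
    chosen = pick_staged_image(images, now)
    print(chosen.name if chosen else "no image has finished staging")
```

During the soak window projects could test against the newest image
without it being the default, which is the "space to prep" I am asking
about.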

> Our automation doesn't know that there's a difference between
> packages which were part of CentOS 7.4 and 7.5 any more than it
> knows that there's a difference between Ubuntu 16.04.2 and 16.04.3.
> Even if we somehow managed to pause our CentOS image updates
> immediately prior to 7.5, jobs would still try to upgrade those
> 7.4-based images to the 7.5 packages in our mirror, right?
>

Understood, I suspect this will become a more widespread issue as
more projects start to use containers (not sure).  It's my understanding
that there are some mechanisms in place to pin packages in the centos
nodepool image, so there has been some thought generally in the area of
this issue.

TripleO may be the exception to the rule here and that is fine. I'm more
interested in exploring the possibilities of delivering updates in a
staged fashion than anything. I don't have insight into what the
possibilities are, or if other projects have similar issues or
requests.  Perhaps the TripleO project could share the details of our job
workflow with the community and this would make more sense.

I appreciate the time, effort, and thoughts you have shared in the thread.

> --
> Jeremy Stanley
>

[a] https://bugs.launchpad.net/tripleo/+bug/1770298

> __________________________________________________________________________
> OpenStack Development Mailing List (not for usage questions)
> Unsubscribe: [email protected]?subject:unsubscribe
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>

