On Tue, Oct 30, 2018 at 12:25 PM Clark Boylan <cboy...@sapwetik.org> wrote:
>
> On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> > On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec <openst...@nemebean.com> wrote:
> > >
> > > Tagging with tripleo since my suggestion below is specific to that 
> > > project.
> > >
> > > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > > Hello everyone,
> > > >
> > > > A little while back I sent email explaining how the gate queues work 
> > > > and how fixing bugs helps us test and merge more code. All of this 
> > > > is still true and we should keep pushing to improve our testing 
> > > > to avoid gate resets.
> > > >
> > > > Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In 
> > > > the process of doing this we had to restart Zuul which brought in a new 
> > > > logging feature that exposes node resource usage by jobs. Using this 
> > > > data I've been able to generate some report information on where our 
> > > > node demand is going. This change [0] produces this report [1].
> > > >
> > > > As with optimizing software we want to identify which changes will have 
> > > > the biggest impact and to be able to measure whether or not changes 
> > > > have had an impact once we have made them. Hopefully this information 
> > > > is a start at doing that. Currently we can only look back to the point 
> > > > Zuul was restarted, but we have a thirty day log rotation for this 
> > > > service and should be able to look at a month's worth of data going 
> > > > forward.
> > > >
> > > > Looking at the data you might notice that Tripleo is using many more 
> > > > node resources than our other projects. They are aware of this and have 
> > > > a plan [2] to reduce their resource consumption. We'll likely be using 
> > > > this report generator to check progress of this plan over time.
> > >
> > > I know at one point we had discussed reducing the concurrency of the
> > > tripleo gate to help with this. Since tripleo is still using >50% of the
> > > resources it seems like maybe we should revisit that, at least in the
> > > short term until the more major changes can be made? Looking through the
> > > merge history for tripleo projects I don't see a lot of cases (any, in
> > > fact) where more than a dozen patches made it through anyway*, so I
> > > suspect it wouldn't have a significant impact on gate throughput, but it
> > > would free up quite a few nodes for other uses.
> > >
> >
> > It's the failures in the gate and the resets they cause.  At this point
> > I think it would be a good idea to turn down the concurrency of the
> > tripleo queue in the gate if possible. Lately it's been timeouts, but
> > we've been unable to track down specifically why jobs are timing out.
> > I personally suspect the container download times, since we do not have
> > a local registry available and can only leverage the mirrors for some
> > level of caching. Unfortunately we don't get good visibility into this
> > from docker (or the mirrors), and it's really hard to determine what
> > exactly makes things run slower.
>
> We actually tried this not too long ago
> (https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b)
> but decided to revert it because it didn't decrease the check queue backlog 
> significantly. We were still running several hours behind most of the time.
>
> If we want to set up better monitoring and measurement and try it again we
> can do that. But we probably want to measure queue sizes with and without a
> change like that to better understand whether it helps.
>
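
For the measuring piece, something along these lines might be enough:
sample Zuul's status endpoint on an interval and record the per-pipeline
depth, then compare samples from before and after a window change. This is
only a rough sketch; the endpoint path and the JSON layout (pipelines ->
change_queues -> heads) are assumptions based on the public status page,
not verified against the current API:

import json
import time
import urllib.request

# Assumed endpoint; adjust if the API path differs.
STATUS_URL = 'http://zuul.openstack.org/api/status'

def pipeline_depths():
    """Return a {pipeline name: queued item count} mapping."""
    with urllib.request.urlopen(STATUS_URL) as resp:
        status = json.load(resp)
    depths = {}
    for pipeline in status.get('pipelines', []):
        count = 0
        for queue in pipeline.get('change_queues', []):
            for head in queue.get('heads', []):
                count += len(head)
        depths[pipeline['name']] = count
    return depths

while True:
    depths = pipeline_depths()
    print(time.strftime('%Y-%m-%dT%H:%M:%S'),
          'check=%s gate=%s' % (depths.get('check'), depths.get('gate')))
    time.sleep(300)  # one sample every five minutes

Dumping that to a file for a week on either side of the change would give
us something concrete to compare.
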
> As for container image download times, can we quantify that via docker
> logs? Basically, sum up the amount of time a job spends downloading images
> so that we can see what the impact is, and also measure whether changes
> improve it. As for other ideas for improving things: many of the images
> that tripleo uses seem quite large. I recall seeing a > 600MB image just
> for rsyslog. Wouldn't it be advantageous for both the gate and tripleo in
> the real world to trim the size of those images (which should improve
> download times)? In any case, quantifying the size of the downloads and
> trimming them where possible is likely also worthwhile.
>

So it's not that simple: we don't download all the images in one
distinct task, and AFAIK there isn't any information provided around
size or speed.  Additionally we aren't doing anything special with
the images (they're mostly kolla-built containers with a handful of
tweaks), so that's just the size of the containers.  I am currently
working on reducing tripleo-specific dependencies (i.e. removal of
instack-undercloud, etc.) in hopes that we'll shave off some of the
dependencies, but it seems that there's a larger bloat issue around
containers in general.  I have no idea why the rsyslog container would
be 600MB, but yeah, that does seem excessive.
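
That said, if we only need a first-order number to track over time, one
crude option is to time the pulls outside the normal job flow and read back
the sizes docker records afterwards. A rough sketch (the image name is just
a placeholder, not our actual list, and this won't reflect the caching a
real job sees):

import subprocess
import time

# Placeholder list; a job would substitute the images it actually pulls.
IMAGES = [
    'docker.io/kolla/centos-binary-rsyslog:latest',
]

for image in IMAGES:
    start = time.monotonic()
    # Wall-clock the pull itself.
    subprocess.run(['docker', 'pull', image], check=True)
    elapsed = time.monotonic() - start
    # Ask docker for the size it recorded for the image.
    size = subprocess.run(
        ['docker', 'image', 'inspect', '--format', '{{.Size}}', image],
        check=True, stdout=subprocess.PIPE,
        universal_newlines=True).stdout.strip()
    print('%s: pulled in %.1fs, %s bytes' % (image, elapsed, size))

It's not the real download path, but it would at least tell us whether
trimming an image actually moves the number.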

> Clark

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
