Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

Ben Nemec Tue, 30 Oct 2018 13:02:36 -0700


On 10/30/18 1:25 PM, Clark Boylan wrote:

On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:

On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec <[email protected]> wrote:


Tagging with tripleo since my suggestion below is specific to that project.

On 10/30/18 11:03 AM, Clark Boylan wrote:

Hello everyone,

A little while back I sent email explaining how the gate queues work and how 
fixing bugs helps us test and merge more code. All of this still is still true 
and we should keep pushing to improve our testing to avoid gate resets.

Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the 
process of doing this we had to restart Zuul which brought in a new logging 
feature that exposes node resource usage by jobs. Using this data I've been 
able to generate some report information on where our node demand is going. 
This change [0] produces this report [1].

As with optimizing software we want to identify which changes will have the 
biggest impact and to be able to measure whether or not changes have had an 
impact once we have made them. Hopefully this information is a start at doing 
that. Currently we can only look back to the point Zuul was restarted, but we 
have a thirty day log rotation for this service and should be able to look at a 
month's worth of data going forward.

Looking at the data you might notice that Tripleo is using many more node 
resources than our other projects. They are aware of this and have a plan [2] 
to reduce their resource consumption. We'll likely be using this report 
generator to check progress of this plan over time.


I know at one point we had discussed reducing the concurrency of the
tripleo gate to help with this. Since tripleo is still using >50% of the
resources it seems like maybe we should revisit that, at least for the
short-term until the more major changes can be made? Looking through the
merge history for tripleo projects I don't see a lot of cases (any, in
fact) where more than a dozen patches made it through anyway*, so I
suspect it wouldn't have a significant impact on gate throughput, but it
would free up quite a few nodes for other uses.


It's the failures in gate and resets.  At this point I think it would
be a good idea to turn down the concurrency of the tripleo queue in
the gate if possible. As of late it's been timeouts but we've been
unable to track down why it's timing out specifically.  I personally
have a feeling it's the container download times since we do not have
a local registry available and are only able to leverage the mirrors
for some levels of caching. Unfortunately we don't get the best
information about this out of docker (or the mirrors) and it's really
hard to determine what exactly makes things run a bit slower.


We actually tried this not too long ago 
https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
 but decided to revert it because it didn't decrease the check queue backlog 
significantly. We were still running at several hours behind most of the time.

I'm surprised to hear that. Counting the tripleo jobs in the gate atpositions 11-20 right now, I see around 84 nodes tied up in long-runningjobs and another 32 for shorter unit test jobs. The latter probablydon't have much impact, but the former is a non-trivial amount. It maynot erase the entire 2300+ job queue that we have right now, but itseems like it should help.


If we want to set up better monitoring and measuring and try it again we can do 
that. But we probably want to measure queue sizes with and without the change 
like that to better understand if it helps.

This seems like good information to start capturing, otherwise we arekind of just guessing. Is there something in infra already that we coulduse or would it need to be new tooling?


As for container image download times can we quantify that via docker logs? 
Basically sum up the amount of time spent by a job downloading images so that we 
can see what the impact is but also measure if changes improve that? As for other 
ideas improving things seems like many of the images that tripleo use are quite 
large. I recall seeing a > 600MB image just for rsyslog. Wouldn't it be 
advantageous for both the gate and tripleo in the real world to trim the size of 
those images (which should improve download times). In any case quantifying the 
size of the downloads and trimming those if possible is likely also worthwhile.

Clark

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Re: [openstack-dev] [tripleo] Zuul Queue backlogs and resource usage

Reply via email to