Tagging with tripleo since my suggestion below is specific to that project.

On 10/30/18 11:03 AM, Clark Boylan wrote:
Hello everyone,

A little while back I sent email explaining how the gate queues work and how 
fixing bugs helps us test and merge more code. All of this still is still true 
and we should keep pushing to improve our testing to avoid gate resets.

Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the 
process of doing this we had to restart Zuul which brought in a new logging 
feature that exposes node resource usage by jobs. Using this data I've been 
able to generate some report information on where our node demand is going. 
This change [0] produces this report [1].

As with optimizing software we want to identify which changes will have the 
biggest impact and to be able to measure whether or not changes have had an 
impact once we have made them. Hopefully this information is a start at doing 
that. Currently we can only look back to the point Zuul was restarted, but we 
have a thirty day log rotation for this service and should be able to look at a 
month's worth of data going forward.

Looking at the data you might notice that Tripleo is using many more node 
resources than our other projects. They are aware of this and have a plan [2] 
to reduce their resource consumption. We'll likely be using this report 
generator to check progress of this plan over time.

I know at one point we had discussed reducing the concurrency of the tripleo gate to help with this. Since tripleo is still using >50% of the resources it seems like maybe we should revisit that, at least for the short-term until the more major changes can be made? Looking through the merge history for tripleo projects I don't see a lot of cases (any, in fact) where more than a dozen patches made it through anyway*, so I suspect it wouldn't have a significant impact on gate throughput, but it would free up quite a few nodes for other uses.

*: I have no actual stats to back that up, I'm just looking through the IRC backlog for merge bot messages. If such stats do exist somewhere we should look at them instead. :-)


Also related to the long queue backlogs is this proposal [3] to change how Zuul 
prioritizes resource allocations to try to be more fair.

[0] https://review.openstack.org/#/c/613674/
[1] http://paste.openstack.org/show/733644/
[2] http://lists.openstack.org/pipermail/openstack-dev/2018-October/135396.html
[3] http://lists.zuul-ci.org/pipermail/zuul-discuss/2018-October/000575.html

If you find any of this interesting and would like to help feel free to reach 
out to myself or the infra team.

Thank you,
Clark

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to