On 01/18/2018 09:45 AM, Emilien Macchi wrote:
On Thu, Jan 18, 2018 at 6:34 AM, Or Idgar <[email protected]> wrote:
Hi,
we're encountering many timeouts for zuul gates in TripleO.
For example, see
http://logs.openstack.org/95/508195/28/check-tripleo/tripleo-ci-centos-7-ovb-ha-oooq/c85fcb7/.
Rechecks don't help: a given gate sometimes finishes successfully and
sometimes does not.
The problem is that after a recheck it is not always the same gate that
fails.
Does anyone have access to the server load data to see what is causing this?
Alternatively, is there something we can do to reduce the running
time of each gate?
We're migrating to RDO Cloud for OVB jobs:
https://review.openstack.org/#/c/526481/
It's a work in progress but will help a lot for OVB timeouts on RH1.
I'll let the CI folks comment on that topic.
I noticed that the timeouts on rh1 have been especially bad as of late
so I did a little testing and found that it did seem to be running more
slowly than it should. After some investigation I found that 6 of our
compute nodes have warning messages that the cpu was throttled due to
high temperature. I've disabled 4 of them that had a lot of warnings.
The other 2 only had a handful of warnings so I'm hopeful we can leave
them active without affecting job performance too much. It won't
accomplish much if we disable the overheating nodes only to overload the
remaining ones.
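
For anyone who wants to run a similar check on their own compute nodes, here is a rough sketch of the kind of triage described above. It is not the script that was used on rh1. It assumes the kernel log lines look like the usual x86 thermal throttle message ("CPU3: Core temperature above threshold, cpu clock throttled ..."), and the node names, the `triage` helper, and the `disable_threshold` value are all made up for illustration.

```python
import re

# Assumed log format: the message the x86 thermal throttle driver typically
# prints, for example:
#   "CPU3: Core temperature above threshold, cpu clock throttled (total events = 12)"
THROTTLE_RE = re.compile(
    r"CPU\d+: (?:Core|Package) temperature above threshold, cpu clock throttled"
)


def count_throttle_warnings(log_lines):
    """Count kernel log lines that report thermal throttling."""
    return sum(1 for line in log_lines if THROTTLE_RE.search(line))


def triage(warnings_by_node, disable_threshold):
    """Split nodes into those to disable and those to keep an eye on.

    warnings_by_node: dict mapping a (hypothetical) node name to its
    throttle warning count.
    disable_threshold: hypothetical cutoff; nodes at or above it are
    disabled, nodes with fewer (but nonzero) warnings stay active.
    """
    disable = sorted(
        node for node, count in warnings_by_node.items()
        if count >= disable_threshold
    )
    watch = sorted(
        node for node, count in warnings_by_node.items()
        if 0 < count < disable_threshold
    )
    return disable, watch


if __name__ == "__main__":
    # Illustrative counts only, not real data from rh1.
    sample = {"compute-a": 250, "compute-b": 3, "compute-c": 0}
    to_disable, to_watch = triage(sample, disable_threshold=100)
    print("disable:", to_disable)
    print("watch:", to_watch)
```

The cutoff is a judgment call: set it too low and the remaining nodes pick up the extra load, which is the tradeoff mentioned above.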
I'll follow up with our hardware people and see if we can determine why
these specific nodes are overheating. They seem to be running 20
degrees C hotter than the rest of the nodes.
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev