Hello,

We have a few particularly annoying bugs that have been impacting the
reliability of gate testing recently. It would be great if we could get
volunteers to look at these bugs to improve the reliability of our testing as we
start working on Pike.

These two issues have been identified by elastic-recheck as being our biggest
problems:

1. SSH Banner bug: http://status.openstack.org/elastic-recheck/#1349617

This bug is a longstanding issue that comes and goes and has lots of very
similar (but subtly different) failure modes. Tempest attempts to SSH into the
cirros guest and fails to log in after 18 attempts over the 300 sec timeout
window. Paramiko reports that there was an issue reading the banner returned on
port 22 from the guest, which indicates that something is likely responding on
port 22. We're trying to get more details on the cause with:

https://review.openstack.org/437128
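
As a rough illustration of the distinction described above (something
answering on port 22 versus a readable SSH banner), here is a minimal Python
sketch that uses a plain socket rather than Paramiko. The function name
probe_ssh, the 10 second timeout, and the three result labels are assumptions
made for illustration; this is not what Tempest or the review above does.

```python
import socket

MAX_BANNER_BYTES = 255  # RFC 4253 caps the identification line at 255 bytes


def probe_ssh(host, port=22, timeout=10.0):
    """Classify what happens on the SSH port of a guest.

    Returns a (status, banner) tuple where status is one of:
      "unreachable" - the TCP connection could not be established
      "no_banner"   - something accepted the connection but sent no SSH banner
      "banner"      - a line starting with "SSH-" was received
    Simplification: only the first line sent by the server is inspected.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)
            data = b""
            while b"\n" not in data and len(data) < MAX_BANNER_BYTES:
                try:
                    chunk = sock.recv(MAX_BANNER_BYTES - len(data))
                except OSError:
                    # Timed out or reset while waiting for the banner.
                    break
                if not chunk:
                    # Peer closed the connection without sending more data.
                    break
                data += chunk
    except OSError:
        return ("unreachable", None)

    line = data.split(b"\n", 1)[0].strip()
    if line.startswith(b"SSH-"):
        return ("banner", line.decode("ascii", errors="replace"))
    return ("no_banner", None)
```

A "no_banner" result corresponds to the case in this bug: the port accepts
connections, but the SSH identification string never arrives in time.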

2. Libvirt crashes: http://status.openstack.org/elastic-recheck/#1643911 and
http://status.openstack.org/elastic-recheck/#1646779

Libvirt is randomly crashing during the job, which causes things to fail (for
obvious reasons). Addressing this will likely require someone with experience
debugging libvirt, since it's most likely a bug isolated to libvirt. Tonyb has
offered to start working on this, so talk to him to coordinate efforts around
fixing it.

The other thing to note is the oom-killer bug:
http://status.openstack.org/elastic-recheck/gate.html#1656386
While there aren't a lot of hits in logstash for this particular bug, it does
raise an important issue about the increased memory pressure on the test nodes.
It's likely that a lot of the instability is related to the increased load on
the nodes. As a starting point, all projects should look at their memory
footprint and see where they can trim things to improve the situation.
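
One possible way to get a first look at memory footprint on a node is to total
resident memory per command name. The sketch below is only an illustration and
is not part of any gate tooling: the function names are hypothetical, it
assumes a Linux host with a procps-style ps, and summing RSS overcounts pages
shared between processes.

```python
import subprocess
from collections import defaultdict


def rss_by_command(ps_output):
    """Sum resident set size (KiB) per command name.

    Expects text in the shape produced by `ps -eo rss=,comm=`: one process
    per line, RSS in KiB followed by the command name. Lines that do not
    start with a number are ignored.
    """
    totals = defaultdict(int)
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        totals[parts[1].strip()] += int(parts[0])
    # Largest consumers first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def snapshot():
    """Run ps on the local host and return the per-command totals."""
    result = subprocess.run(
        ["ps", "-eo", "rss=,comm="],
        capture_output=True, text=True, check=True,
    )
    return rss_by_command(result.stdout)
```

Calling snapshot() before and after a test run and comparing the top entries
gives a coarse picture of which services grew the most.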

As a friendly reminder, we track bug incidence rates within our testing using
the elastic-recheck tool. You can find that data at
http://status.openstack.org/elastic-recheck. It can be quite useful to start
there when determining which bugs to fix based on impact. Elastic-recheck also
maintains a list of failures that occurred without a known signature:
http://status.openstack.org/elastic-recheck/data/integrated_gate.html

We also need some people to help maintain the list of existing queries. We have
a lot of queries for closed bugs that have no hits, and others that are overly
broad and match failures unrelated to the bug. This would also be a good task
for a new person to start getting involved with. Feel free to submit patches to
https://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/queries to
track new issues.

Thank you,

mtreinish and clarkb

