On 12/03/2013 09:30 AM, Eoghan Glynn wrote:

On 12/02/2013 10:24 AM, Julien Danjou wrote:
On Fri, Nov 29 2013, David Kranz wrote:

In preparing to fail builds with log errors I have been trying to make
things easier for projects by maintaining a whitelist. But these bugs in
ceilometer are coming in so fast that I can't keep up. So I am just putting
".*" in the whitelist for any cases I find before gate failing is turned
on, hopefully early this week.
Following the chat on IRC and the bug reports, it seems this might come
from the tempest tests that are under review, as currently I don't
think Ceilometer generates any error as it's not tested.

So I'm not sure we want to whitelist anything?
So I tested this with https://review.openstack.org/#/c/59443/. There are
flaky log errors coming from ceilometer. You
can see that the build at 12:27 passed, but the last build failed twice,
each with a different set of errors. So the whitelist needs to remain
and the ceilometer team should remove each entry when it is believed to
be unnecessary.
Hi David,

Just looking into this issue.

So when you say the build failed, do you mean that errors were detected
in the ceilometer log files? (as opposed to a specific Tempest testcase
having reported a failure)
Yes, exactly. This patch removed the whitelist entries for ceilometer and so those errors then "failed" the build.

If that interpretation of build failure is correct, I think there's a simple
explanation for the compute agent ERRORs seen in the log file for the CI
build related to your patch referenced above, specifically:

   ERROR ceilometer.compute.pollsters.disk [-] Requested operation is not valid: domain is not running

The problem I suspect is a side-effect of a nova test that suspends the
instance in question, followed by a race between the ceilometer logic that
discovers the local instances via the nova-api followed by the individual
pollsters that call into the libvirt daemon to gather the disk stats etc.
It appears that the libvirt virDomainBlockStats() call fails with "domain
is not running" for suspended instances.

This would only occur intermittently as it requires the instance to
remain in the suspended state across a polling interval boundary.

So we need to tighten up our logic there to avoid spewing needless errors
when a very normal event occurs (i.e. instance suspension).
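
To make the idea concrete, here is a rough sketch of one possible approach:
treat "domain is not running" as an expected condition and log it below
ERROR level. This is not the actual ceilometer code. The names
DomainNotRunning, poll_disk_stats and get_block_stats are hypothetical
stand-ins, and the real libvirt error handling would look different.

```python
import logging

LOG = logging.getLogger("pollster_sketch")


class DomainNotRunning(Exception):
    """Hypothetical stand-in for libvirt's 'domain is not running' error."""


def poll_disk_stats(instances, get_block_stats):
    """Collect disk stats per instance, skipping domains that are not running.

    `instances` is the list discovered earlier via the API; an instance may
    have been suspended between discovery and polling, so a failure here is
    treated as a normal event rather than an error.
    """
    samples = {}
    for name in instances:
        try:
            samples[name] = get_block_stats(name)
        except DomainNotRunning:
            # Expected when the instance was suspended after discovery:
            # log at DEBUG so the ERROR-level log check is not tripped.
            LOG.debug("Skipping %s: domain is not running", name)
    return samples
```

Whether to skip the instance silently, log at a lower level, or check the
instance state before polling is exactly the sort of thing that needs
discussion.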

I've filed a bug[1] with some ideas for addressing the issue - this
will require a bit of discussion before agreeing a way forward, but I'll
prioritize getting this knocked on the head asap.
Great! Thanks. The change I pushed yesterday should help prevent this
sort of thing from creeping in across all projects. But as Julien
observed, removing whitelist entries that are no longer needed after
bug fixes is neither easy nor automatic. I'm trying to put together a
script that will check the whitelist entries against the last two weeks
of builds using logstash, but that is not so simple since general
regexps cannot be used with logstash.
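
For illustration only, the core of such a check could look like the
sketch below if the log lines were available locally rather than through
logstash. The function name stale_entries and the sample patterns are
hypothetical; this is not the script mentioned above, and it sidesteps
the logstash regexp limitation by matching in Python.

```python
import re


def stale_entries(whitelist, log_lines):
    """Return the whitelist patterns that match none of the given log lines.

    `whitelist` is a list of regular expression strings and `log_lines` is
    a list of log lines gathered from recent builds. A pattern with no
    match is a candidate for removal from the whitelist.
    """
    compiled = [(pattern, re.compile(pattern)) for pattern in whitelist]
    return [
        pattern
        for pattern, regex in compiled
        if not any(regex.search(line) for line in log_lines)
    ]
```

A pattern reported here is only a candidate for removal: an intermittent
error may simply not have occurred in the sampled builds.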


