On 10/23/2013 03:35 PM, Robert Collins wrote:
On 24 October 2013 08:28, John Griffith <[email protected]> wrote:
So I touched on this a bit in my earlier post but want to reiterate here and
maybe clarify a bit.  I agree that cleaning up and standardizing the logs is
a good thing, and particularly removing unhandled exception messages would
be good.  What concerns me however is the approach being taken here of
saying things like "Error level messages are banned from Tempest runs".

The case I mentioned earlier of the negative test is a perfect example.
There's no way for Cinder (or any other service) to know the difference
between the end user specifying/requesting a non-existent volume and a valid
volume being requested that for some reason can't be found.  I'm not quite
sure how you place a definitive rule like "no error messages in logs" unless
you make your tests such that you never run negative tests?

Let me check that I understand: you want to check that when a user
asks for a volume that doesn't exist, they don't get it, *and* that
the reason they didn't get it was due to Cinder detecting it's
missing, not due to e.g. Cinder throwing an error and returning 500?

If so, that seems pretty straightforward; a) check the error that is
reported (it should be a 404 and contain an explanation which we can
check) and b) check the logs to see that nothing was logged (because a
server fault would be logged).
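A minimal sketch of that two-part check, not taken from Cinder or Tempest. The names get_volume, VolumeNotFound, VOLUMES and run_negative_test are hypothetical stand-ins, and the "service" is just an in-process function so the log records can be captured directly:

```python
import logging

LOG = logging.getLogger("volume_sketch")

# Hypothetical in-memory stand-in for the volume database.
VOLUMES = {"vol-1": {"size_gb": 10}}


class VolumeNotFound(Exception):
    """Hypothetical exception that the API layer would map to HTTP 404."""
    status = 404


def get_volume(volume_id):
    """Return a volume, or raise a 404-style error for an unknown ID."""
    if volume_id not in VOLUMES:
        # A user asking for something that does not exist is not a server fault.
        LOG.info("Volume %s not found", volume_id)
        raise VolumeNotFound("Volume %s could not be found." % volume_id)
    return VOLUMES[volume_id]


class RecordCollector(logging.Handler):
    """Keep every log record in memory so a test can inspect it."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


def run_negative_test():
    """Request a missing volume; return (status, message, error_records)."""
    collector = RecordCollector()
    LOG.addHandler(collector)
    LOG.setLevel(logging.DEBUG)
    try:
        try:
            get_volume("no-such-volume")
        except VolumeNotFound as exc:
            # a) the reported error: status code and explanation
            status, message = exc.status, str(exc)
        else:
            raise AssertionError("expected a 404 for a missing volume")
    finally:
        LOG.removeHandler(collector)
    # b) the logs: nothing at ERROR or above should have been recorded
    errors = [r for r in collector.records if r.levelno >= logging.ERROR]
    return status, message, errors
```

The test passes when the status is 404, the message carries the explanation, and the list of ERROR records is empty.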

There are other cases in cinder as well that I'm concerned about.  One
example is iscsi target creation, there are a number of scenarios where this
can fail under certain conditions.  In most of these cases we now have retry
mechanisms or alternate implementations to complete the task.  The fact is
however that a call somewhere in the system failed, this should be something
in my opinion that stands out in the logs.  Maybe this particular case would
be well suited to being a warning rather than an error, and that's fine.  My
point, however, is that I think some thought needs to go into this
before making blanket rules and especially gating criteria that say "no
error messages in logs".

Absolutely agreed. That's why I wanted to kick off this discussion and am thinking about how we get to agreement by Icehouse (giving this lots of time to bake and getting different perspectives in here).

In the short term, on failing jobs in Tempest because they've got errors in the logs: we've got a whole whitelist mechanism right now for "acceptable errors". Over time I'd love to shrink that to 0. But that's going to be a collaboration between the QA team and the specific core projects to make sure that's the right call in each case. Who knows, maybe there are generally agreed-upon ERROR conditions that we trigger, but we'll figure that out over time.
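A rough sketch of how such a whitelist check could work. This is not the actual Tempest mechanism; ALLOWED_ERRORS, unexpected_errors and the pattern shown are hypothetical, and the log format is assumed to contain the level name as a separate word:

```python
import re

# Hypothetical whitelist: project name -> patterns for ERROR lines that are
# tolerated for now. The goal described above is to shrink this to nothing.
ALLOWED_ERRORS = {
    "cinder": [re.compile(r"iscsi target creation failed")],
}

# Assumed log format: the level appears as a standalone word in the line.
ERROR_LINE = re.compile(r"\b(ERROR|CRITICAL)\b")


def unexpected_errors(project, log_lines):
    """Return the ERROR/CRITICAL lines that no whitelist pattern covers."""
    allowed = ALLOWED_ERRORS.get(project, [])
    unexpected = []
    for line in log_lines:
        if not ERROR_LINE.search(line):
            continue  # below ERROR, not our concern here
        if any(pattern.search(line) for pattern in allowed):
            continue  # known and accepted for now
        unexpected.append(line)
    return unexpected
```

A job would fail when unexpected_errors returns a non-empty list for any service log. Keeping the whitelist per project makes each entry something the QA team and that project can review and retire together.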

I think the iscsi example is a good case for WARNING, which is the same level we use when we fail to schedule a resource (compute / volume). Especially because we try to recover now. If we fail to recover, ERROR is probably called for. But if we actually failed to allocate a volume, we'd end up failing the tests anyways, which means the ERROR in the log wouldn't be a problem in and of itself.
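As an illustration of that split, here is a small sketch (not Cinder's actual code) in which each recoverable failure is logged at WARNING and only the final, unrecovered failure is logged at ERROR. TargetCreateError, create_target_with_retry and the attempts parameter are hypothetical names:

```python
import logging

LOG = logging.getLogger("iscsi_sketch")


class TargetCreateError(Exception):
    """Hypothetical error raised when creating an iSCSI target fails."""


def create_target_with_retry(create_target, attempts=3):
    """Call create_target(), retrying on failure up to `attempts` times."""
    for attempt in range(1, attempts + 1):
        try:
            return create_target()
        except TargetCreateError as exc:
            if attempt < attempts:
                # We are still trying to recover: odd, but not yet a failure.
                LOG.warning("Target creation failed (attempt %d of %d), "
                            "retrying: %s", attempt, attempts, exc)
            else:
                # Recovery failed: this is what an admin needs to see.
                LOG.error("Target creation failed after %d attempts: %s",
                          attempts, exc)
                raise
```

With this shape, a run in which the retry succeeds leaves only WARNING lines behind, so a "no ERROR in logs" gate would not trip on it.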

I agree thought and care is needed. As a deployer my concern is that
the only time ERROR is logged in the logs is when something is wrong
with the infrastructure (rather than a user asking for something
stupid). I think my concern and yours can both be handled at the same
time.

Right, and I think this is the perspective that I'm coming from. Our logs (at INFO and up) are UX to our cloud admins.

We should be pretty sure that we know something is a problem if we tag it as an ERROR or CRITICAL, because that's likely to be something that negatively impacts someone's day.

If we aren't completely sure your cloud is on fire, but we're pretty sure something is odd, WARNING is appropriate.

If it's no good, but we have no way to test if it's a problem, it's just INFO. I really think the "not found" case falls more into standard INFO.
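Read as a decision rule, the three cases above could be sketched like this. choose_level and its two boolean parameters are hypothetical, and real code would rarely reduce the judgment to two flags:

```python
import logging


def choose_level(known_problem, looks_odd):
    """Pick a log level following the rule of thumb described above."""
    if known_problem:
        # We are sure something is wrong with the infrastructure.
        return logging.ERROR
    if looks_odd:
        # Not sure the cloud is on fire, but something is off.
        return logging.WARNING
    # No way to tell whether it is a problem (e.g. a "not found" request).
    return logging.INFO
```

Under this rule a request for a non-existent volume would call choose_level(False, False) and be logged at INFO.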

Again, more concrete instances like the iscsi case are probably the most helpful. I think in the abstract this problem is too hard to solve, but with examples, we can probably come to some consensus.

        -Sean

--
Sean Dague
http://dague.net

