I think harmonizing the log files is a great idea. When working on elastic-recheck, I spent a lot of time staring at log files and cursing at how bad and non-uniform they are. I can only imagine what cloud operators must think.
In addition to harmonizing the log levels and making sure we don't have scary-looking logs (stack traces, etc.) during a normal tempest run, I think we should:

* Make sure that all projects use the same logging format and use request-ids. I have already filed bugs for neutron and ceilometer on this (https://bugs.launchpad.net/neutron/+bug/1239923 and https://bugs.launchpad.net/ceilometer/+bug/1244182), and I have a hunch other projects may not use these either.
* Have better default log levels for dependencies. For example, when debug logging is enabled for nova, I don't think we really need debug-level logs turned on for amqp, although perhaps I am wrong.

On Wed, Oct 23, 2013 at 8:55 PM, Sean Dague <s...@dague.net> wrote:

> On 10/23/2013 03:35 PM, Robert Collins wrote:
>
>> On 24 October 2013 08:28, John Griffith <john.griff...@solidfire.com>
>> wrote:
>>
>>> So I touched on this a bit in my earlier post but want to reiterate here and
>>> maybe clarify a bit. I agree that cleaning up and standardizing the logs is
>>> a good thing, and particularly removing unhandled exception messages would
>>> be good. What concerns me however is the approach being taken here of
>>> saying things like "Error level messages are banned from Tempest runs".
>>>
>>> The case I mentioned earlier of the negative test is a perfect example.
>>> There's no way for Cinder (or any other service) to know the difference
>>> between the end user specifying/requesting a non-existent volume and a valid
>>> volume being requested that for some reason can't be found. I'm not quite
>>> sure how you place a definitive rule like "no error messages in logs" unless
>>> you make your tests such that you never run negative tests?
>>>
>>
>> Let me check that I understand: you want to check that when a user
>> asks for a volume that doesn't exist, they don't get it, *and* that
>> the reason they didn't get it was due to Cinder detecting it's
>> missing, not due to e.g.
>> cinder throwing an error and returning 500?
>>
>> If so, that seems pretty straightforward: a) check the error that is
>> reported (it should be a 404 and contain an explanation which we can
>> check) and b) check the logs to see that nothing was logged (because a
>> server fault would be logged).
>>
>>> There are other cases in cinder as well that I'm concerned about. One
>>> example is iscsi target creation, there are a number of scenarios where this
>>> can fail under certain conditions. In most of these cases we now have retry
>>> mechanisms or alternate implementations to complete the task. The fact is
>>> however that a call somewhere in the system failed, this should be something
>>> in my opinion that stands out in the logs. Maybe this particular case would
>>> be well suited to being a warning other than an error, and that's fine. My
>>> point however though is that I think some thought needs to go into this
>>> before making blanketing rules and especially gating criteria that says "no
>>> error messages in logs".
>>>
>>
> Absolutely agreed. That's why I wanted to kick off this discussion and am
> thinking about how we get to agreement by Icehouse (giving this lots of
> time to bake and getting different perspectives in here).
>
> On the short term of failing jobs in tempest because they've got errors in
> the logs, we've got a whole white list mechanism right now for "acceptable
> errors". Over time I'd love to shrink that to 0. But that's going to be a
> collaboration between the QA team and the specific core projects to make
> sure that's the right call in each case. Who knows, maybe there are
> generally agreed to ERROR conditions that we trigger, but we'll figure that
> out over time.
>
> I think the iscsi example is a good case for WARNING, which is the same
> level we use when we fail to schedule a resource (compute / volume).
> Especially because we try to recover now.
> If we fail to recover, ERROR is
> probably called for. But if we actually failed to allocate a volume, we'd
> end up failing the tests anyways, which means the ERROR in the log wouldn't
> be a problem in and of itself.
>
>> I agree thought and care is needed. As a deployer my concern is that
>> the only time ERROR is logged in the logs is when something is wrong
>> with the infrastructure (rather than a user asking for something
>> stupid). I think my concern and yours can both be handled at the same
>> time.
>>
>
> Right, and I think this is the perspective that I'm coming from. Our logs
> (at INFO and up) are UX to our cloud admins.
>
> We should be pretty sure that we know something is a problem if we tag it
> as an ERROR, or CRITICAL. Because that's likely to be something that
> negatively impacts someone's day.
>
> If we aren't completely sure your cloud is on fire, but we're pretty sure
> something is odd, WARNING is appropriate.
>
> If it's no good, but we have no way to test if it's a problem, it's just
> INFO. I really think the "not found" case falls more into standard INFO.
>
> Again, more concrete instances like the iscsi case, are probably the most
> helpful. I think in the abstract this problem is too hard to solve, but
> with examples, we can probably come to some consensus.
>
>
> -Sean
>
> --
> Sean Dague
> http://dague.net
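
To make the two suggestions at the top of this mail a bit more concrete, here is a rough sketch using only Python's standard logging module. It is not how any project does this today. The format string, the logger names "nova.compute" and "amqp.channel", the fixed request id "req-1234", and the helpers RequestIdFilter, build_handler and quiet_dependencies are all assumptions made up for illustration. A real service would take the request id from its per-request context rather than a constant.

```python
import io
import logging

# Assumed shared format; which fields every project should emit is still
# up for discussion.
SHARED_FORMAT = "%(levelname)s %(name)s [%(request_id)s] %(message)s"


class RequestIdFilter(logging.Filter):
    """Attach a request id to each record so the shared format can print it."""

    def __init__(self, request_id):
        super().__init__()
        self.request_id = request_id

    def filter(self, record):
        record.request_id = self.request_id
        return True  # never drop the record, only annotate it


def build_handler(stream, request_id):
    """Return a handler that writes records to stream in the shared format."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(SHARED_FORMAT))
    handler.addFilter(RequestIdFilter(request_id))
    return handler


def quiet_dependencies(levels):
    """Give each dependency logger its own level, independent of the root."""
    for name, level in levels.items():
        logging.getLogger(name).setLevel(level)


stream = io.StringIO()
root = logging.getLogger()
root.setLevel(logging.DEBUG)  # debug enabled for the service as a whole
root.addHandler(build_handler(stream, "req-1234"))
quiet_dependencies({"amqp": logging.WARNING})  # hypothetical dependency logger

logging.getLogger("nova.compute").debug("starting instance")
logging.getLogger("amqp.channel").debug("frame received")  # suppressed
```

With this setup the service's own debug line is written with the request id, while the debug line from the dependency's logger is dropped because its parent logger is capped at WARNING.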
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev