On 10 November 2015 at 11:12, Matt Riedemann <[email protected]> wrote:
> On 11/10/2015 1:10 PM, Matt Riedemann wrote:
>> On 11/10/2015 12:51 PM, Armando M. wrote:
>>> On 10 November 2015 at 10:33, Matt Riedemann <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>> Let me qualify by saying I'm not a Neutron person.
>>>>
>>>> We know that gate-tempest-dsvm-neutron-full is failing hard as of
>>>> the last 24 hours [1].
>>>>
>>>> An error that's been showing up in tempest runs with neutron a lot
>>>> is:
>>>>
>>>> "AssertionError: 0 == 0 : No IPv4 addresses found in: []"
>>>>
>>>> So checking logstash [2] it's hitting a lot. It's only recent
>>>> because that failure message is new to Tempest in the last day or
>>>> so, but it has a lot of hits, so whatever it is, it's failing a lot.
>>>>
>>>> So the next step is usually digging into service logs looking for
>>>> errors. I check the q-svc logs first. Not many errors, but a
>>>> bazillion warnings for things not found (networks and devices). [3]
>>>>
>>>> For example:
>>>>
>>>> 2015-11-10 17:13:02.542 WARNING neutron.plugins.ml2.rpc
>>>> [req-15a73753-1512-4689-9404-9658a0cd0c09 None None] Device
>>>> aaa525be-14eb-44a5-beb0-ed722896be93 requested by agent
>>>> ovs-agent-devstack-trusty-rax-iad-5785199 not found in database
>>>>
>>>> 2015-11-10 17:14:17.754 WARNING neutron.api.rpc.handlers.dhcp_rpc
>>>> [req-3d7e9848-6151-4780-907f-43f11a2a8545 None None] Network
>>>> b07ad9b2-e63e-4459-879d-3721074704e5 could not be found, it might
>>>> have been deleted concurrently.
>>>>
>>>> Are several hundred of these warnings useful to an operator trying
>>>> to debug a problem? The point of the CI gate testing is to try and
>>>> simulate a production cloud environment. When something goes wrong,
>>>> you check the logs. With the amount of warning/error level logging
>>>> that is in the neutron logs, finding a real problem is like looking
>>>> for a needle in a haystack. Since everything is async, 404s are
>>>> expected when racing to delete a resource, and they should be
>>>> handled gracefully.
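To make the "handle it gracefully" point concrete, here is a rough sketch (the names and structure are made up for illustration, this is not actual neutron code) of treating a lookup miss during a concurrent delete as an expected, debug-level event rather than a WARNING:

```python
# Illustrative sketch only: when deletes race with lookups, a miss is
# an expected outcome of async operation, so log it at debug level and
# return cleanly instead of emitting a warning for every benign miss.
import logging

LOG = logging.getLogger(__name__)


class NetworkNotFound(Exception):
    """Stand-in for a plugin's not-found exception."""


def get_network(db, network_id):
    # 'db' is a plain dict standing in for a database layer.
    try:
        return db[network_id]
    except KeyError:
        raise NetworkNotFound(network_id)


def handle_dhcp_request(db, network_id):
    try:
        return get_network(db, network_id)
    except NetworkNotFound:
        # Expected during a concurrent delete: debug, not warning.
        LOG.debug("Network %s not found; likely deleted concurrently.",
                  network_id)
        return None
```

The operator-facing effect is that routine races stay out of the WARNING stream, which is exactly the haystack being complained about above.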
>>>> Anyway, the server log isn't useful, so I go digging in the agent
>>>> logs, and stacktraces there are aplenty. [4]
>>>>
>>>> Particularly this:
>>>>
>>>> "Exception: Port tapcea51630-e1 is not ready, resync needed"
>>>>
>>>> That's due to a new change landing in the last 24 hours [5]. But
>>>> the trace shows up over 16K times since it landed [6].
>>>>
>>>> Checking the code, it's basically a loop processing events, and
>>>> when it hits an event it can't handle, it punts (breaking the loop
>>>> so you don't process the other events after it - which is a bug),
>>>> and the code that eventually handles it just catches all Exception
>>>> and traces them out, assuming they are really bad.
>>>>
>>>> At this point, as a non-neutron person, i.e. not well versed in the
>>>> operations of neutron or how to debug it in great detail, I assume
>>>> something is bad here, but I don't really know - and the logs are
>>>> so full of noise that I can't distinguish real failures.
>>>>
>>>> I don't mean to pick on this particular change, but it's a good
>>>> example of a recent thing.
>>>>
>>>> I'd like to know if this is a known issue or WIP-type stuff. I've
>>>> complained about excessively noisy neutron logs in channel before,
>>>> and I'm usually told that they are either necessary (for whatever
>>>> reason) or that rather than complain about the verbosity, I should
>>>> fix the race that is causing it - which is not likely to happen,
>>>> since I don't have the async RPC happy nature of everything in
>>>> neutron in my head to debug it (I doubt many do).
>>>
>>> I am not sure that's a fair statement: we usually point out that
>>> just lowering log levels does not really solve the underlying issue
>>> (whichever it may be), and that comment really should apply to any
>>> project, not just Neutron. That said, we have had examples where we
>>> took your input and drove the right fix ourselves.
>>>
>>> We have a 'logging' tag for Neutron bugs that we use to identify
>>> these types of cleanups.
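For what it's worth, the loop-breaking behaviour described above can be avoided with per-event error handling. A minimal sketch (hypothetical names, not the actual agent code): handle each event independently, note recoverable failures for a resync, and keep going:

```python
# Illustrative sketch only: process each event independently so that
# one failure doesn't abandon the rest of the batch, and collect the
# failures for a targeted resync rather than raising a blanket
# Exception that a caller can only trace out.
import logging

LOG = logging.getLogger(__name__)


class PortNotReady(Exception):
    """Stand-in for a recoverable per-port condition."""


def process_events(events, handler):
    needs_resync = []
    for event in events:
        try:
            handler(event)
        except PortNotReady:
            # Recoverable and expected: note it quietly, keep going.
            LOG.debug("Port %s not ready; scheduling resync.", event)
            needs_resync.append(event)
        except Exception:
            # Genuinely unexpected: this one deserves a traceback,
            # but still don't abandon the remaining events.
            LOG.exception("Unexpected error handling event %s", event)
            needs_resync.append(event)
    return needs_resync
```

The point is that the one "not ready" port gets a debug line and a resync, and the other events in the batch still get processed.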
>>> We'd need your attention to detail to alert us of issues like
>>> these; we'll take care of the right fixes. Currently, the queue is
>>> pretty dry. If you can top it up, that'll be great. Going off on a
>>> log cleanup rampage doesn't seem like the best course of action;
>>> I'd rather knock out issues one by one as they come, like the one
>>> you just mentioned.
>>>
>>> [1] https://bugs.launchpad.net/neutron/+bugs?field.tag=logging
>>
>> Tagged:
>>
>> https://bugs.launchpad.net/nova/+bug/1514935
>>
>> Although it must not be an official tag, since it didn't
>> auto-complete for me; that should be fixed.
>
> Crap, nevermind that, my bad - I had created the bug against nova
> (force of habit) rather than neutron. Confirmed that 'logging' is a
> neutron tag in LP.

Ah, gotcha!

>>>> Anyway, this is a plea for sanity in the logs. There are logging
>>>> guidelines for openstack [7]. Let's please abide by them. Let's
>>>> keep operators in mind when we're looking at logs and be proactive
>>>> about making them useful (which includes more granular error
>>>> handling and fewer global try/except Exception: LOG.exception
>>>> constructs).
>>>
>>> Your point is duly noted. We have documented this, and we are being
>>> more meticulous during reviews.
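As one illustration of the "granular error handling" being asked for (all names here are hypothetical): catch only the exceptions you can actually interpret, pick a log level that matches the operator impact, and let genuinely unexpected errors propagate to a caller that can decide, instead of a blanket try/except Exception:

```python
# Illustrative sketch only: map known failure modes to appropriate log
# levels, and deliberately avoid a catch-all 'except Exception:
# LOG.exception' so real bugs surface to the caller.
import logging

LOG = logging.getLogger(__name__)


class DeviceNotFound(Exception):
    """Stand-in for an expected race with a concurrent delete."""


class AgentTimeout(Exception):
    """Stand-in for a condition an operator may need to act on."""


def update_device(rpc_call, device_id):
    try:
        return rpc_call(device_id)
    except DeviceNotFound:
        # Expected in an async system: debug, not warning.
        LOG.debug("Device %s already gone; nothing to update.", device_id)
        return None
    except AgentTimeout:
        # Actionable for an operator: a warning with context is fair.
        LOG.warning("Timed out updating device %s; will retry.", device_id)
        return None
    # Anything else propagates: no blanket except here.
```

Nothing about this is neutron-specific; it is just the level-selection discipline the logging guidelines [7] describe.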
>>>> [1] http://tinyurl.com/ne3ex4v
>>>> [2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22AssertionError:%200%20==%200%20:%20No%20IPv4%20addresses%20found%20in:%20%5B%5D%5C%22%20AND%20tags:%5C%22console%5C%22
>>>> [3] http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-svc.txt.gz?level=TRACE
>>>> [4] http://logs.openstack.org/85/239885/2/gate/gate-tempest-dsvm-neutron-full/602d864/logs/screen-q-agt.txt.gz?level=TRACE
>>>> [5] https://review.openstack.org/#/c/164880/
>>>> [6] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message:%5C%22Exception:%20Port%5C%22%20AND%20message:%5C%22is%20not%20ready,%20resync%20needed%5C%22%20AND%20tags:%5C%22screen-q-agt.txt%5C%22
>>>> [7] http://specs.openstack.org/openstack/openstack-specs/specs/log-guidelines.html
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>>
>>>> Matt Riedemann
>
> --
>
> Thanks,
>
> Matt Riedemann
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
