On 09/16/2014 05:44 AM, Thierry Carrez wrote:
> Michael Still wrote:
>> Yes, that was my point. I don't mind us debating how to rearrange
>> hypervisor drivers. However, if we think that will solve all our
>> problems we are confused.
>>
>> So, how do we get people to start taking bugs / gate failures more seriously?
>
> I think we need to build a cross-project team working on that. Having
> gate liaisons designated in every project should help bootstrap that
> team -- it doesn't mean it's a one-person-per-project job, but at least
> you have a contact person when you need an expert in some project that
> is also versed in the arts of the gate.
>
> I also think we need to do a slightly better job at visualizing issues.
> Like Dims said, even with tabs opened to the right places, it's
> non-trivial to determine which is the killer bug from which isn't. And
> without carefully checking IRC backlog in 4 different channels, it's
> also hard to find out that a bug is already taken care of. I woke up one
> morning with gate being obviously stuck on some issue, investigated it,
> only to realize after 30 minutes that the fix was already in the gate
> queue. That's a bit of a frustrating experience.
>
> Finally, it's not completely crazy to use a specific channel
> (#openstack-gate ?) for that. Yes, there is a lot of overlap with -qa
> and -infra channels, but those channels aren't dedicated to that
> problem, so 25% of the issues are discussed on one, 25% on the other,
> 25% on the project-specific channel, and the remaining 25% on some
> random channel the right people happen to be in. Having a clear channel
> where all the gate liaisons hang out and all issues are discussed may go
> a long way into establishing a team to work on that (rather than
> continue to rely on the same set of willing individuals).

Honestly, I'm pretty anti 'add another channel'. Especially because
there seems to be some assumption that you can address this problem
without understanding our integration environment (devstack / tempest /
d-g). This is not a problem in isolation, it's a problem about the
synthesis of all the parts. The diving on these issues is already
happening in a place. We should build on that, and not synthetically
create some 3rd place esperanto channel thinking that will fix the
issue.

I've thought about the visualization problem a lot... some of the
output of that was the os-loganalyze and elastic-recheck projects, as
well as pretty-tox in tempest, which shows which worker each test is
running in so you can figure out what's happening simultaneously.

Here's the root problem I ran into. Which kinds of visualizations are
useful changes at a pretty good clip. These bugs are hard to find and
fix because they are typically the interaction of a bunch of moving
parts. So the tools you need to fix them are some combination of (1)
visualizations, plus (2) a reasonable mental model in your head of how
all of OpenStack fits together (and how components expose to operators
what they are doing).

I actually think part 2 is the weak spot for most folks. For instance,
you need to know that glanceclient's logging is ridiculous and should
be ignored, because it spews a ton of ERRORs for no good reason.

Basically that's the key skill: understanding the request flows that go
through OpenStack, understanding how to read OpenStack logs, and being
mindful that the issue might be caused by other things happening at the
same time as the thing you are trying to do (so keep an eye out for
those).

-Sean

--
Sean Dague
http://dague.net

_______________________________________________
OpenStack-dev mailing list
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev