On 09/16/2014 05:44 AM, Thierry Carrez wrote:
> Michael Still wrote:
>> Yes, that was my point. I don't mind us debating how to rearrange
>> hypervisor drivers. However, if we think that will solve all our
>> problems we are confused.
>>
>> So, how do we get people to start taking bugs / gate failures more seriously?
> 
> I think we need to build a cross-project team working on that. Having
> gate liaisons designated in every project should help bootstrap that
> team -- it doesn't mean it's a one-person-per-project job, but at least
> you have a contact person when you need an expert in some project that
> is also versed in the arts of the gate.
> 
> I also think we need to do a slightly better job at visualizing issues.
> Like Dims said, even with tabs opened to the right places, it's
> non-trivial to tell the killer bug from the ones that aren't. And
> without carefully checking IRC backlog in 4 different channels, it's
> also hard to find out that a bug is already taken care of. I woke up one
> morning with gate being obviously stuck on some issue, investigated it,
> only to realize after 30 minutes that the fix was already in the gate
> queue. That's a bit of a frustrating experience.
>
> Finally, it's not completely crazy to use a specific channel
> (#openstack-gate ?) for that. Yes, there is a lot of overlap with -qa
> and -infra channels, but those channels aren't dedicated to that
> problem, so 25% of the issues are discussed on one, 25% on the other,
> 25% on the project-specific channel, and the remaining 25% on some
> random channel the right people happen to be in. Having a clear channel
> where all the gate liaisons hang out and all issues are discussed may go
> a long way toward establishing a team to work on that (rather than
> continue to rely on the same set of willing individuals).

Honestly, I'm pretty anti 'add another channel', especially because
there seems to be some assumption that you can address this problem
without understanding our integration environment (devstack / tempest /
d-g). This is not a problem in isolation; it's a problem about the
synthesis of all the parts. The digging into these issues is already
happening somewhere. We should build on that, not artificially create
some third-place Esperanto channel and expect that to fix the issue.

I've thought about the visualization problem a lot. Some of the output
of that includes the os-loganalyze and elastic-recheck projects, as well
as pretty-tox in tempest, which shows which worker each test is running
in so you can figure out what's happening simultaneously.

Here's the root problem I ran into: which kinds of visualizations are
useful changes at a pretty good clip. These bugs are hard to find and
fix because they are typically the interaction of a bunch of moving parts.

So the tools you need to fix them are some combination of
visualizations, plus a reasonable mental model in your head of how all
of OpenStack fits together (and how components expose to operators what
they are doing). I think the second part is actually the weak spot for
most folks: for instance, knowing that glanceclient's logging is
ridiculous and that you should ignore it, because it spews a ton of
ERRORS for no good reason.

Basically that's the key skill: understanding the request flows that go
through OpenStack, understanding how to read OpenStack logs, and being
mindful that the issue might be caused by other things happening at the
same time that you are trying to do a thing (so keep an eye out for those).
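
As a rough illustration of that way of reading logs, here is a minimal
Python sketch. It is not taken from os-loganalyze or elastic-recheck.
The log line shape in LINE_RE, the NOISY_LOGGERS list, the helper names,
and the sample lines are all assumptions made up for the example, so
adjust them to whatever your services actually emit. The idea is to
drop ERROR lines from loggers you already know are noisy, then look at
what else was logged around the same moment as a real error.

```python
import re
from datetime import datetime

# Assumed line shape: "<timestamp> <pid> <LEVEL> <logger> [<context>] <message>".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<logger>\S+) "
    r"\[(?P<ctx>[^\]]*)\] (?P<msg>.*)$"
)

# Loggers whose ERROR lines are treated as noise (assumed list).
NOISY_LOGGERS = ("glanceclient",)


def parse(lines):
    """Yield one dict per line that matches the assumed format."""
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            yield match.groupdict()


def timestamp(record):
    """Convert the record's timestamp string to a datetime."""
    return datetime.strptime(record["ts"], "%Y-%m-%d %H:%M:%S.%f")


def interesting_errors(records):
    """ERROR records that do not come from a known-noisy logger."""
    return [
        r for r in records
        if r["level"] == "ERROR" and not r["logger"].startswith(NOISY_LOGGERS)
    ]


def concurrent_with(records, target, window_seconds=1.0):
    """Other records logged within window_seconds of the target record."""
    t0 = timestamp(target)
    return [
        r for r in records
        if r is not target
        and abs((timestamp(r) - t0).total_seconds()) <= window_seconds
    ]


# Made-up sample lines, only to exercise the functions above.
SAMPLE = [
    "2014-09-16 05:44:00.100 1234 INFO service_a.manager [req-aaa] sample: starting work",
    "2014-09-16 05:44:00.300 1234 ERROR glanceclient [req-aaa] sample: noisy error",
    "2014-09-16 05:44:00.500 2345 ERROR service_b.driver [req-bbb] sample: real failure",
    "2014-09-16 05:44:09.000 2345 INFO service_a.api [req-ccc] sample: later request",
]

if __name__ == "__main__":
    records = list(parse(SAMPLE))
    for err in interesting_errors(records):
        print("ERROR:", err["logger"], err["msg"])
        for other in concurrent_with(records, err):
            print("  nearby:", other["ts"], other["logger"], other["msg"])
```

The one-second window is arbitrary. In practice you would widen or
narrow it depending on how long the failing operation takes.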

        -Sean

-- 
Sean Dague
http://dague.net

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
