On Tue, Jul 08, 2014 at 06:21:31PM -0400, Sean Dague wrote:
> On 07/08/2014 06:12 PM, Joe Gordon wrote:
> >
> > On Tue, Jul 8, 2014 at 2:56 PM, Michael Still wrote:
> >
> >     The associated bug says this is probably a qemu bug, so I think we
> >     should rephrase that to "we need to start thinking about how to make
> >     sure upstream changes don't break nova".
> >
> > Good point.
> >
> > Would running devstack-tempest on the latest upstream release of ? help.
> > Not as a voting job but as a periodic (third party?) job, that we can
> > hopefully identify these issues early on. I think the big question here
> > is who would volunteer to help run a job like this.

Although I'm familiar with Gate and infra in depth, I can volunteer to
help debug such issues (as I test libvirt/QEMU upstream releases and
builds from git quite frequently).

> The running of the job really isn't the issue.
>
> It's the debugging of the jobs when they go wrong. Creating a new test
> job and getting it lit is really < 10% of the work, sifting through the
> fails and getting to the bottom of things is the hard and time consuming
> part.

Very true.

For instance, take the live snapshot issue[1]: I wish we could get to the
logical end of it (without letting it languish) and re-enable it in Nova
soon. But as of now we're not able to pinpoint the root cause, and it is
no longer reproducible: Dan Berrange's detailed analysis covered a week
of tests outside the Gate, plus tests with some debugging enabled[2]
while the Gate was under light load, and in both cases he didn't hit the
issue after multiple test runs.

Dan raised on #openstack-nova whether some weird I/O issue in HP cloud
might be leading to these timeouts, but Sean said a timeout would be the
issue only if the test in question sometimes took 2 minutes and still
succeeded.

FWIW, in my local tests of the exact Nova invocation of the libvirt
blockRebase API, doing parallel blockcopy operations followed by an
explicit abort (to gracefully end the block operation), I couldn't
reproduce it over multiple runs either.

[1] https://bugs.launchpad.net/nova/+bug/1334398 -- libvirt live_snapshot
    periodically explodes on libvirt 1.2.2 in the gate
[2] https://review.openstack.org/#/c/103066/

> The other option is to remove more concurrency from nova-compute. It's
> pretty clear that this problem only seems to happen when the
> snapshotting is going on at the same time guests are being created or
> destroyed (possibly also a second snapshot going on).
>
> This is also why I find it unlikely to be a qemu bug, because that's not
> shared state between guests.
> If qemu just randomly wedges itself, that
> would be detectable much easier outside of the gate. And there have been
> attempts by danpb to sniff that out, and they haven't worked.
>
> -Sean
>

--
/kashyap
