On Tue, Nov 25, 2014 at 01:22:01PM -0800, Vishvananda Ishaya wrote:
> On Nov 25, 2014, at 7:29 AM, Matt Riedemann <[email protected]> wrote:
> > On 11/25/2014 9:03 AM, Matt Riedemann wrote:
> >> On 11/25/2014 8:11 AM, Sean Dague wrote:
> >>> There is currently a review stream coming into Tempest to add Cinder v2
> >>> tests in addition to the Cinder v1 tests. At the same time, the current
> >>> biggest race fail in the gate related to these projects is
> >>> http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
> >>> related.
> >>>
> >>> I believe these 2 facts are coupled. The number of volume tests we have
> >>> in tempest is somewhat small, and as such the likelihood of them running
> >>> simultaneously is also small. However, the fact that we are getting more
> >>> of these race fails as the number of tests with volumes goes up typically
> >>> means that what's actually happening is that 2 volume ops that aren't
> >>> safe to run at the same time are being run at the same time.
> >>>
> >>> This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
> >>> with no assignee.
> >>>
> >>> So we really need dedicated diving on this (the last bug update with any
> >>> code was a month ago); otherwise we need to stop adding these tests to
> >>> Tempest, and honestly start skipping the volume tests if we can't have
> >>> repeatable success.
> >>>
> >>> -Sean
> >>
> >> I just put up an e-r query this morning for a newly opened bug,
> >> https://bugs.launchpad.net/cinder/+bug/1396186; it looks similar to bug
> >> 1373513 but without the blocked-task error in syslog.
> >>
> >> There is a three minute gap between when the volume is being deleted in
> >> the c-vol logs and when we see the volume uuid logged again, at which
> >> point tempest has already timed out waiting for the delete to complete.
> >>
> >> We should at least get some patches to add diagnostic logging in these
> >> delete flows (or in the periodic tasks that use the same locks/low-level
> >> I/O-bound commands?) to try and pinpoint these failures.
> >>
> >> I think I'm going to propose a skip patch for test_volume_boot_pattern,
> >> since that just seems to be a never-ending cause of pain until these root
> >> issues get fixed.
> >
> > I marked 1396186 as a duplicate of 1373513 since the e-r query for 1373513
> > had an OR'd message which was the same as the one for 1396186.
> >
> > I went ahead and proposed a skip for test_volume_boot_pattern due to bug
> > 1373513 [1] until people get on top of debugging it.
> >
> > I added some notes to bug 1396186; the 3 minute hang seems to be due to a
> > vgs call taking ~1 minute and an lvs call taking ~2 minutes.
> >
> > I'm not sure if those are hit in the volume delete flow or in some periodic
> > task, but if there are multiple concurrent worker processes that could be
> > hitting those commands at the same time, can we look at off-loading one of
> > them to a separate thread or something?
>
> Do we set up devstack to not zero volumes on delete
> (CINDER_SECURE_DELETE=False)? If not, the dd process could be hanging the
> system due to I/O load. This would get significantly worse with multiple
> deletes occurring simultaneously.
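On the idea of off-loading the slow vgs/lvs calls to a separate thread: below
is a minimal sketch of what that could look like, assuming an eventlet-based
cinder-volume worker. It is illustrative only, not Cinder's actual code; the
helper name, the command options, and the 10 second threshold are made up. It
also folds in the diagnostic-logging suggestion so the logs would show where
the minutes are going.

    # Illustrative sketch: time the slow LVM queries and run them in a
    # native OS thread via eventlet's tpool, so one slow vgs/lvs call does
    # not stall every greenthread in the c-vol worker process.
    import logging
    import subprocess
    import time

    from eventlet import tpool

    LOG = logging.getLogger(__name__)

    def run_lvm_query(cmd):
        """Run an LVM query (e.g. ['vgs', ...]) without blocking the hub."""
        start = time.time()
        # tpool.execute() runs the blocking call in a real OS thread; the
        # eventlet hub keeps servicing other volume ops in the meantime.
        out = tpool.execute(subprocess.check_output, cmd)
        elapsed = time.time() - start
        if elapsed > 10:
            LOG.warning('%s took %.1fs', ' '.join(cmd), elapsed)
        return out

    # Example calls matching the kind of commands that were slow in c-vol:
    #   run_lvm_query(['vgs', '--noheadings', '-o', 'name'])
    #   run_lvm_query(['lvs', '--noheadings', '-o', 'name'])

Whether that actually helps depends on why the commands are slow: if the box
is simply saturated with I/O, moving the call to a thread won't make it finish
any faster, it just keeps the rest of the worker responsive while it waits.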
As for the CINDER_SECURE_DELETE question: yes, we do that:
http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate.sh#n139
and
http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate-wrap.sh#n170

It can be overridden, but I don't think that any of the job definitions do that.

-Matt Treinish
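P.S. For anyone not clicking through, what those two links amount to is the
devstack configuration that devstack-gate generates containing (roughly):

    CINDER_SECURE_DELETE=False

so the dd-based zeroing on delete should already be disabled in the gate; a
job could in principle override that in its own devstack settings, but as
noted above I don't think any of the job definitions do.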
