On Nov 25, 2014, at 7:29 AM, Matt Riedemann <mrie...@linux.vnet.ibm.com> wrote:
> On 11/25/2014 9:03 AM, Matt Riedemann wrote:
>> On 11/25/2014 8:11 AM, Sean Dague wrote:
>>> There is currently a review stream coming into Tempest to add Cinder v2
>>> tests in addition to the Cinder v1 tests. At the same time, the current
>>> biggest race fail in the gate related to these projects is
>>> http://status.openstack.org/elastic-recheck/#1373513 - which is Cinder
>>> related.
>>>
>>> I believe these two facts are coupled. The number of volume tests we have
>>> in Tempest is somewhat small, and as such the likelihood of them running
>>> simultaneously is also small. However, the fact that we are getting more
>>> of these race fails as the number of tests with volumes goes up typically
>>> means that what's actually happening is that two volume operations that
>>> aren't safe to run at the same time are doing exactly that.
>>>
>>> This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
>>> with no assignee.
>>>
>>> So we really need dedicated diving on this (the last bug update with any
>>> code was a month ago); otherwise we need to stop adding these tests to
>>> Tempest, and honestly start skipping the volume tests if we can't have
>>> repeatable success.
>>>
>>> -Sean
>>
>> I just put up an e-r query for a newly opened bug,
>> https://bugs.launchpad.net/cinder/+bug/1396186, this morning; it looks
>> similar to bug 1373513 but without the blocked-task error in syslog.
>>
>> There is a three-minute gap between when the volume is being deleted in
>> the c-vol logs and when we see the volume UUID logged again, at which
>> point Tempest has already timed out waiting for the delete to complete.
>>
>> We should at least get some patches to add diagnostic logging in these
>> delete flows (or the periodic tasks that use the same locks/low-level
>> I/O-bound commands?) to try to pinpoint these failures.
>>
>> I think I'm going to propose a skip patch for test_volume_boot_pattern,
>> since that just seems to be a never-ending cause of pain until these
>> root issues get fixed.
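[The diagnostic logging suggested above could look roughly like the following sketch. The wrapper name `run_timed` and the 10-second threshold are illustrative assumptions, not Cinder's actual API:]

```python
import logging
import time

LOG = logging.getLogger(__name__)

# Illustrative cut-off for flagging a "suspiciously slow" low-level call.
SLOW_THRESHOLD = 10.0  # seconds


def run_timed(label, func, *args, **kwargs):
    """Run a low-level call (e.g. a vgs/lvs wrapper) and log how long it took.

    Emits a warning when the call exceeds SLOW_THRESHOLD, which is the kind
    of breadcrumb that would pinpoint the 1- and 2-minute hangs seen here.
    """
    start = time.monotonic()
    try:
        return func(*args, **kwargs)
    finally:
        elapsed = time.monotonic() - start
        if elapsed > SLOW_THRESHOLD:
            LOG.warning("%s took %.1fs (threshold %.1fs)",
                        label, elapsed, SLOW_THRESHOLD)
        else:
            LOG.debug("%s took %.1fs", label, elapsed)
```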
> I marked 1396186 as a duplicate of 1373513, since the e-r query for 1373513
> had an OR'd message which was the same as 1396186.
>
> I went ahead and proposed a skip for test_volume_boot_pattern due to bug
> 1373513 [1] until people get on top of debugging it.
>
> I added some notes to bug 1396186; the three-minute hang seems to be due to
> a vgs call taking ~1 minute and an lvs call taking ~2 minutes.
>
> I'm not sure if those are hit in the volume delete flow or in some periodic
> task, but if there are multiple concurrent worker processes that could be
> hitting those commands at the same time, can we look at off-loading one of
> them to a separate thread or something?

Do we set up devstack to not zero volumes on delete (CINDER_SECURE_DELETE=False)? If not, the dd process could be hanging the system due to I/O load. This would get significantly worse with multiple deletes occurring simultaneously.

Vish
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev