On Nov 25, 2014, at 7:29 AM, Matt Riedemann <mrie...@linux.vnet.ibm.com> wrote:

> 
> 
> On 11/25/2014 9:03 AM, Matt Riedemann wrote:
>> 
>> 
>> On 11/25/2014 8:11 AM, Sean Dague wrote:
>>> There is currently a review stream coming into Tempest to add Cinder v2
>>> tests in addition to the Cinder v1 tests. At the same time the currently
>>> biggest race fail in the gate related to the projects is
>>> http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
>>> related.
>>> 
>>> I believe these two facts are coupled. The number of volume tests we
>>> have in Tempest is somewhat small, so the likelihood of them running
>>> simultaneously is also small. However, the fact that we see more of
>>> these race fails as the number of volume tests goes up typically means
>>> that what's actually happening is that two volume operations that
>>> aren't safe to run at the same time are doing exactly that.
>>> 
>>> This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
>>> with no assignee.
>>> 
>>> So we really need someone dedicated to dig into this (the last bug
>>> update with any code was a month ago); otherwise we need to stop adding
>>> these tests to Tempest and, honestly, start skipping the volume tests
>>> if we can't get repeatable success.
>>> 
>>>    -Sean
>>> 
>> 
>> I just put up an e-r query for a newly opened bug
>> https://bugs.launchpad.net/cinder/+bug/1396186 this morning, it looks
>> similar to bug 1373513 but without the blocked task error in syslog.
>> 
>> There is a three minute gap between when the volume is being deleted in
>> c-vol logs and when we see the volume uuid logged again, at which point
>> tempest has already timed out waiting for the delete to complete.
>> 
>> We should at least get some patches to add diagnostic logging in these
>> delete flows (or in periodic tasks that use the same locks or low-level
>> I/O-bound commands?) to try to pinpoint these failures.
>> 
>> I think I'm going to propose a skip patch for test_volume_boot_pattern
>> since that just seems to be a never ending cause of pain until these
>> root issues get fixed.
>> 
> 
> I marked 1396186 as a duplicate of 1373513, since the e-r query for 1373513 
> included an OR'd message that was the same as the one for 1396186.
> 
> I went ahead and proposed a skip for test_volume_boot_pattern due to bug 
> 1373513 [1] until people get on top of debugging it.
> 
> I added some notes to bug 1396186; the three-minute hang seems to be due to a 
> vgs call taking ~1 minute and an lvs call taking ~2 minutes.
> 
> I'm not sure if those are hit in the volume delete flow or in some periodic 
> task, but if multiple concurrent worker processes could be hitting those 
> commands at the same time, can we look at off-loading one of them to a 
> separate thread or something?

Do we set up devstack to not zero volumes on delete 
(CINDER_SECURE_DELETE=False)? If not, the dd process could be hanging the 
system due to I/O load. This would get significantly worse with multiple 
deletes occurring simultaneously.
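For reference, the setting in question is a one-line devstack config entry (in localrc, or the [[local|localrc]] section of local.conf):

```shell
# Disable Cinder's dd zero-fill on volume delete in devstack, avoiding
# the heavy I/O load described above.
CINDER_SECURE_DELETE=False
```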

Vish


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
