On Feb 21, 2015 12:26 AM, "Joe Gordon" <joe.gord...@gmail.com> wrote:
>
> On Fri, Feb 20, 2015 at 7:29 AM, Deepak Shetty <dpkshe...@gmail.com> wrote:
>>
>> Hi Jeremy,
>> Couldn't find anything strong in the logs to back a reason for the OOM.
>> At the time the OOM happens, the mysqld and java processes hold the most RAM, hence the OOM killer selects mysqld (4.7G) to be killed.
>>
>> From a glusterfs backend perspective I haven't found anything suspicious, and we don't have the glusterfs logs (typically in /var/log/glusterfs), so we can't delve into glusterfs too much :(
>>
>> BharatK (in CC) also tried to re-create the issue in a local VM setup, but hasn't managed to yet!
>>
>> Having said that, we do know that we started seeing this issue after we enabled the nova-assisted-snapshot tests (by changing nova's policy.json to allow non-admin users to create hypervisor-assisted snapshots). We think that enabling online snapshots may have added to the number of tests and the memory load; that's the only clue we have as of now!
>>
>
> It looks like the OOM killer hit while qemu was busy, during a ServerRescueTest. Maybe the libvirt logs would be useful as well?

Thanks for the data point, I will look at this test to understand more about what's happening.
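(As an aside, for anyone else digging through the log tarball: the OOM activity is easy to spot in the syslog. Something along these lines should list the kills and the per-process memory table the kernel dumps just before each one -- paths are for our CentOS 7 slaves, and the exact message text varies a bit by kernel version:

    # list OOM killer invocations and their victims
    grep -iE 'invoked oom-killer|out of memory|killed process' /var/log/messages

    # include the process table (RSS, oom_score_adj) dumped before each kill
    grep -i -A 40 'invoked oom-killer' /var/log/messages

That is how we spotted mysqld at ~4.7G above.)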
> And I don't see any tempest tests calling assisted-volume-snapshots

Maybe the run just hasn't reached them yet.

Thanks,
Deepak
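(For reference, the policy.json change that 157707 reverts was along these lines -- this is a sketch from memory, and the exact rule names in nova's policy.json may differ:

    "compute_extension:os-assisted-volume-snapshots:create": "rule:admin_api",
    "compute_extension:os-assisted-volume-snapshots:delete": "rule:admin_api",

relaxed to an empty rule, which lets any authenticated user, i.e. tempest's non-admin user, create hypervisor-assisted snapshots:

    "compute_extension:os-assisted-volume-snapshots:create": "",
    "compute_extension:os-assisted-volume-snapshots:delete": "",

That is what enabled the online snapshot tests in the first place.)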
> Also, this looks odd: Feb 19 18:47:16 devstack-centos7-rax-iad-916633.slave.openstack.org libvirtd: missing __com.redhat_reason in disk io error event
>
>>
>> So:
>>
>> 1) BharatK has merged the patch ( https://review.openstack.org/#/c/157707/ ) to revert the policy.json change in the glusterfs job, so no more nova-assisted-snapshot tests.
>>
>> 2) We are also increasing the timeout of our job in patch ( https://review.openstack.org/#/c/157835/1 ) so that we can get a full run without timeouts and do a proper analysis of the logs (logs are not posted if the job times out).
>>
>> Can you please re-enable our job, so that we can confirm that disabling the online snapshot test cases helps? If it does, that will help us narrow down the issue.
>>
>> We also plan to monitor & debug over the weekend, so having the job enabled would help us a lot.
>>
>> thanx,
>> deepak
>>
>>
>> On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley <fu...@yuggoth.org> wrote:
>>>
>>> On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
>>> [...]
>>> > For some reason we are seeing the centos7 glusterfs CI job getting
>>> > aborted/killed, either by a Java exception or by the build getting
>>> > aborted due to timeout.
>>> [...]
>>> > Hoping to root cause this soon and get the cinder-glusterfs CI job
>>> > back online soon.
>>>
>>> I manually reran the same commands this job runs on an identical
>>> virtual machine and was able to reproduce some substantial
>>> weirdness.
>>>
>>> I temporarily lost remote access to the VM around 108 minutes into
>>> running the job (~17:50 in the logs), and the out-of-band console
>>> also became unresponsive to carriage returns. The machine's IP
>>> address still responded to ICMP ping, but attempts to open new TCP
>>> sockets to the SSH service never got a protocol version banner back.
>>> After about 10 minutes of that I went out to lunch but left
>>> everything untouched. To my excitement it was up and responding
>>> again when I returned.
>>>
>>> It appears from the logs that it runs well past the 120-minute mark
>>> where devstack-gate tries to kill the gate hook for its configured
>>> timeout. Somewhere around 165 minutes in (18:47) you can see the
>>> kernel out-of-memory killer start to kick in and kill httpd and
>>> mysqld processes, according to the syslog. Hopefully this is enough
>>> additional detail to get you a start at finding the root cause so
>>> that we can re-enable your job. Let me know if there's anything else
>>> you need for this.
>>>
>>> http://fungi.yuggoth.org/tmp/logs.tar
>>> --
>>> Jeremy Stanley
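P.S. On the timeout bump in 157835: that is just the devstack-gate knob, i.e. something like the below in the job definition (180 here is illustrative; the actual value is in the review):

    # devstack-gate reads its timeout (in minutes) from this variable
    export DEVSTACK_GATE_TIMEOUT=180

which should give the run enough headroom to finish and upload logs.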
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev