Hi Jeremy,

I couldn't find anything conclusive in the logs to explain the OOM. At the time the OOM occurs, the mysqld and java processes are the largest consumers of RAM, so the OOM killer selects mysqld (4.7G) to be killed.
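For anyone re-checking the posted logs: the kernel records each OOM event together with a per-process memory table, so something along these lines should surface the details (the file name below is just an example; it may be /var/log/messages on the CentOS 7 node):

    # list the OOM events recorded by the kernel
    grep -nE 'invoked oom-killer|Out of memory' syslog.txt
    # show the per-process memory table dumped with each event
    grep -A40 'invoked oom-killer' syslog.txt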
From a GlusterFS backend perspective I haven't found anything suspicious, and we don't have the GlusterFS logs (typically under /var/log/glusterfs), so we can't dig very deep into GlusterFS itself :( BharatK (in CC) also tried to re-create the issue in a local VM setup, but hasn't managed to yet.

Having said that, we *do know* that we started seeing this issue after we enabled the nova-assisted-snapshot tests (by changing nova's policy.json to allow non-admin users to create hypervisor-assisted snapshots). We suspect that enabling online snapshots increased the number of tests and the memory load, and that's the only clue we have so far. So:

1) BharatK has merged a patch ( https://review.openstack.org/#/c/157707/ ) to revert the policy.json change in the glusterfs job, so the nova-assisted-snapshot tests no longer run (a sketch of the policy rules involved is in the P.S. at the end of this mail).

2) We are also increasing the timeout of our job in patch ( https://review.openstack.org/#/c/157835/1 ) so that we can get a full run without timeouts and do a proper analysis of the logs (logs are not posted if the job times out).

Can you please re-enable our job, so that we can confirm whether disabling the online snapshot test cases helps? If it does, that will help us narrow down the issue. We also plan to monitor and debug over the weekend, so having the job enabled would help us a lot.

Thanks,
Deepak

On Thu, Feb 19, 2015 at 10:37 PM, Jeremy Stanley <[email protected]> wrote:
> On 2015-02-19 17:03:49 +0100 (+0100), Deepak Shetty wrote:
> [...]
> > For some reason we are seeing the centos7 glusterfs CI job getting
> > aborted/killed either by a Java exception or the build getting
> > aborted due to timeout.
> [...]
> > Hoping to root cause this soon and get the cinder-glusterfs CI job
> > back online soon.
>
> I manually reran the same commands this job runs on an identical
> virtual machine and was able to reproduce some substantial
> weirdness.
>
> I temporarily lost remote access to the VM around 108 minutes into
> running the job (~17:50 in the logs) and the out-of-band console
> also became unresponsive to carriage returns. The machine's IP
> address still responded to ICMP ping, but attempts to open new TCP
> sockets to the SSH service never got a protocol version banner back.
> After about 10 minutes of that I went out to lunch but left
> everything untouched. To my excitement it was up and responding
> again when I returned.
>
> It appears from the logs that it runs well past the 120-minute mark
> where devstack-gate tries to kill the gate hook for its configured
> timeout. Somewhere around 165 minutes in (18:47) you can see the
> kernel out-of-memory killer start to kick in and kill httpd and
> mysqld processes according to the syslog. Hopefully this is enough
> additional detail to get you a start at finding the root cause so
> that we can re-enable your job. Let me know if there's anything else
> you need for this.
>
> [1] http://fungi.yuggoth.org/tmp/logs.tar
> --
> Jeremy Stanley
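P.S. For reference, the policy.json change being reverted is roughly of the following form; the rule names below are from the nova policy.json as I remember it, so treat this as a sketch and see the review for the exact diff. The default restricts hypervisor-assisted volume snapshots to admin:

    "compute_extension:os-assisted-volume-snapshots:create": "rule:admin_api",
    "compute_extension:os-assisted-volume-snapshots:delete": "rule:admin_api",

The glusterfs job had relaxed these rules to "" (any user), which is what enabled the online snapshot tests; the revert restores the admin-only default.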
