Another potentially interesting devstack service that may help us
understand our memory usage is peakmem_tracker. At this point, it's not
enabled anywhere. I proposed a devstack-gate patch to enable it at:
https://review.openstack.org/#/c/434511/
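If the patch merges, enabling it locally should just be the usual
devstack service toggle. A minimal local.conf sketch, assuming the
service keeps the peakmem_tracker name used in the patch:

    [[local|localrc]]
    # hypothetical toggle; service name taken from the proposed patch
    enable_service peakmem_tracker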
On Wed, Feb 15, 2017 at 12:38 PM, Ihar Hrachyshka <ihrac...@redhat.com> wrote:
> Another potentially relevant piece of info: we saw before that
> oom-killer is triggered while 8 GB of swap are barely used. This
> behavior is hard to explain, since we set the kernel swappiness sysctl
> knob to 30:
>
> https://github.com/openstack-infra/devstack-gate/blob/master/functions.sh#L432
>
> (and any value above 0 means that if memory is requested, and there is
> swap available to fulfill it, it will not fail to allocate memory;
> swappiness only controls the kernel's willingness to swap process pages
> instead of dropping disk cache entries, so it may affect performance,
> but it should not affect malloc behavior).
>
> The only reason I can think of for a memory allocation request to
> trigger the trap when swap is free is when the memory request is for a
> RAM-locked page (it can either be memory locked with mlock(2), or
> mmap(2) when MAP_LOCKED is used). To understand whether that's the case
> in gate, I am adding a new mlock_tracker service to devstack:
> https://review.openstack.org/#/c/434470/
>
> The patch that enables the service in Pike+ gate is:
> https://review.openstack.org/#/c/434474/
>
> Thanks,
> Ihar
>
> On Wed, Feb 15, 2017 at 5:21 AM, Andrea Frittoli
> <andrea.fritt...@gmail.com> wrote:
>> Some (new?) data on the oom kill issue in the gate.
>>
>> I filed a new bug / E-R query for the issue [1][2] since it looks to
>> me like the issue is not specific to mysqld - oom-kill will just pick
>> the best candidate, which in most cases happens to be mysqld. The next
>> most likely candidate to show errors in the logs is keystone, since
>> token requests are rather frequent, probably more than any other API
>> call.
>>
>> According to logstash [3], all failures identified by [2] happen on
>> RAX nodes, which I hadn't realised before.
>>
>> Comparing dstat data between the failed run and a successful one on an
>> OVH node [4], the main difference I can spot is free memory.
>> For the same test job, free memory tends to be much lower, quite close
>> to zero for the majority of the time on the RAX node. My guess is that
>> an unlucky scheduling of tests may cause a slightly higher peak in
>> memory usage and trigger the oom-kill.
>>
>> I find it hard to relate lower free memory to a specific cloud
>> provider / underlying virtualisation technology, but maybe someone has
>> an idea about how that could be?
>>
>> Andrea
>>
>> [0] http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28
>> [1] https://bugs.launchpad.net/tempest/+bug/1664953
>> [2] https://review.openstack.org/434238
>> [3] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Out%20of%20memory%3A%20Kill%20process%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22
>> [4] http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/1dfb4b7/logs/dstat-csv_log.txt.gz
>>
>> On Mon, Feb 6, 2017 at 10:13 AM Miguel Angel Ajo Pelayo
>> <majop...@redhat.com> wrote:
>>>
>>> Jeremy Stanley wrote:
>>>
>>> > It's an option of last resort, I think. The next consistent flavor
>>> > up in most of the providers donating resources is double the one
>>> > we're using (which is a fairly typical pattern in public clouds). As
>>> > aggregate memory constraints are our primary quota limit, this would
>>> > effectively halve our current job capacity.
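Regarding the swappiness and RAM-locked pages discussion above: a quick
way to eyeball both on a live node is sketched below. This is only an
illustration of the idea via the standard /proc interfaces, not
necessarily what the proposed mlock_tracker service does:

    # current swappiness value (devstack-gate sets it to 30)
    cat /proc/sys/vm/swappiness

    # per-process RAM-locked memory (VmLck, in kB); pages pinned via
    # mlock(2) or mmap(2) with MAP_LOCKED show up here and cannot be
    # swapped out, so they can trip oom-killer even with free swap
    awk '/^VmLck/ && $2 > 0 {print FILENAME, $2, "kB"; sum += $2}
         END {print "total locked:", sum+0, "kB"}' /proc/[0-9]*/status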
>>>
>>> Properly coordinated with all the cloud providers, they could create
>>> flavours which are private but available to our tenants, where 25-50%
>>> more RAM would be just enough.
>>>
>>> I agree that should probably be a last resort tool, and we should keep
>>> looking for proper ways to find where we consume unnecessary RAM and
>>> make sure that's properly freed up.
>>>
>>> It could be interesting to coordinate such flavour creation in the
>>> meantime; even if we don't use it now, we could eventually test it or
>>> put it to work if we find ourselves trapped anytime later.
>>>
>>> On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann <mriede...@gmail.com>
>>> wrote:
>>>>
>>>> On 2/5/2017 1:19 PM, Clint Byrum wrote:
>>>>>
>>>>> Also I wonder if there's ever been any serious consideration given
>>>>> to switching to protobuf? Feels like one could make
>>>>> oslo.versionedobjects a wrapper around protobuf relatively easily,
>>>>> but perhaps that's already been explored in a forum that I wasn't
>>>>> paying attention to.
>>>>
>>>> I've never heard of anyone attempting that.
>>>>
>>>> --
>>>>
>>>> Thanks,
>>>>
>>>> Matt Riedemann
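Regarding Miguel's private-flavour suggestion above, a rough sketch of
what that could look like with the openstack CLI (the flavor name,
sizes, and project name are made up for illustration):

    # create a private flavor with ~25% more RAM than the standard 8 GB
    openstack flavor create --ram 10240 --vcpus 8 --disk 80 --private gate-xl

    # grant our CI tenant access to it (project name is hypothetical)
    openstack flavor set --project openstack-ci gate-xl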