RE: jemalloc coring in je_bitmap_set

Paul Marquess Tue, 18 Aug 2015 11:54:01 -0700

> From: Jason Evans [mailto:jas...@canonware.com] 
 
> On Aug 18, 2015, at 8:49 AM, Paul Marquess <paul.marqu...@owmobility.com> 
> wrote:
> >> From: Jason Evans [mailto:jas...@canonware.com] 
> >> 
> >> On Aug 18, 2015, at 5:14 AM, Paul Marquess <paul.marqu...@owmobility.com> 
> >> wrote:
> >>> I see a reference to a fix for arena_tcache_fill_small and corruption in 
> >>> the 4.0 ChangeLog. Any chance it could be the root cause for this issue?
> >> 
> >> It's possible, but the failure mode for that bug depends on failing to map 
> >> memory (i.e. extreme memory pressure).
> > 
> > Do you mean a failure in the call to mmap? I assume that isn't necessarily 
> > catastrophic (otherwise you would assert straight away).
> 
> Yes, mmap() and sbrk() failure.  It should simply result in malloc() 
> returning NULL, but the arena_tcache_fill_small bug you mentioned caused 
> corruption that would later cause crashes.


Guess we need to wrap jemalloc's malloc and get it to assert when it returns 
NULL. Perhaps get a dump of jemalloc's state as well. Would the stats interface 
in jemalloc still be operational if we are OOM? The alternative is to get the 
stats from the core. I see there are a couple of core file postmortem 
scripts for jemalloc knocking about, but none seem to support 3.6. 
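
Something along these lines is what I have in mind for the wrapper. It is only 
a rough sketch: checked_malloc, set_oom_hook, report_oom and the two globals 
are names I've made up, not anything jemalloc provides. The hook is where a 
state dump would go. With jemalloc linked in, it could call 
malloc_stats_print(NULL, NULL, NULL) (or the je_-prefixed name, depending on 
how the library was built), assuming that call still works when we are OOM, 
which is exactly what I don't know.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical hook type: called once when an allocation fails,
 * before the process aborts. Intended for dumping allocator state. */
typedef void (*oom_hook_fn)(size_t requested);

oom_hook_fn oom_hook = NULL;     /* optional state dump, unset by default */
int oom_abort_enabled = 1;       /* assumption: cleared only in tests */
unsigned long oom_count = 0;     /* number of failed allocations seen */

void set_oom_hook(oom_hook_fn hook) { oom_hook = hook; }

/* Log the failure, run the hook, then abort so the core is taken at
 * the point of failure rather than at some later crash. */
void report_oom(size_t requested) {
    oom_count++;
    fprintf(stderr, "checked_malloc: allocation of %zu bytes failed\n",
            requested);
    if (oom_hook != NULL)
        oom_hook(requested);
    if (oom_abort_enabled)
        abort();
}

/* Drop-in replacement for malloc() at our call sites. malloc(0) may
 * legitimately return NULL, so that case is not treated as OOM. */
void *checked_malloc(size_t size) {
    void *p = malloc(size);
    if (p == NULL && size != 0)
        report_oom(size);
    return p;
}
```

The hook uses stderr and no further allocation on purpose, since anything that 
needs memory at that point may fail as well.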

Something else has occurred to me: we had a problem with THP and 
uninterruptible sleep (~30 seconds) very recently that was fixed by tuning the 
swappiness parameter. When researching that I spotted a number of threads 
suggesting that the combination of THP and jemalloc can result in memory growth. 
This is an example: 
https://www.digitalocean.com/company/blog/transparent-huge-pages-and-alternative-memory-allocators/
I know it's too much of a stretch to suggest that this is the root cause of 
the OOM, but if it does cause memory growth it won't help.

Do you have any feeling for whether it is safe to use jemalloc and THP 
together?
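
For reference, a quick way to confirm what THP mode a box is actually running 
in. This is a sketch assuming a mainline Linux kernel, where the setting lives 
in /sys/kernel/mm/transparent_hugepage/enabled and the active value is the one 
in square brackets (some distro kernels use a different path). The function 
names are my own.

```c
#include <stdio.h>
#include <string.h>

/* Copy the bracketed (active) value from a line such as
 * "always [madvise] never" into out.
 * Returns 0 on success, -1 if no bracketed value fits in out. */
int thp_active_setting(const char *line, char *out, size_t outlen) {
    const char *open = strchr(line, '[');
    const char *close = open ? strchr(open, ']') : NULL;
    if (open == NULL || close == NULL || outlen == 0)
        return -1;
    size_t len = (size_t)(close - open - 1);
    if (len >= outlen)
        return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

/* Read the system-wide THP mode from sysfs (Linux only).
 * Returns -1 if the file is missing or cannot be parsed. */
int read_thp_mode(char *out, size_t outlen) {
    char line[128];
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (f == NULL)
        return -1;
    char *ok = fgets(line, sizeof line, f);
    fclose(f);
    return ok ? thp_active_setting(line, out, outlen) : -1;
}
```

Logging the result of read_thp_mode() at startup would at least tell us 
whether the live server and the load test box are configured the same way.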


> > Is there anything in jemalloc (or other tools) I can do to root cause why 
> > that is happening?
> 
> Valgrind is great.  

Indeed it is, and it is a tool we make frequent use of. The problem is it's 
waaaay too slow. The issue only happens on our live server. We've attempted to 
trigger it with a load test, but it has never happened there. 

> There's ASAN (address sanitizer) as well.  

Yep, we've started using that recently. It's found a number of issues for us. 
Very nice it is too. 

> jemalloc with --enable-debug and MALLOC_CONF=tcache:false can catch quite a 
> few issues as well.

I've just dipped my toe into jemalloc's debug features. Need to research that 
some more. 
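
As far as I can tell from the man page, the tcache setting can also be 
compiled into the binary instead of being passed through the MALLOC_CONF 
environment variable, by defining the malloc_conf global that jemalloc reads at 
startup. A minimal sketch, assuming an unprefixed build (with a je_ prefix the 
symbol would be je_malloc_conf); --enable-debug is still a configure-time 
choice and cannot be set this way.

```c
/* Read by jemalloc during initialisation if the program defines it.
 * Same option string as MALLOC_CONF=tcache:false in the environment.
 * Assumption: jemalloc was built without a public symbol prefix. */
const char *malloc_conf = "tcache:false";
```

If the binary is not linked against jemalloc the variable is simply ignored.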

cheers 
Paul
_______________________________________________
jemalloc-discuss mailing list
jemalloc-discuss@canonware.com
http://www.canonware.com/mailman/listinfo/jemalloc-discuss
