On Nov 14, 2013, at 1:17 PM, Nikhil Bhatia <[email protected]> wrote:
> I am observing a huge gap between the "total allocated" & 
> "active" counts in the jemalloc stats. The "active" & "mapped"
> correctly point to the RSS and VIRT counters in top. Below
> is a snippet of the stats output. 
> 
> How should I infer this gap? Is this the fragmentation caused
> by the chunk metadata & unused dirty pages?

The gap is due to external fragmentation of small-object page runs.  I computed 
per-size-class fragmentation and the overall blame for the fragmented memory:

bin  size  regs  pgs  allocated  cur runs  utilization  % of small  frag memory  % of blame
  0     8   501    1   50937728     40745          31%          1%    112368232          4%
  1    16   252    1   77020144     21604          88%          2%     10087184          0%
  2    32   126    1  429852096    231731          46%         12%    504487296         20%
  3    48    84    1  774254160    344983          56%         22%    616717296         24%
  4    64    63    1  270561344    102283          66%          8%    141843712          6%
  5    80    50    1  526179760    163248          81%         15%    126812240          5%
  6    96    84    2   66918048     20469          41%          2%     98143968          4%
  7   112    72    2  141823360     31895          55%          4%    115377920          4%
  8   128    63    2  117911808     22666          65%          3%     64866816          3%
  9   160    51    2  104119200     22748          56%          3%     81504480          3%
 10   192    63    3  178081344     20630          71%          5%     71459136          3%
 11   224    72    4   65155104      5327          76%          2%     20758752          1%
 12   256    63    4   48990208      7009          43%          1%     64050944          2%
 13   320    63    5   99602240     10444          47%          3%    110948800          4%
 14   384    63    6   22376448      1897          49%          1%     23515776          1%
 15   448    63    7   19032384      2290          29%          1%     45600576          2%
 16   512    63    8   83511808      4852          53%          2%     72994304          3%
 17   640    51    8   40183040      2979          41%          1%     57051520          2%
 18   768    47    9   17687040       747          66%          1%      9276672          0%
 19   896    45   10   17929856       730          61%          1%     11503744          0%
 20  1024    63   16  226070528      4142          85%          6%     41138176          2%
 21  1280    51   16   24062720       786          47%          1%     27247360          1%
 22  1536    42   16    9480192       326          45%          0%     11550720          0%
 23  1792    38   17    3695104       223          24%          0%     11490304          0%
 24  2048    65   33   42412032       565          56%          1%     32800768          1%
 25  2560    52   33   27392000       760          27%          1%     73779200          3%
 26  3072    43   33    1959936        65          23%          0%      6626304          0%
 27  3584    39   35   24493056       235          75%          1%      8354304          0%

utilization = allocated / (size * regs * cur runs)
% of small = allocated / total allocated
frag memory = (size * regs * cur runs) - allocated
% of blame = frag memory / total frag memory
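
For reference, here is a minimal C sketch of that arithmetic, using the numbers 
for bins 2 and 3 from the table above.  The struct and program are illustrative 
only; the inputs come straight from the per-bin stats output.

#include <stdio.h>

/* Per-bin numbers as reported in the stats output. */
struct bin_stat {
    size_t size;       /* region size for this bin */
    size_t regs;       /* regions per run */
    size_t allocated;  /* bytes currently allocated from this bin */
    size_t cur_runs;   /* runs currently backing this bin */
};

int
main(void)
{
    /* Bins 2 and 3 from the table (32- and 48-byte size classes). */
    struct bin_stat bins[] = {
        {32, 126, 429852096, 231731},
        {48,  84, 774254160, 344983}
    };
    size_t nbins = sizeof(bins) / sizeof(bins[0]);

    for (size_t i = 0; i < nbins; i++) {
        size_t capacity = bins[i].size * bins[i].regs * bins[i].cur_runs;
        size_t frag = capacity - bins[i].allocated;
        double util = (double)bins[i].allocated / (double)capacity;

        printf("size %4zu: utilization %.0f%%, frag memory %zu\n",
            bins[i].size, util * 100.0, frag);
    }
    return (0);
}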

In order for fragmentation to be that bad, your application has to have a 
steady-state memory usage that is well below its peak usage.  In absolute 
terms, the 32- and 48-byte allocations are to blame for nearly half the total 
fragmentation (20% + 24% of the blame), and they have utilization 
(1 - fragmentation) of only 46% and 56%, respectively.

The core of the problem is that short-lived and long-lived object allocations 
are being interleaved even during near-peak memory usage, and when the 
short-lived objects are freed, the long-lived objects keep entire page runs 
active, even if almost all neighboring regions have been freed.  jemalloc is 
robust with regard to multiple grow/shrink cycles, in that its layout policies 
keep fragmentation from increasing from cycle to cycle, but it can do very 
little about the external fragmentation that exists during the low-usage time 
periods.  If the application accumulates long-lived objects (i.e. each peak is 
higher than the previous), then the layout policies tend to cause accumulation 
of long-lived objects in low memory, and fragmentation in high memory is 
proportionally small.  Presumably that's not how your application behaves 
though.

You can potentially mitigate the problem by reducing the number of arenas 
(this only helps if per-thread memory usage spikes are uncorrelated).  Another 
possibility is to segregate short- and long-lived objects into different 
arenas, but this requires reliable (and ideally stable) knowledge of object 
lifetimes.  In practice, such segregation is usually very difficult to 
maintain.  If you choose to go in this direction, take a look at the 
"arenas.extend" mallctl (for creating an arena dedicated to long-lived 
objects) and the ALLOCM_ARENA(a) flag for the [r]allocm() functions.
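
For concreteness, a rough sketch of that setup, assuming jemalloc 3.x built 
with the experimental API and no function prefix; long_lived_arena and 
long_lived_alloc() are illustrative names, not jemalloc interfaces:

#include <stdlib.h>
#include <jemalloc/jemalloc.h>

static unsigned long_lived_arena;

/* Call once at startup: create an extra arena and remember its index. */
static int
long_lived_arena_init(void)
{
    size_t sz = sizeof(long_lived_arena);

    return (mallctl("arenas.extend", &long_lived_arena, &sz, NULL, 0));
}

/* Allocate size bytes from the dedicated long-lived arena. */
static void *
long_lived_alloc(size_t size)
{
    void *p;

    if (allocm(&p, NULL, size, ALLOCM_ARENA(long_lived_arena)) !=
        ALLOCM_SUCCESS)
        return (NULL);
    return (p);
}

Objects allocated this way can still be passed to free() as usual; only the 
allocation side needs to name the arena.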

> I am purging unused
> dirty pages a bit more aggressively than default (lg_dirty_mult: 5). 
> Should I consider being more aggressive? 

Dirty page purging isn't related to this problem.

> Secondly, I am using 1 arena per CPU core but my application creates
> lots of transient threads making small allocations. Should I consider
> using more arenas to mitigate performance bottlenecks incurred due to
> blocking on per-arena locks?

In general, the more arenas you have, the worse fragmentation is likely to be.  
Use the smallest number of arenas that doesn't unacceptably degrade throughput.
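
If you want to experiment with that, the arena count can be pinned via the 
options string jemalloc reads at startup; for example (the value 4 is only a 
placeholder to tune against your own throughput measurements, and this assumes 
an un-prefixed build):

/* Compiled into the application; equivalent to setting
 * MALLOC_CONF="narenas:4" in the environment. */
const char *malloc_conf = "narenas:4";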

> Finally, looking at the jemalloc stats how should I go about 
> configuring the tcache? My application has a high thread churn & 
> each thread performs lots of short-lived small allocations. Should
> I consider decreasing lg_tcache_max to 4K? 

This probably won't have much effect one way or the other, but setting 
lg_tcache_max to 12 will potentially reduce memory overhead, so go for it if 
application throughput doesn't degrade unacceptably as a side effect.

It's worth mentioning that the tcache is a cause of fragmentation, because it 
thwarts jemalloc's layout policy of always choosing the lowest available 
region.  Fragmentation may go down substantially if you completely disable the 
tcache, though the potential throughput degradation may be unacceptable.
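
Both tcache knobs are reachable through the same options string; a sketch, 
again assuming an un-prefixed build (pick one line, not both):

/* Cap tcached objects at 4 KiB (2^12 bytes)... */
const char *malloc_conf = "lg_tcache_max:12";
/* ...or disable the tcache entirely to measure its contribution to
 * fragmentation: */
/* const char *malloc_conf = "tcache:false"; */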

Jason