Lately I've been getting reports of one of our server applications core dumping in 
je_bitmap_set.

There is a common pattern in the stack traces from all the cores. From the call 
to malloc downwards the trace is always the same (shown below). The frames above 
the call to malloc don't show any common pattern, and there is nothing 
immediately obvious in the application code that calls malloc. We are running 
stock jemalloc 3.6.0 on 64-bit Red Hat 6.4.

This issue cannot be reproduced in our test setup, so I'm looking for ideas on 
how to root-cause it from the core files.

Here are the details of my analysis so far.

#0  0x00002b997a4a65df in je_bitmap_set (bit=18446744073709551555, 
    binfo=0x2b997a6d1540 <je_arena_bin_info+768>, bitmap=0x2b999e660010)
    at include/jemalloc/internal/bitmap.h:101
#1  je_bitmap_sfu (binfo=0x2b997a6d1540 <je_arena_bin_info+768>, 
    bitmap=0x2b999e660010) at include/jemalloc/internal/bitmap.h:140
#2  arena_run_reg_alloc (bin_info=0x2b997a6d1518 <je_arena_bin_info+728>, 
    run=0x2b999e660000) at src/arena.c:291
#3  je_arena_tcache_fill_small (arena=0x2b997b07f700, tbin=0x2b99a5406108, 
    binind=7, prof_accumbytes=<optimized out>) at src/arena.c:1479
#4  0x00002b997a4c55af in je_tcache_alloc_small_hard (tcache=<optimized out>, 
    tbin=0x2b99a5406108, binind=<optimized out>) at src/tcache.c:72
#5  0x00002b997a49a3fd in je_tcache_alloc_small (size=<optimized out>, 
    tcache=0x2b99a5406000, zero=false)
    at include/jemalloc/internal/tcache.h:303
#6  je_arena_malloc (try_tcache=true, zero=false, size=<optimized out>, 
    arena=0x0) at include/jemalloc/internal/arena.h:957
#7  je_imalloct (arena=0x0, try_tcache=true, size=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal.h:771
#8  je_imalloc (size=<optimized out>)
    at include/jemalloc/internal/jemalloc_internal.h:780
#9  malloc (size=<optimized out>) at src/jemalloc.c:929

The crash is reported at line 101 of bitmap_set, shown below (line numbers are 
from include/jemalloc/internal/bitmap.h):

 88     JEMALLOC_INLINE void
 89     bitmap_set(bitmap_t *bitmap, const bitmap_info_t *binfo, size_t bit)
 90     {
 91             size_t goff;
 92             bitmap_t *gp;
 93             bitmap_t g;
 94     
 95             assert(bit < binfo->nbits);
 96             assert(bitmap_get(bitmap, binfo, bit) == false);
 97             goff = bit >> LG_BITMAP_GROUP_NBITS;
 98             gp = &bitmap[goff];
 99             g = *gp;
100             assert(g & (1LU << (bit & BITMAP_GROUP_NBITS_MASK)));
101             g ^= 1LU << (bit & BITMAP_GROUP_NBITS_MASK);
In all the cores "bit" ends up with a huge value close to (size_t)-1 
(0xffffffffffffffc3 in the trace above). That makes "goff" enormous, so "gp" 
points far outside the bitmap and dereferencing it faults. For example, this is 
what I see in one of the cores:

    (gdb) p bitmap
    $15 = (bitmap_t *) 0x2b999e660010
    (gdb) p gp
    $16 = (bitmap_t *) 0x20002b999e660008
    (gdb) p g
    Cannot access memory at address 0x20002b999e660008
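
As a sanity check (this is not jemalloc code, just my own arithmetic), the bogus 
gp is exactly what such a bit value produces at lines 97-98 above, assuming an 
8-byte bitmap_t and LG_BITMAP_GROUP_NBITS == 6 as on a 64-bit build:

    #include <stdio.h>
    #include <stdint.h>

    /* Recompute gp = &bitmap[bit >> LG_BITMAP_GROUP_NBITS] with the
     * values seen in the core. */
    int main(void)
    {
            uintptr_t bitmap = 0x2b999e660010UL;  /* "bitmap" from gdb above    */
            size_t    bit    = (size_t)-1;        /* roughly the observed "bit" */
            size_t    goff   = bit >> 6;          /* 0x03ffffffffffffff         */
            uintptr_t gp     = bitmap + goff * sizeof(uint64_t);

            printf("gp = 0x%lx\n", (unsigned long)gp);  /* 0x20002b999e660008 */
            return 0;
    }

That prints 0x20002b999e660008, matching the gp in the gdb session, so the bad 
pointer is fully explained by the bad bit value.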

Here is an analysis of how "bit" ended up with that value in bitmap_sfu, using 
the same core file:

121     /* sfu: set first unset. */
122     JEMALLOC_INLINE size_t
123     bitmap_sfu(bitmap_t *bitmap, const bitmap_info_t *binfo)
124     {
125             size_t bit;
126             bitmap_t g;
127             unsigned i;
128     
129             assert(bitmap_full(bitmap, binfo) == false);
130     
131             i = binfo->nlevels - 1;
132             g = bitmap[binfo->levels[i].group_offset];
133             bit = ffsl(g) - 1;
134             while (i > 0) {
135                     i--;
136                     g = bitmap[binfo->levels[i].group_offset + bit];
137                     bit = (bit << LG_BITMAP_GROUP_NBITS) + (ffsl(g) - 1);
138             }
139     
140             bitmap_set(bitmap, binfo, bit);
141             return (bit);
142     }


(gdb) p binfo->nlevels
$3 = 2
(gdb) p binfo->levels[1].group_offset
$4 = 2
(gdb) p bitmap[2]
$5 = 1

(gdb) hexdump memory bitmap bitmap+16
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000010  01 00 00 00 00 00 00 00  50 79 eb 08 74 2b 00 00  |........Py..t+..|
00000020  b0 79 eb 08 74 2b 00 00  00 00 00 00 01 00 00 00  |.y..t+..........|
00000030  02 00 00 00 01 00 00 00  01 00 00 00 01 00 00 00  |................|
00000040  b0 02 ea 08 74 2b 00 00  40 18 ea 08 74 2b 00 00  |....t+..@...t+..|
00000050  b0 12 ea 08 74 2b 00 00  00 00 00 00 00 00 00 00  |....t+..........|
00000060  00 00 00 00 09 00 00 00  e5 ff ff ff 12 00 00 00  |................|
00000070  02 00 00 00 01 00 00 00  00 00 00 00 00 00 00 00  |................|

The above means that "i" will be 1 at line 131, "g" will be 1 at line 132 
(bitmap[levels[1].group_offset] == bitmap[2] == 1), and "bit" will be 0 at 
line 133.

The while loop then runs exactly once (because "i" is 1). At line 136 "g" is 
loaded from bitmap[levels[0].group_offset + 0], i.e. bitmap[0], which is 0 in 
the dump above. ffsl(0) returns 0, so line 137 computes 
bit = (0 << LG_BITMAP_GROUP_NBITS) + (0 - 1), which underflows size_t and 
leaves "bit" at 0xffffffffffffffff. In other words the level-1 summary word 
claims group 0 still has a set bit, but the level-0 word has no bits set at all.
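
As another sanity check, here is a minimal stand-alone replay (again, not 
jemalloc code) of the bitmap_sfu loop using only the values observed in this 
core: binfo->nlevels == 2, levels[1].group_offset == 2, 
levels[0].group_offset == 0, and the group words from the hexdump; 
LG_BITMAP_GROUP_NBITS is 6 on a 64-bit build:

    #define _GNU_SOURCE                 /* for ffsl() on glibc */
    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void)
    {
            /* Group words from the hexdump: bitmap[0], bitmap[1], bitmap[2] */
            uint64_t bitmap[3] = { 0x0, 0x0, 0x1 };
            uint64_t g;
            size_t bit;
            unsigned i;

            i = 2 - 1;                          /* binfo->nlevels - 1          */
            g = bitmap[2];                      /* levels[1].group_offset == 2 */
            bit = ffsl((long)g) - 1;            /* ffsl(1) == 1 -> bit == 0    */
            while (i > 0) {
                    i--;
                    g = bitmap[0 + bit];        /* bitmap[0] == 0              */
                    bit = (bit << 6) + (ffsl((long)g) - 1);  /* ffsl(0) == 0   */
            }
            printf("bit = 0x%zx\n", bit);       /* 0xffffffffffffffff          */
            return 0;
    }

It prints bit = 0xffffffffffffffff, i.e. the underflow comes straight from 
ffsl(0) returning 0 while the level-1 summary word still claims group 0 has an 
unset bit.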

That's as far as I've got.

Until the start of this week I didn't know the jemalloc code at all, so I'm not 
sure what my analysis implies about the root cause of the crash. Any suggestions 
about what to check next would be most welcome.

cheers
Paul
