Hello:

I am looking at a possible memory corruption issue and am trying to make
sense of the kmem log information. This is a non-debug kernel with
kmem_flags set to 0xf (str_ftnever is 1).

The problem seems to be that freemsg() tries to free an already-freed dblk.
The mp (in the core) has a valid b_datap (dblk), while the registers that
were loaded with the dblk info from the mp hold 0xdeadbeefdeadbeef. So I am
not sure whether the mblk was somehow reallocated and given a new dblk.

> $c
freemsg+0x79()
mir_close+0x48()
rmm_close+0x11()
qdetach+0x82()
strclose+0x3e4()
device_close+0xf0()
spec_close+0x178()
fop_close+0x2c()
closef+0x62()
closeandsetf+0x246()
close+0xb()
sys_syscall32+0x101()

>  $r
%rax = 0x0000000000000001                 %r9  = 0x000000000000000f
%rbx = 0xdeadbeefdeadbeef                 %r10 = 0x0000000000000000
%rcx = 0xffffffff869f6800                 %r11 = 0x0000000000000007
%rdx = 0xfffffe80e08e0800                 %r12 = 0xdeadbeefdeadbeef
%rsi = 0xdeadbeefdeadbeef                 %r13 = 0xfffffe8635459558
%rdi = 0xfffffe80e32a25c0                 %r14 = 0xdeadbeefdeadbeef
%r8  = 0xfffffe8000b1ef10                 %r15 = 0x0000000000000000

%rip = 0xfffffffffb9e62e9 freemsg+0x79
%rbp = 0xfffffe8000b1ec50
%rsp = 0xfffffe8000b1ec30
%rflags = 0x00010246
   id=0 vip=0 vif=0 ac=0 vm=0 rf=1 nt=0 iopl=0x0
     status=<of,df,IF,tf,sf,ZF,af,PF,cf>

        freemsg+0xb:                    movq   %rdi,%rbx
[...]
        freemsg+0x28:                   movq   0x28(%rbx),%r12 (b_datap)
        freemsg+0x2c:                   movq   0x10(%rbx),%r14 (b_cont)

and freemsg+0x79 is:
        call   *0x30(%r12)      (call to db_free)

given:
        freemsg+0x70:                   movq   %rbx,%rdi
        freemsg+0x73:                   movq   %r14,%rbx
        freemsg+0x76:                   movq   %r12,%rsi

I am assuming that %rdi is the mblk in question.

> 0xfffffe80e32a25c0::print mblk_t b_datap b_cont
b_datap = 0xffffffffa9806600
b_cont = 0xffffffff8733ca80

> 0xffffffffa9806600::whatis
ffffffffa9806600 is ffffffffa9806600+0, bufctl ffffffffa981bb18 allocated from
streams_dblk_80

I am not sure whether %rcx could be the old b_datap; using its bufctl:

> ffffffff8697b7d8::bufctl -v
             ADDR          BUFADDR        TIMESTAMP           THREAD
                             CACHE          LASTLOG         CONTENTS
ffffffff8697b7d8 ffffffff869f6800      64361491591 fffffe80e08e08e0
                  ffffffff80067008 ffffffff814fd2c0 ffffffff81e60e20
                  kmem_cache_free_debug+0xf4
                  kmem_cache_free+0x43
                  dblk_lastfree+0x3d
                  freemsg+0x7e
                  mir_close+0x48
                  rmm_close+0x11
                  qdetach+0x82
                  strclose+0x3e4
                  device_close+0xf0
                  spec_close+0x178
                  fop_close+0x2c
                  closef+0x62
                  closeandsetf+0x246
                  close+0xb

Looking at the log for the new dblk - 0xffffffffa9806600 - I see the
following entries (using ::walk kmem_log | ::bufctl -va
<0xffffffffa9806600's bufctl>):

[...]
ffffffff807f2d40 ffffffffa9806600      642a445e35e fffffe8123a693a0
                  ffffffff80065008 ffffffff807e4f40 ffffffff82949d60
                  kmem_cache_free_debug+0xf4
                  kmem_cache_free+0x43
                  dblk_lastfree+0x3d
                  freemsg+0x7e
                  tcp_rput_data+0x16ff
                  squeue_drain+0xf0
                  squeue_enter+0xb1
                  tcp_wput+0x86
                  putnext+0x1f1
                  mir_wput+0x153
                  rmm_wput+0x11
                  put+0x1b0
                  svc_cots_ksend+0x8d
                  svc_sendreply+0x48
                  common_dispatch+0x3b9

ffffffff807e47c0 ffffffffa9806600      642a42605d8 fffffe8123a693a0
                  ffffffff80065008 ffffffff807c6400 ffffffff82944e10
                  kmem_cache_free_debug+0xf4
                  kmem_cache_free+0x43
                  dblk_lastfree+0x3d
                  freemsg+0x7e
                  tcp_rput_data+0x16ff
                  squeue_drain+0xf0
                  squeue_enter+0xb1
                  tcp_wput+0x86
                  putnext+0x1f1
                  mir_wput+0x153
                  rmm_wput+0x11
                  put+0x1b0
                  svc_cots_ksend+0x8d
                  svc_sendreply+0x48
                  common_dispatch+0x3b9

ffffffff807c6400 ffffffffa9806600      642a3d90aca fffffe80000b3c80
                  ffffffff80065008 ffffffff807b7040 ffffffff82935908
                  kmem_cache_alloc_debug+0x1fa
                  kmem_cache_alloc+0x6f
                  allocb+0x77
                  nge_recv_packet+0x9a
                  nge_recv_ring+0x161
                  nge_receive+0x31
                  nge_intr_handle+0x13d
                  nge_chip_intr+0x7c
                  av_dispatch_autovect+0x78
[...]

(there are quite a few transactions before and after these)

A ::kmem_log (for ffffffffa9806600) has the following, strictly in sequence:

[...]
     ffffffff807f2d40 ffffffffa9806600      642a445e35e fffffe8123a693a0
     ffffffff807e47c0 ffffffffa9806600      642a42605d8 fffffe8123a693a0
     ffffffff807c6400 ffffffffa9806600      642a3d90aca fffffe80000b3c80
[...]

What I am wondering is whether the log above is complete, i.e. whether
there really are two frees for the dblk, or whether the logging information
is missing an alloc in between. I suppose it is the latter, since a second
free would otherwise have panicked the system (even a non-debug kernel).

Another reason to suspect that I am missing some log records comes from
the result of "::walk kmem_log | ::bufctl -va
<0xfffffe80e32a25c0's bufctl>", which has:

[...]
ffffffff80f94f40 fffffe80e32a25c0      64308cdc495 fffffe88eccdec00
                  ffffffff80063008 ffffffff80f86000 ffffffff81c7a300
                  kmem_cache_free_debug+0xf4
                  kmem_cache_free+0x43
                  dblk_destructor+0x25
                  kmem_cache_free_debug+0x1e7
                  kmem_cache_free+0x43
                  dblk_lastfree+0x3d
                  freemsg+0x7e
                  tcp_rput_data+0x16ff
                  squeue_drain+0xf0
                  squeue_enter+0xb1
                  tcp_wput+0x86
                  putnext+0x1f1
                  mir_wput+0x153
                  rmm_wput+0x11
                  put+0x1b0

ffffffff80f86000 fffffe80e32a25c0      64308ad8cf9 fffffe80000bfc80
                  ffffffff80063008 ffffffff80f85f40 ffffffff81c74da0
                  kmem_cache_alloc_debug+0x1fa
                  kmem_cache_alloc+0x6f
                  dblk_constructor+0x36
                  kmem_cache_alloc_debug+0x276
                  kmem_cache_alloc+0x6f
                  allocb+0x77
                  nge_recv_packet+0x9a
                  nge_recv_ring+0x161
                  nge_receive+0x31
                  nge_intr_handle+0x13d
                  nge_chip_intr+0x7c
                  av_dispatch_autovect+0x78

ffffffff80f85b80 fffffe80e32a25c0      64308ac93b4 fffffe88eccdec00
                  ffffffff80063008 ffffffff80f85ac0 ffffffff81c74c60
                  kmem_cache_alloc_debug+0x1fa
                  kmem_cache_alloc+0x6f
                  dblk_constructor+0x36
                  kmem_cache_alloc_debug+0x276
                  kmem_cache_alloc+0x6f
                  allocb+0x77
                  allocb_tmpl+0x11
                  copyb+0x72
                  copymsg+0x21
                  copymsgchain+0x28
                  i_dls_link_ether_loopback+0xd9
                  mac_txloop+0x91
                  dls_tx+0xe
                  str_mdata_fastpath_put+0x1f
                  tcp_send_data+0x650
[...]

(there are quite a few transactions before and after these).

I can't see how there could be two allocs for the mblk without a free
somewhere in between.

A ::kmem_log has the following for fffffe80e32a25c0, strictly in sequence:

[...]
     ffffffff80f94f40 fffffe80e32a25c0      64308cdc495 fffffe88eccdec00
     ffffffff80f86000 fffffe80e32a25c0      64308ad8cf9 fffffe80000bfc80
     ffffffff80f85b80 fffffe80e32a25c0      64308ac93b4 fffffe88eccdec00
[...]

There are no alloc failures in the core, and ::kmem_verify also reports
that everything is clean.

Basically, I am not sure how complete/reliable the logging information is,
and whether I can conclude that there is a problem just by looking at the
logs above (I guess I can't, but I want to confirm). If more information is
needed, I can include it.

thanks, in advance,

-venu


