What about the dmesg output? It's unlikely the Lustre debug log can help here, as the problem
seems to be very internal to ldiskfs (the mballoc piece of it).

thanks, Alex

Aaron Knister wrote:
> I bumped up debugging, and here's (below) the last bit of debugging
> info from Lustre that I have on the OSS before it went belly up. My
> system is totally inoperable. Does anybody have any ideas?
> 
> 00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) About to add this lock:
> 00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump()) -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) (pid: 6378)
> 00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump()) Node: local
> 00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump()) Resource: ffff8103849211c0 (3129942/0)
> 00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump()) Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: 0x80004000
> 00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump()) Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615)
> 00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) forcing cancel of local lock
> 00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) kmalloced 'blwi': 120 at ffff81040e90a340 (tot 49135175)
> 00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
> 00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
> 00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
> 00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 from hash
> 00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) kmalloced 'set': 104 at ffff8104012d38c0 (tot 49135279)
> 00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) kfreed 'set': 104 at ffff8104012d38c0 (tot 49135175).
> 00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) putref res: ffff8103849211c0 count: 0
> 00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) kfreed 'res->lr_lvb_data': 40 at ffff810379ded880 (tot 49135135).
> 00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) slab-freed 'res': 224 at ffff8103849211c0 (tot 49135135).
> 00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) kfreed 'blwi': 120 at ffff81040e90a340 (tot 49134791).
> 00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
> 00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) putting 3129942: ffff8103599cea98, count = 0
> 00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) crew4-OST0001: after destroy: set last_objids[0] = 3129941
> 00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) crew4-OST0001: server last_objid for group 0: 3129941
> 00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) slab-freed 'fcb': 56 at ffff810371404920 (tot 49134335).
> 00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) last_committed 17896268, xid 3841
> 00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) connection=ffff8103fbe9e2c0 refcount 10 to [email protected]
> 00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) conn=ffff8103fbe9e2c0 id [EMAIL PROTECTED]
> 00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) kmalloced 'md': 136 at ffff81040cb6cb80 (tot 9568949).
> 00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff8104285e1e00 (tot 9569285).
> 00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) connection=ffff8103fbe9e2c0 refcount 9 to [email protected]
> 00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) RPC PUTting export ffff8103848e9000 : new rpc_count 0
> 00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 4
> 00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 3
> 00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) kfreed 'md': 136 at ffff81040cb6cb80 (tot 9569149).
> 00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) kfreed 'msg': 336 at ffff8104285e1e00 (tot 9568813).
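
One detail worth calling out in the lock dump above: the extent `0 -> 18446744073709551615` covers the entire object, since that end offset is the maximum unsigned 64-bit value (all ones), which Lustre uses to mean "through EOF". A quick sanity check of the number:

```python
# The extent end offset from the lock dump is 2^64 - 1, i.e. the
# all-ones 64-bit value, so the PW lock spans the whole object.
end = 18446744073709551615
assert end == 2**64 - 1
print(hex(end))  # 0xffffffffffffffff
```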
> 
> 
> On Wed, May 28, 2008 at 8:03 PM, Aaron Knister <[EMAIL PROTECTED] 
> <mailto:[EMAIL PROTECTED]>> wrote:
> 
>     Thank you very much for looking into this. I've attached my dmesg to
>     the bug. I looked at line number 1334, which the panic seems to
>     reference, but I can't figure out what it's doing.
> 
>     On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev
>     <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
> 
>         Aaron Knister wrote:
> 
>             I'm seeing this bug (14465) under heavy load on my OSSes. If
>             I reboot the MDS it appears to help...any ideas? What's the
>             status on this bug?
> 
> 
>         Could you attach your dmesg to the bug? As for the status: I'm
>         still not able to reproduce this, nor have I found a possible
>         cause, sorry.
> 
>         thanks, Alex
> 
> 
> 

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
