What about the dmesg output? It's unlikely the Lustre debug log can help here, as the problem
seems to be very internal to ldiskfs (the mballoc piece of it).

thanks, Alex

Aaron Knister wrote:
> I bumped up debugging, and here's (below) the last bit of debugging
> info from Lustre that I have on the OSS before it went belly up. My
> system is totally inoperable. Does anybody have any ideas?
> 
> 00010000:00001000:4:1212157576.884909:0:6378:0:(ldlm_resource.c:865:ldlm_resource_add_lock()) About to add this lock:
> 00010000:00001000:4:1212157576.884910:0:6378:0:(ldlm_lock.c:1718:ldlm_lock_dump()) -- Lock dump: ffff81036e61f1c0/0x1fb135e1e3fd2cc6 (rc: 3) (pos: 0) (pid: 6378)
> 00010000:00001000:4:1212157576.884913:0:6378:0:(ldlm_lock.c:1726:ldlm_lock_dump()) Node: local
> 00010000:00001000:4:1212157576.884914:0:6378:0:(ldlm_lock.c:1735:ldlm_lock_dump()) Resource: ffff8103849211c0 (3129942/0)
> 00010000:00001000:4:1212157576.884915:0:6378:0:(ldlm_lock.c:1740:ldlm_lock_dump()) Req mode: PW, grant mode: PW, rc: 3, read: 0, write: 1 flags: 0x80004000
> 00010000:00001000:4:1212157576.884917:0:6378:0:(ldlm_lock.c:1746:ldlm_lock_dump()) Extent: 0 -> 18446744073709551615 (req 0-18446744073709551615)
> 00010000:00000040:4:1212157576.884920:0:6378:0:(ldlm_lock.c:615:ldlm_lock_decref_internal()) forcing cancel of local lock
> 00010000:00000010:4:1212157576.884922:0:6378:0:(ldlm_lockd.c:1357:ldlm_bl_to_thread()) kmalloced 'blwi': 120 at ffff81040e90a340 (tot 49135175)
> 00002000:00000040:4:1212157576.884925:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
> 00002000:00000040:4:1212157576.884930:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
> 00002000:00000040:4:1212157576.884931:0:6378:0:(lustre_fsfilt.h:194:fsfilt_start_log()) started handle ffff8103766dfc78 (0000000000000000)
> 00000020:00000040:4:1212157576.884957:0:5557:0:(lustre_handles.c:121:class_handle_unhash_nolock()) removing object ffff81036e61f1c0 with handle 0x1fb135e1e3fd2cc6 from hash
> 00000100:00000010:4:1212157576.884960:0:5557:0:(client.c:394:ptlrpc_prep_set()) kmalloced 'set': 104 at ffff8104012d38c0 (tot 49135279)
> 00000100:00000010:4:1212157576.884962:0:5557:0:(client.c:457:ptlrpc_set_destroy()) kfreed 'set': 104 at ffff8104012d38c0 (tot 49135175).
> 00010000:00000040:4:1212157576.884964:0:5557:0:(ldlm_resource.c:818:ldlm_resource_putref()) putref res: ffff8103849211c0 count: 0
> 00010000:00000010:4:1212157576.884969:0:5557:0:(ldlm_resource.c:828:ldlm_resource_putref()) kfreed 'res->lr_lvb_data': 40 at ffff810379ded880 (tot 49135135).
> 00010000:00000010:4:1212157576.885000:0:5557:0:(ldlm_resource.c:829:ldlm_resource_putref()) slab-freed 'res': 224 at ffff8103849211c0 (tot 49135135).
> 00010000:00000010:4:1212157576.885002:0:5557:0:(ldlm_lockd.c:1657:ldlm_bl_thread_main()) kfreed 'blwi': 120 at ffff81040e90a340 (tot 49134791).
> 00002000:00000040:4:1212157576.885623:0:6378:0:(lustre_fsfilt.h:270:fsfilt_commit()) committing handle ffff8103766dfc78
> 00002000:00000002:4:1212157576.885625:0:6378:0:(filter.c:148:f_dput()) putting 3129942: ffff8103599cea98, count = 0
> 00002000:00080000:4:1212157576.885627:0:6378:0:(filter.c:2689:filter_destroy_precreated()) crew4-OST0001: after destroy: set last_objids[0] = 3129941
> 00002000:00000002:4:1212157576.885630:0:6378:0:(filter.c:607:filter_update_last_objid()) crew4-OST0001: server last_objid for group 0: 3129941
> 00002000:00000010:4:1212157576.912615:0:6485:0:(fsfilt-ldiskfs.c:747:fsfilt_ldiskfs_cb_func()) slab-freed 'fcb': 56 at ffff810371404920 (tot 49134335).
> 00010000:00000040:4:1212157576.912669:0:6378:0:(ldlm_lib.c:1556:target_committed_to_req()) last_committed 17896268, xid 3841
> 00000100:00000040:4:1212157576.912674:0:6378:0:(connection.c:191:ptlrpc_connection_addref()) connection=ffff8103fbe9e2c0 refcount 10 to [email protected]
> 00000100:00000040:4:1212157576.912678:0:6378:0:(niobuf.c:46:ptl_send_buf()) conn=ffff8103fbe9e2c0 id [EMAIL PROTECTED]
> 00000400:00000010:4:1212157576.912680:0:6378:0:(lib-lnet.h:247:lnet_md_alloc()) kmalloced 'md': 136 at ffff81040cb6cb80 (tot 9568949).
> 00000400:00000010:4:1212157576.912683:0:6378:0:(lib-lnet.h:295:lnet_msg_alloc()) kmalloced 'msg': 336 at ffff8104285e1e00 (tot 9569285).
> 00000100:00000040:4:1212157576.912693:0:6378:0:(connection.c:150:ptlrpc_put_connection()) connection=ffff8103fbe9e2c0 refcount 9 to [email protected]
> 00000100:00000040:4:1212157576.912695:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) RPC PUTting export ffff8103848e9000 : new rpc_count 0
> 00000100:00000040:4:1212157576.912697:0:6378:0:(service.c:648:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 4
> 00000100:00000040:4:1212157576.912699:0:6378:0:(service.c:652:ptlrpc_server_handle_request()) PUTting export ffff8103848e9000 : new refcount 3
> 00000400:00000010:4:1212157576.912741:0:5351:0:(lib-lnet.h:269:lnet_md_free()) kfreed 'md': 136 at ffff81040cb6cb80 (tot 9569149).
> 00000400:00000010:4:1212157576.912744:0:5351:0:(lib-lnet.h:312:lnet_msg_free()) kfreed 'msg': 336 at ffff8104285e1e00 (tot 9568813).
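
One detail worth calling out in the lock dump above: the extent `0 -> 18446744073709551615` covers the entire object, since that end offset is the maximum unsigned 64-bit value (all ones), which Lustre uses to mean "through EOF". A quick sanity check of the number:

```python
# The extent end offset from the lock dump is 2^64 - 1, i.e. the
# all-ones 64-bit value, so the PW lock spans the whole object.
end = 18446744073709551615
assert end == 2**64 - 1
print(hex(end))  # 0xffffffffffffffff
```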
> 
> 
> On Wed, May 28, 2008 at 8:03 PM, Aaron Knister <[EMAIL PROTECTED] 
> <mailto:[EMAIL PROTECTED]>> wrote:
> 
>     Thank you very much for looking into this. I've attached my dmesg to
>     the bug. I looked at line number 1334, which the panic seems to
>     reference, but I can't figure out what it's doing.
> 
>     On Wed, May 28, 2008 at 4:54 PM, Alex Zhuravlev
>     <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:
> 
>         Aaron Knister wrote:
> 
>             I'm seeing this bug (14465) under heavy load on my OSSes. If
>             I reboot the MDS it appears to help...any ideas? What's the
>             status on this bug?
> 
> 
>         Could you attach your dmesg to the bug? As for the status: I'm
>         still not able to reproduce this, nor have I found a possible
>         cause, sorry.
> 
>         thanks, Alex
> 
> 
> 

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
