Hi,

I'm seeing a strange issue and would like to get closer to understanding it.
With Lustre 1.6.2 over o2ib, some cluster nodes had processes hanging on
Lustre I/O, so I rebooted them. No LBUGs were seen, only RDMA failures. Only
the client nodes were rebooted.

After re-mounting the Lustre filesystem, "ls" hangs (traceback is below). But
as soon as the filesystem is unmounted with "umount -f", the hanging "ls"
returns the correct output.
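
For clarity, the sequence on an affected client looks roughly like this (the
MGS nid, filesystem name and mount point below are just placeholders for our
real ones):

   mount -t lustre mgs@o2ib:/testfs /mnt/lustre   # remount succeeds
   ls /mnt/lustre &                               # hangs, see traceback below
   umount -f /mnt/lustre                          # force the unmount
   # only now does the backgrounded "ls" print the correct directory listing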

Any idea what could be wrong? I noticed that on the affected clients
   cat /proc/fs/lustre/ldlm/namespaces/*/lock_count
shows something very different from the output on the "good" clients: only
the MGC* lock_count is 1, all the others are zero. Is there a way to fix this?
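
For reference, this is the small loop I'm using to compare the per-namespace
lock counts between a "good" and an affected client:

   for ns in /proc/fs/lustre/ldlm/namespaces/*; do
       echo "$(basename $ns): $(cat $ns/lock_count)"
   done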

Best regards,
Erich


===== traceback of hanging "ls" command =====================================
ls            S 00000000ffffc29c     0  4296   4092                     (NOTLB)
000001007c049958 0000000000000002 000001007e2f3030 ffffffff00000074
       00000101376bc808 0000000039288440 0000010080051000 00000001a02f67d3
       000001007e004800 000000000000205a
Call Trace:
       <ffffffff8013f4a4>{__mod_timer+293}
       <ffffffff80320c33>{schedule_timeout+367}
       <ffffffff8013fed4>{process_timeout+0}
       <ffffffffa030b474>{:ptlrpc:ptlrpc_set_wait+932}
       <ffffffff801335c2>{default_wake_function+0}
       <ffffffffa03096b0>{:ptlrpc:ptlrpc_expired_set+0}
       <ffffffffa0307710>{:ptlrpc:ptlrpc_interrupted_set+0}
       <ffffffffa03096b0>{:ptlrpc:ptlrpc_expired_set+0}
       <ffffffffa0307710>{:ptlrpc:ptlrpc_interrupted_set+0}
       <ffffffffa042637d>{:lustre:ll_glimpse_size+1613}
       <ffffffffa02e0e6a>{:ptlrpc:__ldlm_handle2lock+794}
       <ffffffffa02dc035>{:ptlrpc:lock_res_and_lock+53}
       <ffffffffa02dc035>{:ptlrpc:lock_res_and_lock+53}
       <ffffffffa02dc06f>{:ptlrpc:unlock_res_and_lock+31}
       <ffffffffa02e0aaa>{:ptlrpc:ldlm_lock_decref_internal+746}
       <ffffffffa0424b40>{:lustre:ll_extent_lock_callback+0}
       <ffffffffa02f2960>{:ptlrpc:ldlm_completion_ast+0}
       <ffffffffa0424ee0>{:lustre:ll_glimpse_callback+0}
       <ffffffffa0416c4f>{:lustre:ll_intent_drop_lock+143}
       <ffffffffa0431318>{:lustre:ll_inode_revalidate_it+1528}
       <ffffffffa044f360>{:lustre:ll_mdc_blocking_ast+0}
       <ffffffff8018e106>{dput+55}
       <ffffffff801859eb>{__link_path_walk+3928}
       <ffffffff80185b75>{link_path_walk+179}
       <ffffffffa04313c4>{:lustre:ll_getattr_it+36}
       <ffffffffa04314f5>{:lustre:ll_getattr+53}
       <ffffffff80180257>{vfs_getattr64_it+146}
       <ffffffff80180532>{vfs_lstat64+100}
       <ffffffff8016838d>{handle_mm_fault+354}
       <ffffffff801ea149>{__up_read+16}
       <ffffffff80123991>{do_page_fault+577}
       <ffffffffa0416c70>{:lustre:ll_intent_release+0}
       <ffffffff80180891>{sys_newlstat+17}
       <ffffffff8018a01c>{vfs_readdir+176}
       <ffffffff8018a428>{sys_getdents64+166}
       <ffffffff80110c2d>{error_exit+0}
       <ffffffff8011022a>{system_call+126}

