Hello,

All servers and clients run Lustre 1.8 on SLES 10 SP2. The clients
use patchless kernels built from the same base revision as the
patched kernels on the servers.
We repeatedly encounter this error:

Server log :
------------
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:  
22061:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino  
5606195: cookie 0x5ed7d8c3d1299f40  r...@ffff810065a60400  
x1308791892785337/t0  
o35->4f104403-eb03-83be-2910-2fd7cc260...@net_0x20000c0a84410_uuid:0/0  
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc 0/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:  
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error  
(-116)  r...@ffff810065a60400 x1308791892785337/t0  
o35->4f104403-eb03-83be-2910-2fd7cc260...@net_0x20000c0a84410_uuid:0/0  
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc -116/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:  
22061:0:(mds_open.c:1665:mds_close()) @@@ no handle for file close ino  
5606200: cookie 0x5ed7d8c3d129a361  r...@ffff810071b28400  
x1308791892785342/t0  
o35->4f104403-eb03-83be-2910-2fd7cc260...@net_0x20000c0a84410_uuid:0/0  
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc 0/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:  
22061:0:(mds_open.c:1665:mds_close()) Skipped 4 previous similar  
messages
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:  
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) @@@ processing error  
(-116)  r...@ffff810071b28400 x1308791892785342/t0  
o35->4f104403-eb03-83be-2910-2fd7cc260...@net_0x20000c0a84410_uuid:0/0  
lens 408/864 e 0 to 0 dl 1248927113 ref 1 fl Interpret:/0/0 rc -116/0
Jul 30 06:11:47 BEESPBESXFIL27 kernel: LustreError:  
22061:0:(ldlm_lib.c:1826:target_send_reply_msg()) Skipped 4 previous  
similar messages


Client log:
-----------
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 11-0: an error  
occurred while communicating with 172.16.0...@tcp. The mds_close  
operation failed with -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606195 mdc  
close failed: rc = -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(file.c:114:ll_close_inode_openhandle()) Skipped 1 previous  
similar message
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606155 mdc  
close failed: rc = -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(file.c:114:ll_close_inode_openhandle()) Skipped 3 previous  
similar messages
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: 11-0: an error  
occurred while communicating with 172.16.0...@tcp. The mds_close  
operation failed with -116
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: Skipped 7 previous  
similar messages
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock())  
ASSERTION(lock->l_writers > 0) failed
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError:  
13298:0:(ldlm_lock.c:602:ldlm_lock_decref_internal_nolock()) LBUG
Jul 30 06:11:47 BEESPDESXAPP06 kernel:
Jul 30 06:11:47 BEESPDESXAPP06 kernel: Call Trace:  
<ffffffff88257aea>{:libcfs:lbug_with_loc+122}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8825fe00>{:libcfs:tracefile_init+0}  
<ffffffff8835d566>{:ptlrpc:ldlm_lock_decref_internal_nolock+182}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8838533b>{:ptlrpc:ldlm_process_flock_lock+4139}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff883864ef>{:ptlrpc:ldlm_flock_completion_ast+2111}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8835f4a9>{:ptlrpc:ldlm_lock_enqueue+2169}  
<ffffffff88377ca0>{:ptlrpc:ldlm_cli_enqueue_fini+2624}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff88376fd3>{:ptlrpc:ldlm_prep_elc_req+755}  
<ffffffff8835bc0d>{:ptlrpc:ldlm_lock_create+2541}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8012c668>{default_wake_function+0}  
<ffffffff88379ae2>{:ptlrpc:ldlm_cli_enqueue+1666}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff88523fcf>{:lustre:ll_file_flock+1407}  
<ffffffff88385cb0>{:ptlrpc:ldlm_flock_completion_ast+0}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8019ae2e>{locks_remove_posix+132}  
<ffffffff80147fdc>{bit_waitqueue+56}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff80190241>{flush_old_exec+2729} <ffffffff80186fc1>{__fput+355}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8018455b>{filp_close+84}  
<ffffffff801360b7>{put_files_struct+107}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8010aecb>{sysret_signal+28} <ffffffff8013725c>{do_exit+684}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff80137995>{sys_exit_group+0}  
<ffffffff8014083c>{get_signal_to_deliver+1394}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8010aecb>{sysret_signal+28} <ffffffff8010a19c>{do_signal+118}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8012c668>{default_wake_function+0}  
<ffffffff8014b227>{do_futex+104}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff801743b2>{sys_mprotect+1742}  
<ffffffff8010aecb>{sysret_signal+28}
Jul 30 06:11:47 BEESPDESXAPP06 kernel:         
<ffffffff8010b14f>{ptregscall_common+103}
Jul 30 06:11:47 BEESPDESXAPP06 kernel: LustreError: dumping log to  
/tmp/lustre-log.1248927107.13298
Jul 30 06:11:47 BEESPDESXAPP06 kernel: Fixing recursive fault but  
reboot is needed!

After this, the client does indeed have to be rebooted. What does this
error mean? Could it be related to sys.timeouts and/or ldlm_timeouts
being too short?
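
As an aid to reading the logs above: on Linux, errno 116 is ESTALE (stale file handle), so "rc = -116" on the client is consistent with the "no handle for file close" messages on the MDS. Below is a minimal sketch, assuming Python on a Linux host, that pulls the failed-close inodes out of a client log and decodes the return code. It is not part of Lustre; the helper names failed_closes and describe_rc are hypothetical, and the sample text is copied from the client log above.

```python
import errno
import os
import re

# Matches client lines such as "... inode 5606195 mdc close failed: rc = -116"
CLOSE_FAILED = re.compile(r"inode (\d+) mdc close failed: rc = (-?\d+)")


def failed_closes(log_text):
    """Return (inode, rc) pairs for every 'mdc close failed' message."""
    # Messages may be wrapped across lines (as in this email), so
    # collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    return [(int(ino), int(rc)) for ino, rc in CLOSE_FAILED.findall(flat)]


def describe_rc(rc):
    """Map a negative kernel return code to its errno name and message."""
    code = abs(rc)
    # errno.errorcode is platform-specific; fall back to "unknown".
    return errno.errorcode.get(code, "unknown"), os.strerror(code)


sample = """LustreError: 13298:0:(file.c:114:ll_close_inode_openhandle()) inode 5606195 mdc
close failed: rc = -116"""

for ino, rc in failed_closes(sample):
    name, message = describe_rc(rc)
    print(ino, rc, name, message)
```

The errno name and message come from the local C library, so the exact wording printed may differ between systems.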


Regards,


Guillaume Demillecamps
