Hello,

I am not sure how to debug this problem any further. Here is the situation.

Feb  6 15:58:00 nar2 kernel: LustreError: 2939:0:(client.c:442:ptlrpc_check_status()) @@@ type == PTL_RPC_MSG_ERR, err == -107 [EMAIL PROTECTED] x3573956/t0 o400->[EMAIL PROTECTED]:28 lens 64/64 ref 1 fl Rpc:RN/0/0 rc 0/-107
Feb  6 15:58:00 nar2 kernel: LustreError: Connection to service nar-sfs-ost103 via nid 0:3712500356 was lost; in progress operations using this service will wait for recovery to complete.
Feb  6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:139:ptlrpc_set_import_discon()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: connection lost to [EMAIL PROTECTED]
Feb  6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:288:import_select_connection()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: Using connectionNID_3712497273_UUID
Feb  6 15:58:00 nar2 kernel: Lustre: 2939:0:(import.c:288:import_select_connection()) skipped 2 similar messages (ending 354162.872 seconds ago)
Feb  6 15:58:11 nar2 kernel: LustreError: This client was evicted by nar-sfs-ost103; in progress operations using this service will be reattempted.
Feb  6 15:58:11 nar2 kernel: LustreError: 5652:0:(ldlm_resource.c:361:ldlm_namespace_cleanup()) Namespace OSC_nar2_nar-sfs-ost103_MNT_client_gm resource refcount 4 after lock cleanup
Feb  6 15:58:11 nar2 kernel: LustreError: 5646:0:(llite_mmap.c:208:ll_tree_unlock()) couldn't unlock -5
Feb  6 15:58:11 nar2 kernel: Lustre: Connection restored to service nar-sfs-ost103 using nid 0:3712500356.
Feb  6 15:58:11 nar2 kernel: Lustre: 5652:0:(import.c:692:ptlrpc_import_recovery_state_machine()) OSC_nar2_nar-sfs-ost103_MNT_client_gm: connection restored to [EMAIL PROTECTED]

[EMAIL PROTECTED] ~]# ps -lfu dgerbasi
F S UID        PID  PPID  C PRI  NI ADDR SZ WCHAN  STIME TTY          TIME CMD
5 S dgerbasi  5642  5639  0  76   0 -  8211 -      Feb06 ?        00:01:21 slurmd: [164600.1]
0 D dgerbasi  5643  5642  0  76   0 -  6473 lock_p Feb06 ?        00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta
0 D dgerbasi  5644  5642  0  76   0 -  6473 wait_o Feb06 ?        00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta
0 D dgerbasi  5645  5642  0  76   0 -  6473 lock_p Feb06 ?        00:00:00 /nar_sfs/dgerbasi/programs/siesta/water/./siesta


There is a file that seems to be "locked" on the problem node:
[EMAIL PROTECTED] water]# ls -al O.POT.CONF
-rw-r--r--  1 dgerbasi twoo 62275 Feb  6 15:58 O.POT.CONF

[EMAIL PROTECTED] water]# stat O.POT.CONF
  File: `O.POT.CONF'
  Size: 62275           Blocks: 128        IO Block: 2097152 regular file
Device: f908b518h/-116869864d   Inode: 1441916     Links: 1
Access: (0644/-rw-r--r--)  Uid: (130408/dgerbasi)   Gid: (130023/    twoo)
Access: 2007-02-06 15:57:05.507257800 -0500
Modify: 2007-02-06 15:58:11.711026503 -0500
Change: 2007-02-06 15:58:11.711026503 -0500

[EMAIL PROTECTED] water]# file O.POT.CONF
...hangs

On another node the file is readable, but it is an older version:
[EMAIL PROTECTED] water]# ls -al O.POT.CONF
-rw-r--r--  1 dgerbasi twoo 62275 Feb  6 15:57 O.POT.CONF

[EMAIL PROTECTED] water]# stat O.POT.CONF
  File: `O.POT.CONF'
  Size: 62275           Blocks: 128        IO Block: 2097152 regular file
Device: f908b518h/-116869864d   Inode: 1441916     Links: 1
Access: (0644/-rw-r--r--)  Uid: (130408/dgerbasi)   Gid: (130023/    twoo)
Access: 2007-02-07 10:50:21.299914088 -0500
Modify: 2007-02-06 15:57:05.000000000 -0500
Change: 2007-02-06 15:57:05.000000000 -0500

[EMAIL PROTECTED] water]# file O.POT.CONF
O.POT.CONF: ASCII text


So it seems the problem node (nar2) lost connectivity to the ost103 service, was evicted, and then reconnected ("lfs check servers" now shows it as active). It now appears to be waiting for some type of page lock to be released before it can write the updated file to disk ("lfs getstripe O.POT.CONF" confirms that the file is on ost103).
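
To see what the stuck processes are waiting on without relying on the truncated WCHAN column from ps, something like the following might help. It is only a rough sketch, assuming a Linux /proc that exposes /proc/<pid>/stat and /proc/<pid>/wchan and a Python 3 interpreter on the node; the helper names parse_stat and blocked_processes are made up for this example.

```python
import os

def parse_stat(text):
    """Parse the contents of /proc/<pid>/stat into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of splitting on whitespace.
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state

def blocked_processes(proc_root="/proc"):
    """Yield (pid, comm, wchan) for processes in uninterruptible sleep (state D)."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError):
            continue  # process exited, or its stat file could not be parsed
        if state != "D":
            continue
        try:
            with open(os.path.join(proc_root, entry, "wchan")) as f:
                wchan = f.read().strip() or "-"
        except OSError:
            wchan = "?"  # wchan not readable (permissions, or not provided)
        yield pid, comm, wchan
```

Looping over blocked_processes() as root on the affected node and printing each tuple should list the hung siesta processes together with the full name of the kernel function they are sleeping in, if the kernel exposes it.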

Any advice or pointers in the right direction would be appreciated. We see this pop up every now and then on our nodes, with a variety of user codes.
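
Since this recurs, one way to see how often it happens and which OSTs are involved would be to tally the eviction messages from each client's syslog. The following is a minimal sketch, assuming syslog writes each message on a single line and that the wording matches the "This client was evicted by" message shown in the log excerpt (other Lustre versions may word it differently); the names EVICTED, find_evictions and count_by_service are hypothetical.

```python
import re
from collections import Counter

# Timestamp, host, then the client-side eviction message and the service name.
# The message text is taken from the log excerpt in this post; it is an
# assumption that other nodes and versions log it the same way.
EVICTED = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) (\S+) kernel: LustreError: "
    r"This client was evicted by\s+(\S+?);"
)

def find_evictions(lines):
    """Return a list of (timestamp, host, service) for each eviction line."""
    found = []
    for line in lines:
        match = EVICTED.match(line)
        if match:
            found.append(match.groups())
    return found

def count_by_service(lines):
    """Count evictions per service (for example per OST) across the given lines."""
    return Counter(service for _, _, service in find_evictions(lines))
```

Feeding it the lines of the system log from each client (the log location, for example /var/log/messages, is an assumption about the node setup) would show whether the evictions cluster on particular OSTs or at particular times.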

thanks
-k

This is an HP SFS system (based on Lustre 1.4.2).
client version:
[EMAIL PROTECTED] water]# cat /proc/fs/lustre/version
1.4.2-20051219152732-CHANGED-.bld.PER_RC3_2.xc.src.lustre........obj.x86_64.kernel-2.6.9.linux-2.6.9-2.6.9-22.7hp.XCsmp

server version:
V2.1-0 (build nlH8hp, 2005-12-22) (filesystem 1.4.2)
