This is the strangest problem I have seen. I have a Lustre filesystem mounted 
on a Linux server, and it's being exported over NFS to various Alpha systems. The 
Alphas mount it just fine; however, under heavy load the NFS server stops responding, 
as does the Lustre mount on the export server. The weird thing is that if I mount 
the NFS export on another NFS server and run the same benchmark (bonnie), 
everything is fine. The Lustre mount on the export server can take a real 
pounding (I've seen it push 300 MB/sec), so I don't know why NFS is crashing it.
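
In case it helps, the setup is essentially the following (the MGS NID, export 
options, and exact bonnie invocation below are illustrative placeholders rather 
than a copy of the real config):

  # On the export server (cpu3): mount the Lustre filesystem over o2ib,
  # then export that mount point via NFS.
  mount -t lustre <mgs-nid>@o2ib:/data /mnt/data
  # /etc/exports
  /mnt/data  *(rw,sync,no_root_squash)

  # On an alpha client: ordinary NFSv3 mount, then run bonnie against it.
  mount -t nfs cpu3:/mnt/data /mnt/data
  bonnie -d /mnt/data -s 8192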

On the NFS export server I see these messages--


Lustre: 4224:0:(o2iblnd_cb.c:412:kiblnd_handle_rx()) PUT_NACK from [EMAIL PROTECTED]
LustreError: 4400:0:(client.c:969:ptlrpc_expire_one_request()) @@@ timeout (sent at 1197415542, 100s ago)  [EMAIL PROTECTED] x38827/t0 o36->[EMAIL PROTECTED]@o2ib:12 lens 14256/672 ref 1 fl Rpc:/0/0 rc 0/-22
Lustre: data-MDT0000-mdc-ffff81082d702000: Connection to service data-MDT0000 via nid [EMAIL PROTECTED] was lost; in progress operations using this service will wait for recovery to complete.

A trace of the hung nfs daemons reveals the following--

Dec 11 18:46:33 cpu3 kernel: nfsd          S ffff8108246ff008     0  4729      1          4730  4728 (L-TLB)
Dec 11 18:46:33 cpu3 kernel:  ffff81082be0daa0 0000000000000046 ffff810824710740 000064b0886cfdc4
Dec 11 18:46:33 cpu3 kernel:  0000000000000009 ffff81082fc6f7e0 ffffffff802dcae0 000000814fbeae1f
Dec 11 18:46:33 cpu3 kernel:  0000000003d51554 ffff81082fc6f9c8 0000000000000000 ffff8108246ff000
Dec 11 18:46:33 cpu3 kernel: Call Trace:
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80061839>] schedule_timeout+0x8a/0xad
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80092b26>] process_timeout+0x0/0x5
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff88700a3d>] :ptlrpc:ptlrpc_queue_wait+0xa9d/0x1250
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff886d67a1>] :ptlrpc:ldlm_resource_putref+0x331/0x3b0
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8870a2c5>] :ptlrpc:lustre_msg_set_flags+0x45/0x120
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff800884f8>] default_wake_function+0x0/0xe
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff887a37d0>] :mdc:mdc_reint+0xc0/0x240
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff887a5c77>] :mdc:mdc_unlink_pack+0x117/0x140
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff887a4ab7>] :mdc:mdc_unlink+0x307/0x3d0
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff801405f7>] __next_cpu+0x19/0x28
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80087090>] find_busiest_group+0x20d/0x621
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80009499>] __d_lookup+0xb0/0xff
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8886ced6>] :lustre:ll_unlink+0x1d6/0x370
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8883b791>] :lustre:ll_inode_permission+0xa1/0xc0
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff80047fc8>] vfs_unlink+0xc2/0x108
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8857c57a>] :nfsd:nfsd_unlink+0x1de/0x24b
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff88583e9a>] :nfsd:nfsd3_proc_remove+0xa8/0xb5
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff885791c4>] :nfsd:nfsd_dispatch+0xd7/0x198
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff88488514>] :sunrpc:svc_process+0x44d/0x70b
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff800625bf>] __down_read+0x12/0x92
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff885796fb>] :nfsd:nfsd+0x1ae/0x2db
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8005bfb1>] child_rip+0xa/0x11
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8857954d>] :nfsd:nfsd+0x0/0x2db
Dec 11 18:46:33 cpu3 kernel:  [<ffffffff8005bfa7>] child_rip+0x0/0x11
