I know the subject line isn't the best, but I don't know what to say other than that one Lustre client is acting up while the others are fine. This client is our 'file' server: it runs an NFS and a Samba server on top of the Lustre mount.
/etc/fstab:

  192.168.5.104@tcp0:192.168.5.105@tcp0:/lustre /lustre lustre defaults,localflock,_netdev 0 0

Right now lfs df -h shows all the OSS as "resource unavailable", yet lctl dl says they are up.

lctl dl:

  0 UP mgc MGC192.168.5.104@tcp adc80ed6-e9a1-6791-e3aa-9a699e11275d 5
  1 UP lov lustre-clilov-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 4
  2 UP mdc lustre-MDT0000-mdc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  3 UP osc lustre-OST0000-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  4 UP osc lustre-OST0001-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  5 UP osc lustre-OST0002-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  6 UP osc lustre-OST0003-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5

On the cluster, all nodes are connected just fine, so it seems to be just this client.

This is what I'm seeing from dmesg.

A lot of these messages:

  LustreError: 4462:0:(llite_nfs.c:96:search_inode_for_lustre()) failure -2 inode 560441703

Then these messages when the 'disconnect' happens:

  Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1353138276259258 sent from lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 7s ago has timed out (7s prior to deadline).
    req@ffff8101f18cd000 x1353138276259258/t0 o101->[email protected]@tcp:28/4 lens 296/544 e 0 to 1 dl 1297448442 ref 1 fl Rpc:/0/0 rc 0/0
  Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress operations using this service will wait for recovery to complete.
  Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection restored to service lustre-OST0003 using nid 192.168.5.101@tcp.
  Lustre: 24416:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1353138276259591 sent from lustre-OST0002-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 8s ago has timed out (7s prior to deadline).
    req@ffff810292c74c00 x1353138276259591/t0 o101->[email protected]@tcp:28/4 lens 296/544 e 0 to 1 dl 1297448442 ref 1 fl Rpc:/0/0 rc 0/0
  Lustre: 24416:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1 previous similar message
  Lustre: lustre-OST0002-osc-ffff81032f9a0400: Connection to service lustre-OST0002 via nid 192.168.5.101@tcp was lost; in progress operations using this service will wait for recovery to complete.
  Lustre: Skipped 1 previous similar message
  Lustre: lustre-OST0002-osc-ffff81032f9a0400: Connection restored to service lustre-OST0002 using nid 192.168.5.101@tcp.
  Lustre: Skipped 1 previous similar message
  Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1353138276259258 sent from lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 7s ago has timed out (7s prior to deadline).
    req@ffff8101f18cd000 x1353138276259258/t0 o101->[email protected]@tcp:28/4 lens 296/544 e 0 to 1 dl 1297448449 ref 1 fl Rpc:/2/0 rc 0/0
  Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress operations using this service will wait for recovery to complete.
  Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection restored to service lustre-OST0003 using nid 192.168.5.101@tcp.
  Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1353138276318758 sent from lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 0s ago has failed due to network error (7s prior to deadline).
    req@ffff810321140800 x1353138276318758/t0 o101->[email protected]@tcp:28/4 lens 296/544 e 0 to 1 dl 1297448467 ref 1 fl Rpc:/0/0 rc 0/0
  Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress operations using this service will wait for recovery to complete.
  LustreError: 3897:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.5.100
  LustreError: 11b-b: Connection to 192.168.5.100@tcp at host 192.168.5.100 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.5.100@tcp one of its NIDs?
  Lustre: 3904:0:(import.c:517:import_select_connection()) lustre-OST0002-osc-ffff81032f9a0400: tried all connections, increasing latency to 2s
  Lustre: 3903:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request x1353138276318825 sent from lustre-OST0000-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 0s ago has failed due to network error (6s prior to deadline).
    req@ffff8102e8277000 x1353138276318825/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1297448473 ref 1 fl Rpc:N/0/0 rc 0/0
  Lustre: 3903:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 7 previous similar messages
  LustreError: 3899:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error -104 reading HELLO from 192.168.5.101
  LustreError: 3899:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped 1 previous similar message
  LustreError: 11b-b: Connection to 192.168.5.101@tcp at host 192.168.5.101 on port 988 was reset: is it running a compatible version of Lustre and is 192.168.5.101@tcp one of its NIDs?

This sequence now just repeats. How can I get this client reconnected?

-- 
Personally, I liked the university. They gave us money and facilities, we didn't have to produce anything! You've never been out of college! You don't know what it's like out there! I've worked in the private sector. They expect results. -Ray Ghostbusters

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
