I know the subject line isn't the best, but I don't know what to say
other than that one Lustre client is acting up while the others are fine.
This client is our 'file' server. It runs an NFS server and a Samba
server on top of the Lustre mount.

/etc/fstab
92.168.5.104@tcp0:192.168.5.105@tcp0:/lustre   /lustre lustre defaults,localflock,_netdev 0 0

Right now lfs df -h shows all the OSTs as resource unavailable, yet
lctl dl says they are up.
lctl dl
  0 UP mgc MGC192.168.5.104@tcp adc80ed6-e9a1-6791-e3aa-9a699e11275d 5
  1 UP lov lustre-clilov-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 4
  2 UP mdc lustre-MDT0000-mdc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  3 UP osc lustre-OST0000-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  4 UP osc lustre-OST0001-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  5 UP osc lustre-OST0002-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  6 UP osc lustre-OST0003-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
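
In case it helps anyone reading along, here is a rough Python sketch that
tallies a device list like the one above by type and state. It assumes the
six-column layout seen in this output (index, state, type, name, UUID,
refcount). The function names are made up for illustration and are not part
of any Lustre tool. As far as I understand, the UP state describes the
local device and may not say whether the connection to the server is
alive, so this is only a bookkeeping aid.

```python
from collections import Counter

# Sample copied from the `lctl dl` output in this post.
SAMPLE = """\
  0 UP mgc MGC192.168.5.104@tcp adc80ed6-e9a1-6791-e3aa-9a699e11275d 5
  1 UP lov lustre-clilov-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 4
  2 UP mdc lustre-MDT0000-mdc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  3 UP osc lustre-OST0000-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  4 UP osc lustre-OST0001-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  5 UP osc lustre-OST0002-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
  6 UP osc lustre-OST0003-osc-ffff81032f9a0400 db1e9918-482f-063d-1b42-c2c394a4c81b 5
"""


def parse_device_list(text):
    """Parse device-list text into dicts (hypothetical helper).

    Assumes six whitespace-separated columns per line; lines that do not
    match that shape are skipped rather than guessed at.
    """
    devices = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or not fields[0].isdigit():
            continue
        index, state, dev_type, name, uuid, refcount = fields
        devices.append({
            "index": int(index),
            "state": state,
            "type": dev_type,
            "name": name,
            "uuid": uuid,
            "refcount": int(refcount),
        })
    return devices


def summarize(devices):
    """Count devices per (type, state) pair."""
    return Counter((d["type"], d["state"]) for d in devices)


if __name__ == "__main__":
    for (dev_type, state), count in sorted(summarize(parse_device_list(SAMPLE)).items()):
        print(f"{dev_type:4s} {state:4s} {count}")
```

To use it on a live client, the text could be captured from lctl dl and
passed to parse_device_list() in place of SAMPLE.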

On the cluster, all nodes are connected just fine, so it seems to just
be this client.  This is what I'm seeing from dmesg:

A lot of these messages:
LustreError: 4462:0:(llite_nfs.c:96:search_inode_for_lustre()) failure
-2 inode 560441703

Then these messages when the 'disconnect' happens:

Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1353138276259258 sent from
lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 7s ago
has timed out (7s prior to deadline).
  req@ffff8101f18cd000 x1353138276259258/t0
o101->[email protected]@tcp:28/4 lens 296/544 e 0 to 1
dl 1297448442 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service
lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection restored to
service lustre-OST0003 using nid 192.168.5.101@tcp.
Lustre: 24416:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1353138276259591 sent from
lustre-OST0002-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 8s ago
has timed out (7s prior to deadline).
  req@ffff810292c74c00 x1353138276259591/t0
o101->[email protected]@tcp:28/4 lens 296/544 e 0 to 1
dl 1297448442 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: 24416:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 1
previous similar message
Lustre: lustre-OST0002-osc-ffff81032f9a0400: Connection to service
lustre-OST0002 via nid 192.168.5.101@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: Skipped 1 previous similar message
Lustre: lustre-OST0002-osc-ffff81032f9a0400: Connection restored to
service lustre-OST0002 using nid 192.168.5.101@tcp.
Lustre: Skipped 1 previous similar message
Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1353138276259258 sent from
lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 7s ago
has timed out (7s prior to deadline).
  req@ffff8101f18cd000 x1353138276259258/t0
o101->[email protected]@tcp:28/4 lens 296/544 e 0 to 1
dl 1297448449 ref 1 fl Rpc:/2/0 rc 0/0
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service
lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress
operations using this service will wait for recovery to complete.
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection restored to
service lustre-OST0003 using nid 192.168.5.101@tcp.
Lustre: 13877:0:(client.c:1476:ptlrpc_expire_one_request()) @@@
Request x1353138276318758 sent from
lustre-OST0003-osc-ffff81032f9a0400 to NID 192.168.5.101@tcp 0s ago
has failed due to network error (7s prior to deadline).
  req@ffff810321140800 x1353138276318758/t0
o101->[email protected]@tcp:28/4 lens 296/544 e 0 to 1
dl 1297448467 ref 1 fl Rpc:/0/0 rc 0/0
Lustre: lustre-OST0003-osc-ffff81032f9a0400: Connection to service
lustre-OST0003 via nid 192.168.5.101@tcp was lost; in progress
operations using this service will wait for recovery to complete.
LustreError: 3897:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error
-104 reading HELLO from 192.168.5.100
LustreError: 11b-b: Connection to 192.168.5.100@tcp at host
192.168.5.100 on port 988 was reset: is it running a compatible
version of Lustre and is 192.168.5.100@tcp one of its NIDs?
Lustre: 3904:0:(import.c:517:import_select_connection())
lustre-OST0002-osc-ffff81032f9a0400: tried all connections, increasing
latency to 2s
Lustre: 3903:0:(client.c:1476:ptlrpc_expire_one_request()) @@@ Request
x1353138276318825 sent from lustre-OST0000-osc-ffff81032f9a0400 to NID
192.168.5.101@tcp 0s ago has failed due to network error (6s prior to
deadline).
  req@ffff8102e8277000 x1353138276318825/t0
o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1
dl 1297448473 ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 3903:0:(client.c:1476:ptlrpc_expire_one_request()) Skipped 7
previous similar messages
LustreError: 3899:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Error
-104 reading HELLO from 192.168.5.101
LustreError: 3899:0:(socklnd_cb.c:1714:ksocknal_recv_hello()) Skipped
1 previous similar message
LustreError: 11b-b: Connection to 192.168.5.101@tcp at host
192.168.5.101 on port 988 was reset: is it running a compatible
version of Lustre and is 192.168.5.101@tcp one of its NIDs?

This sequence now just repeats. How can I get this client reconnected?

-- 
Personally, I liked the university. They gave us money and facilities,
we didn't have to produce anything! You've never been out of college!
You don't know what it's like out there! I've worked in the private
sector. They expect results. -Ray Ghostbusters