Hi all,
we're using Lustre 1.8.3 on a gigabit network. We have 4 OSTs on 2 OSSs
and a single MGS. We are serving the users' home directories (mostly
small files) to 64 clients on this network. The cluster has now gone
down for the third time: either the OSSs or the MGS blocks, and nobody
can access the Lustre share anymore.
Looking at the logs, I see lots of connectivity errors:
LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) MGS handle
cmd=250 rc=-16
LustreError: 17792:0:(mgs_handler.c:641:mgs_handle()) Skipped 3
previous similar messages
LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@
processing error (-16) r...@f5ae642c x1344336331331741/t0
o250->6e1c6cb5-564f-49b0-a01e-e7e460542...@net_0x200000affffc1_uuid:0/0 lens
368/264 e 0 to 0 dl 1288258643 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 17792:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped
3 previous similar messages
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1348047450260821 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0...@lo 30s ago
has timed out (30s prior to deadline).
r...@e66d1e00 x1348047450260821/t0
o38->[email protected]@tcp:12/10 lens 368/584 e 0 to 1 dl 1288258632
ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19
previous similar messages
Lustre: MGS: haven't heard from client
03b9cdae-66f1-552b-8c7c-94a9499c8dcf (at 10.255.255....@tcp) in 228 seconds. I
think it's dead, and I am evicting it.
LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading
connection request from 10.255.255.199
Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS:
fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
Lustre: 2936:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 49
previous similar messages
LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error
-104 reading HELLO from 10.255.255.199
LustreError: 2888:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 2
previous similar messages
LustreError: 11b-b: Connection to 10.255.255....@tcp at host
10.255.255.199 on port 988 was reset: is it running a compatible version of
Lustre and is 10.255.255....@tcp one of its NIDs?
Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp
eb3ff200 already connecting
Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse
reconnection from [email protected]@tcp to
0xeb3ff200; still busy with 2 active RPCs
Lustre: 17792:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1
previous similar message
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1348047450261066 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0...@lo 30s ago
has timed out (30s prior to deadline).
r...@ce186400 x1348047450261066/t0
o38->[email protected]@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259252
ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19
previous similar messages
Lustre: There was an unexpected network error while writing to
10.255.255.221: -110.
Lustre: 20268:0:(ldlm_lib.c:804:target_handle_connect()) MGS: exp
d8637600 already connecting
Lustre: 19656:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse
reconnection from [email protected]@tcp to
0xd8637600; still busy with 1 active RPCs
LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) MGS handle
cmd=250 rc=-16
LustreError: 19656:0:(mgs_handler.c:641:mgs_handle()) Skipped 3
previous similar messages
LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@
processing error (-16) r...@f5ae6c2c x1344329600678410/t0
o250->7ee2fe58-3fab-c39a-8adb-c356d1bdc...@net_0x200000affffd1_uuid:0/0 lens
368/264 e 0 to 0 dl 1288259427 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 19656:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped
3 previous similar messages
Lustre: There was an unexpected network error while writing to
10.255.255.209: -110.
Lustre: 2935:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS:
fa602b20-b24c-bbcd-7003-b3b9bf702db4 reconnecting
Lustre: 2935:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 55
previous similar messages
LustreError: 2890:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -11
reading HELLO from 10.255.255.191
LustreError: 2890:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 1
previous similar message
LustreError: 2893:0:(acceptor.c:455:lnet_acceptor()) Error -11 reading
connection request from 10.255.255.220
eth0: too many iterations (16) in nv_nic_irq.
Lustre: MGS: haven't heard from client
ed155959-9a71-653f-7d3b-0f4982ed020c (at 10.255.255....@tcp) in 222 seconds. I
think it's dead, and I am evicting it.
Lustre: Skipped 3 previous similar messages
Lustre: There was an unexpected network error while writing to
10.255.255.215: -110.
Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse
reconnection from [email protected]@tcp to
0xee70bc00; still busy with 1 active RPCs
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1348047450261319 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0...@lo 30s ago
has timed out (30s prior to deadline).
r...@ec7ffc00 x1348047450261319/t0
o38->[email protected]@tcp:12/10 lens 368/584 e 0 to 1 dl 1288259872
ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19
previous similar messages
Lustre: 20268:0:(ldlm_lib.c:575:target_handle_reconnect()) MGS:
ed155959-9a71-653f-7d3b-0f4982ed020c reconnecting
Lustre: 20268:0:(ldlm_lib.c:575:target_handle_reconnect()) Skipped 53
previous similar messages
LustreError: 2889:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Error -11
reading HELLO from 10.255.255.217
LustreError: 2889:0:(socklnd_cb.c:1707:ksocknal_recv_hello()) Skipped 1
previous similar message
Lustre: MGS: haven't heard from client
5ccc9289-412d-1e96-a777-f5c3dbcc3899 (at 10.255.255....@tcp) in 227 seconds. I
think it's dead, and I am evicting it.
Lustre: Skipped 3 previous similar messages
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request
x1348047450261564 sent from lustre-MDT0000-mdc-f3aa2e00 to NID 0...@lo 30s ago
has timed out (30s prior to deadline).
r...@eb72ba00 x1348047450261564/t0
o38->[email protected]@tcp:12/10 lens 368/584 e 0 to 1 dl 1288260492
ref 1 fl Rpc:N/0/0 rc 0/0
Lustre: 2895:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 19
previous similar messages
LustreError: 14286:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue:
-4
LustreError: 14286:0:(file.c:3256:ll_inode_revalidate_fini()) failure
-4 inode 38076417
LustreError: 14286:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue:
-4
LustreError: 14286:0:(file.c:3256:ll_inode_revalidate_fini()) failure
-4 inode 38076417
LustreError: 21507:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue:
-4
LustreError: 21507:0:(file.c:3256:ll_inode_revalidate_fini()) failure
-4 inode 38076417
Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) MGS: refuse
reconnection from [email protected]@tcp to
0xd1681e00; still busy with 1 active RPCs
Lustre: 2935:0:(ldlm_lib.c:875:target_handle_connect()) Skipped 1
previous similar message
LustreError: 2935:0:(mgs_handler.c:641:mgs_handle()) MGS handle cmd=250
rc=-16
LustreError: 2935:0:(mgs_handler.c:641:mgs_handle()) Skipped 2 previous
similar messages
LustreError: 2935:0:(ldlm_lib.c:1892:target_send_reply_msg()) @@@
processing error (-16) r...@f55c402c x1344329607901556/t0
o250->ed155959-9a71-653f-7d3b-0f4982ed0...@net_0x200000affffd4_uuid:0/0 lens
368/264 e 0 to 0 dl 1288260660 ref 1 fl Interpret:/0/0 rc -16/0
LustreError: 2935:0:(ldlm_lib.c:1892:target_send_reply_msg()) Skipped 2
previous similar messages
LustreError: 21584:0:(mdc_locks.c:625:mdc_enqueue()) ldlm_cli_enqueue:
-4
LustreError: 21584:0:(file.c:3256:ll_inode_revalidate_fini()) failure
-4 inode 38076417
So this seems to be an issue with our network that appears only under
certain conditions (e.g. high load). Another hint pointing in this
direction is the high number of RX errors I can see on the ports of the
connecting switch, and the fact that some of the nodes are resetting
their links. In normal operation, I don't see any of these behaviors.
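In case it's useful, below is a rough sketch of how the node-side RX
counters can be read for comparison with the switch-port statistics
(Python, parsing the standard /proc/net/dev; Linux-only and purely
illustrative, nothing Lustre-specific about it):

    #!/usr/bin/env python
    # Rough sketch: print per-interface RX error counters from /proc/net/dev
    # so they can be compared against the switch-port statistics.
    # Linux-only; run it on each node (e.g. in a loop over ssh).

    def read_rx_counters(path='/proc/net/dev'):
        counters = {}
        with open(path) as f:
            lines = f.readlines()[2:]  # skip the two header lines
        for line in lines:
            iface, data = line.split(':', 1)
            fields = data.split()
            # RX columns: bytes packets errs drop fifo frame compressed multicast
            counters[iface.strip()] = (int(fields[2]), int(fields[3]),
                                       int(fields[4]), int(fields[5]))
        return counters

    if __name__ == '__main__':
        for iface, (errs, drop, fifo, frame) in sorted(read_rx_counters().items()):
            print('%-8s errs=%d drop=%d fifo=%d frame=%d'
                  % (iface, errs, drop, fifo, frame))

A sudden jump in errs/frame on a node that matches the RX errors on its
switch port would at least tell us which side is dropping the frames.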
I suspect that something goes wrong while communicating over the
network, the MGS starts to evict clients, and afterwards the whole
system spirals down in a storm of reconnection requests. At least it
didn't get better when I reset the switch, so it's not the switch
alone. I only managed to fix it by switching off half of the nodes...
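To check this eviction-storm theory, I put together something like the
sketch below (purely illustrative: the patterns are just the message
fragments from the logs above, and the "Mmm dd hh:mm:ss" syslog prefix
is an assumption about the log format) to count evictions, timeouts and
refused reconnections per minute:

    #!/usr/bin/env python
    # Sketch: bucket Lustre eviction/timeout/reconnect messages per minute
    # from a syslog file, to see which failure shows up first.
    # Assumes the standard "Mmm dd hh:mm:ss" syslog timestamp prefix.

    import sys
    from collections import defaultdict

    PATTERNS = {
        'evicted':   'evicting it',
        'timed_out': 'has timed out',
        'refused':   'refuse reconnection',
    }

    def main(path):
        buckets = defaultdict(lambda: defaultdict(int))
        for line in open(path):
            minute = line[:12]  # "Oct 28 10:15" -- string sort is fine within one day
            for name, fragment in PATTERNS.items():
                if fragment in line:
                    buckets[minute][name] += 1
        for minute in sorted(buckets):
            c = buckets[minute]
            print('%s  evicted=%d timed_out=%d refused=%d'
                  % (minute, c['evicted'], c['timed_out'], c['refused']))

    if __name__ == '__main__':
        main(sys.argv[1] if len(sys.argv) > 1 else '/var/log/messages')

If the evictions consistently show up a minute or two before the
timeout storm, that would support the theory.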
Any thoughts on this? How can I tell whether our switch (a Netgear
GS748T) is up to the task? What type of network backend would you
recommend for a cluster of this size?
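In the meantime, I was planning to load the switch artificially with
something like the sketch below (plain TCP sockets, nothing
Lustre-specific; port 5001 and the 30 s duration are arbitrary choices)
and watch the port counters while it runs. Blasting from many clients
toward one sink node should at least mimic the many-clients-to-one-MGS
traffic pattern:

    #!/usr/bin/env python
    # Sketch of a crude throughput test (plain TCP, nothing Lustre-specific)
    # to load several switch ports at once while watching the port counters.
    # Run "sink" on one node and "blast <sink-host>" on many others at once.

    import socket
    import sys
    import threading
    import time

    PORT = 5001              # arbitrary unprivileged port (assumption)
    CHUNK = b'x' * 65536     # 64 KiB payload
    SECONDS = 30             # test duration per client

    def drain(conn, addr):
        total, start = 0, time.time()
        while True:
            data = conn.recv(65536)
            if not data:
                break
            total += len(data)
        print('%s: %.1f MB/s' % (addr[0], total / (time.time() - start) / 1e6))

    def sink():
        srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(('', PORT))
        srv.listen(64)
        while True:
            conn, addr = srv.accept()
            t = threading.Thread(target=drain, args=(conn, addr))
            t.daemon = True
            t.start()

    def blast(host):
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect((host, PORT))
        end = time.time() + SECONDS
        while time.time() < end:
            s.sendall(CHUNK)
        s.close()

    if __name__ == '__main__':
        if sys.argv[1] == 'sink':
            sink()
        else:
            blast(sys.argv[1])

If the switch's RX errors climb as soon as several ports push line rate
at once, that would point at the switch (or its buffers) after all.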
regards,
Arne
--
Arne Brutschy
Ph.D. Student Email arne.brutschy(AT)ulb.ac.be
IRIDIA CP 194/6 Web iridia.ulb.ac.be/~abrutschy
Université Libre de Bruxelles Tel +32 2 650 2273
Avenue Franklin Roosevelt 50 Fax +32 2 650 2715
1050 Bruxelles, Belgium (Fax at IRIDIA secretary)