Hi all, I have a weird problem on one of my OSSs, though I've seen it once on the other OSS. Things will be humming along nicely, when suddenly I get lots of messages like this:
Jan 29 15:26:16 venus kernel: Lustre: 898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.1...@tcp
Jan 29 15:26:16 venus kernel: Lustre: 1090:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.1...@tcp

In the 50-odd minutes before I picked it up, it produced over 10 million such lines in /var/log/messages. Performance degrades systematically on all clients during this time. On the client node in question, IO is disrupted until I unmount and remount the OSTs. Then the problem goes away for a week or so. The part of the logs where this error starts is at the end of this mail.

Oh, and I can ping/ssh the machine in question from the server in question at the time of the problem, so it doesn't seem to be a general networking problem.

A bit of info regarding my setup, in case it has something to do with this: I have a shared MDS/MGS and two OSSs, all on Dell servers. The OSS that's giving me the most headaches has 8GB RAM, 4 x 4TB OSTs and an Intel 10Gb NIC. The other OSS has 4GB RAM, 1 x 4.8TB OST and two gigabit Intel NICs that are bonded using the 802.3ad dynamic link aggregation protocol. I have about 200 clients connecting to this file system. I have another Lustre system, consisting of Intel-based component servers, that acts as a mirror. This system has been running fine. All the servers are running CentOS 5.4 64-bit and Lustre 1.8.1.1. The clients are running the SUSE 11 Lustre kernel.

So, does anybody know what's going on here? Or have any pointers as to how I can debug this?

Any and all help appreciated,
Deon

/var/log/messages just before the flood starts:

Jan 29 14:28:54 venus kernel: Lustre: 4254:0:(socklnd_cb.c:2173:ksocknal_find_timed_out_conn()) A connection with 12345-192.168.0...@tcp (192.168.0.99:1023) timed out; the network or node may be down.
Jan 29 14:31:42 venus kernel: Lustre: galaxy-OST0000: haven't heard from client 0afbaa24-aa3c-07d6-5752-10300b3997ba (at 192.168.0...@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jan 29 14:31:42 venus kernel: Lustre: galaxy-OST0003: haven't heard from client 0afbaa24-aa3c-07d6-5752-10300b3997ba (at 192.168.0...@tcp) in 227 seconds. I think it's dead, and I am evicting it.
Jan 29 14:31:42 venus kernel: Lustre: Skipped 1 previous similar message
Jan 29 14:38:45 venus kernel: Lustre: 4254:0:(socklnd_cb.c:2173:ksocknal_find_timed_out_conn()) A connection with 12345-192.168.1...@tcp (192.168.1.26:1023) timed out; the network or node may be down.
Jan 29 14:40:50 venus kernel: Lustre: 4251:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:40:50 venus kernel: Lustre: 4251:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.1...@tcp at host 192.168.1.26 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
Jan 29 14:40:50 venus kernel: Lustre: 4251:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 192.168.0...@tcp->192.168.1...@tcp
Jan 29 14:40:54 venus kernel: Lustre: 1090:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1325271916821767 sent from galaxy-OST0001 to NID 192.168.1...@tcp 7s ago has timed out (limit 7s).
Jan 29 14:40:54 venus kernel: r...@ffff81016c500000 x1325271916821767/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768854 ref 1 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:57 venus kernel: Lustre: 4253:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:40:57 venus kernel: Lustre: 4253:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.1...@tcp at host 192.168.1.26 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
Jan 29 14:40:57 venus kernel: Lustre: 4253:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 192.168.0...@tcp->192.168.1...@tcp
Jan 29 14:40:59 venus kernel: Lustre: 898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.1...@tcp
Jan 29 14:40:59 venus kernel: LustreError: 898:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 r...@ffff8102b94ff000 x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768866 ref 2 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:59 venus kernel: LustreError: 898:0:(events.c:66:request_out_callback()) Skipped 3929552 previous similar messages
Jan 29 14:40:59 venus kernel: Lustre: 898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1...@tcp 0s ago has failed due to network error (limit 7s).
Jan 29 14:40:59 venus kernel: r...@ffff8102b94ff000 x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768866 ref 1 fl Rpc:/0/0 rc 0/0
Jan 29 14:40:59 venus kernel: Lustre: 898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.1...@tcp
Jan 29 14:40:59 venus last message repeated 648 times
Jan 29 14:41:02 venus kernel: Lustre: 4252:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:41:02 venus kernel: Lustre: 4252:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.1...@tcp at host 192.168.1.26 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
Jan 29 14:41:02 venus kernel: Lustre: 4252:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 192.168.0...@tcp->192.168.1...@tcp
Jan 29 14:41:02 venus kernel: Lustre: 4252:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 192.168.0...@tcp->192.168.1...@tcp
Jan 29 14:41:06 venus kernel: Lustre: 898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1...@tcp 7s ago has timed out (limit 7s).
Jan 29 14:41:06 venus kernel: r...@ffff8102b94ff000 x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768866 ref 1 fl Rpc:/2/0 rc 0/0
Jan 29 14:41:06 venus kernel: Lustre: 898:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 650 previous similar messages
Jan 29 14:41:09 venus kernel: Lustre: 4250:0:(linux-tcpip.c:688:libcfs_sock_connect()) Error -113 connecting 192.168.0.16/1023 -> 192.168.1.26/988
Jan 29 14:41:09 venus kernel: Lustre: 4250:0:(acceptor.c:95:lnet_connect_console_error()) Connection to 192.168.1...@tcp at host 192.168.1.26 was unreachable: the network or that node may be down, or Lustre may be misconfigured.
Jan 29 14:41:09 venus kernel: Lustre: 4250:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 192.168.0...@tcp->192.168.1...@tcp
Jan 29 14:41:09 venus kernel: Lustre: 4250:0:(socklnd_cb.c:421:ksocknal_txlist_done()) Deleting packet type 1 len 296 192.168.0...@tcp->192.168.1...@tcp
Jan 29 14:41:13 venus kernel: Lustre: 898:0:(client.c:1383:ptlrpc_expire_one_request()) @@@ Request x1325271916822084 sent from galaxy-OST0002 to NID 192.168.1...@tcp 7s ago has timed out (limit 7s).
Jan 29 14:41:13 venus kernel: r...@ffff8102b94ff000 x1325271916822084/t0 o106->@:15/16 lens 296/424 e 0 to 1 dl 1264768873 ref 1 fl Rpc:/2/0 rc 0/0
Jan 29 14:41:13 venus kernel: Lustre: 898:0:(client.c:1383:ptlrpc_expire_one_request()) Skipped 1 previous similar message
Jan 29 14:41:13 venus kernel: Lustre: 898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.1...@tcp
Jan 29 14:41:15 venus last message repeated 84370 times
Jan 29 14:41:15 venus kernel: Lustre: 1090:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.1...@tcp
Jan 29 14:41:15 venus kernel: Lustre: 898:0:(socklnd_cb.c:915:ksocknal_launch_packet()) No usable routes to 12345-192.168.1...@tcp

--
Deon Borman
IT Supervisor
BlackGinger
--
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
