Ulrich Sibiller wrote:
> Hi,
>
> we are experiencing some weird behaviour on one of our Lustre clients.
>
> First some information about our environment:
> - Lustre 1.8.1.1 CentOS52 (kernel 2.6.18-92.1.13.el5)
> - self-compiled patchless clients with quotas enabled (quotas not used at the moment)
> - InfiniBand interconnect, OFED 1.3.3
> - OSS: 2x Sun Fire X4540 with 48TB, official Sun kernel 2.6.18-128.7.1.el5_lustre.1.8.1.1, 4 OSTs on each of them, OFED 1.3.3
> - MDS: 2x Sun Fire X4100 M2 + 1x StorEdge 3320 with heartbeat failover, official Sun kernel 2.6.18-128.7.1.el5_lustre.1.8.1.1, OFED 1.4.2
> - Lustre mounted on /hpcscr
> - OSS1 is 192.168.60....@o2ib, hostname is hpc9oss1
> - OSS2 is 192.168.60....@o2ib, hostname is hpc9oss2
> - MDS1 is 192.168.60....@o2ib, hostname is hpc9mds1 (active)
> - MDS2 is 192.168.60....@o2ib, hostname is hpc9mds2 (standby; was active for a short time while mds1 was lifted from 1.6.7.1 to 1.8.1.1)
> - problematic client is 192.168.60....@o2ib, hostname hpc9master02
> - no problems on the InfiniBand
>
> Problem:
> Users report a slow Lustre filesystem on this particular machine (hpc9master02). Running "find /hpcscr -ls" gets stuck after some time. Most of the time it continues after a few seconds, but sometimes it takes several minutes, and sometimes I get errors (one "I/O error", then several "Cannot send after transport endpoint shutdown") and the find terminates. The IB error counters do not change during this test.
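The stalls described above can be made visible by timestamping each line that find emits, so a multi-second gap between two entries pinpoints where the traversal hangs. A minimal sketch (the `timed_find` helper name is made up here; /hpcscr is the mount point from the report):

```shell
# Timestamp every line find emits; a long gap between two consecutive
# entries marks exactly where the traversal stalled.
# timed_find is a hypothetical helper name; on the affected client one
# would run: timed_find /hpcscr
timed_find() {
    find "$1" -ls 2>&1 | while IFS= read -r line; do
        printf '%s  %s\n' "$(date +%T)" "$line"
    done
}

# Demonstrated on /tmp, which exists on any client:
timed_find /tmp | head -n 3
```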
I discovered that this problem only arises when this client uses the InfiniBand connection (o2ib); the IB port counters do not increase anywhere, and running over Ethernet works perfectly. I am now running Lustre 1.8.2 on the client and on all Lustre servers, and I have exchanged the client's IB cable, used a different switch port, and tried the client's other HCA port, but it still does not work. With 1.8.2 I constantly see error -113 on the client and all Lustre servers:

Feb 10 13:30:38 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) cfd1-OST0000-osc-ffff812025f37c00: tried all connections, increasing latency to 3s
Feb 10 13:30:38 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) Skipped 2 previous similar messages
Feb 10 13:30:45 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) cfd1-OST0002-osc-ffff812025f37c00: tried all connections, increasing latency to 3s
Feb 10 13:30:45 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) Skipped 1 previous similar message
Feb 10 13:31:49 hpc9master02 kernel: Lustre: 4609:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1327292548671897 sent from cfd1-OST0007-osc-ffff812025f37c00 to NID 192.168.60....@o2ib 44s ago has timed out (44s prior to deadline).
Feb 10 13:31:49 hpc9master02 kernel: r...@ffff811fbb893000 x1327292548671897/t0 o101->[email protected]@o2ib:28/4 lens 296/544 e 0 to 1 dl 1265805109 ref 1 fl Rpc:/0/0 rc 0/0
Feb 10 13:31:49 hpc9master02 kernel: Lustre: 4609:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 12 previous similar messages
Feb 10 13:31:49 hpc9master02 kernel: Lustre: cfd1-OST0007-osc-ffff812025f37c00: Connection to service cfd1-OST0007 via nid 192.168.60....@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 13:31:49 hpc9master02 kernel: Lustre: Skipped 4 previous similar messages
Feb 10 13:31:49 hpc9master02 kernel: Lustre: cfd1-OST0007-osc-ffff812025f37c00: Connection restored to service cfd1-OST0007 using nid 192.168.60....@o2ib.
Feb 10 13:31:49 hpc9master02 kernel: Lustre: Skipped 4 previous similar messages
Feb 10 13:32:27 hpc9master02 kernel: Lustre: cfd1-MDT0000-mdc-ffff812025f37c00: Connection to service cfd1-MDT0000 via nid 192.168.60....@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 13:32:27 hpc9master02 kernel: Lustre: Skipped 1 previous similar message
Feb 10 13:32:27 hpc9master02 kernel: Lustre: cfd1-MDT0000-mdc-ffff812025f37c00: Connection restored to service cfd1-MDT0000 using nid 192.168.60....@o2ib.
Feb 10 13:32:27 hpc9master02 kernel: Lustre: Skipped 1 previous similar message
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-192.168.60....@o2ib: -113
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(lib-move.c:2436:LNetPut()) Skipped 1 previous similar message
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 r...@ffff811ff264c000 x1327292548700295/t0 o400->[email protected]@o2ib:26/25 lens 192/384 e 0 to 1 dl 1265805221 ref 2 fl Rpc:N/0/0 rc 0/0
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 4475:0:(events.c:66:request_out_callback()) Skipped 1 previous similar message
Feb 10 13:33:24 hpc9master02 kernel: LustreError: 166-1: mgc192.168.60....@o2ib: Connection to service MGS via nid 192.168.60....@o2ib was lost; in progress operations using this service will fail.
Feb 10 13:33:31 hpc9master02 kernel: Lustre: cfd1-OST0002-osc-ffff812025f37c00: Connection restored to service cfd1-OST0002 using nid 192.168.60....@o2ib.
Feb 10 13:33:32 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) cfd1-MDT0000-mdc-ffff812025f37c00: tried all connections, increasing latency to 2s
Feb 10 13:33:32 hpc9master02 kernel: LustreError: 4476:0:(lib-move.c:2436:LNetPut()) Error sending PUT to 12345-192.168.60....@o2ib: -113
Feb 10 13:33:32 hpc9master02 kernel: LustreError: 4476:0:(lib-move.c:2436:LNetPut()) Skipped 3 previous similar messages
Feb 10 13:33:32 hpc9master02 kernel: LustreError: 4476:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 r...@ffff811b7da91800 x1327292548700311/t0 o250->[email protected]@o2ib:26/25 lens 368/584 e 0 to 1 dl 1265805218 ref 2 fl Rpc:N/0/0 rc 0/0
Feb 10 13:33:32 hpc9master02 kernel: LustreError: 4476:0:(events.c:66:request_out_callback()) Skipped 3 previous similar messages
Feb 10 13:33:32 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) Skipped 1 previous similar message
Feb 10 13:33:39 hpc9master02 kernel: Lustre: mgc192.168.60....@o2ib: Reactivating import
Feb 10 13:33:40 hpc9master02 kernel: Lustre: 4477:0:(import.c:524:import_select_connection()) cfd1-MDT0000-mdc-ffff812025f37c00: tried all connections, increasing latency to 3s
Feb 10 13:34:40 hpc9master02 kernel: Lustre: 4609:0:(client.c:1434:ptlrpc_expire_one_request()) @@@ Request x1327292548717600 sent from cfd1-OST0002-osc-ffff812025f37c00 to NID 192.168.60....@o2ib 46s ago has timed out (46s prior to deadline).
Feb 10 13:34:40 hpc9master02 kernel: r...@ffff811b7d81a000 x1327292548717600/t0 o101->[email protected]@o2ib:28/4 lens 296/544 e 0 to 1 dl 1265805280 ref 1 fl Rpc:/0/0 rc 0/0
Feb 10 13:34:40 hpc9master02 kernel: Lustre: 4609:0:(client.c:1434:ptlrpc_expire_one_request()) Skipped 13 previous similar messages
Feb 10 13:34:40 hpc9master02 kernel: Lustre: cfd1-OST0002-osc-ffff812025f37c00: Connection to service cfd1-OST0002 via nid 192.168.60....@o2ib was lost; in progress operations using this service will wait for recovery to complete.
Feb 10 13:34:40 hpc9master02 kernel: Lustre: Skipped 2 previous similar messages

According to

r...@hpc9master02 network-scripts # find /usr/include -name "errno*" | xargs grep -E "\<113\>"
/usr/include/asm-generic/errno.h:#define EHOSTUNREACH 113 /* No route to host */

this error means "no route to host". How can this happen?

Uli

-- 
__________________________________creating IT solutions
Dipl.-Inf. Ulrich Sibiller       science + computing ag
System Administration            Hagellocher Weg 73
fax +49 7071 9457 411            72070 Tuebingen, Germany
teamline +49 7071 9457 674       www.science-computing.de
--
Vorstand/Board of Management: Dr. Bernd Finkbeiner, Dr. Roland Niemeier, Dr. Arno Steitz, Dr. Ingrid Zech
Vorsitzender des Aufsichtsrats/Chairman of the Supervisory Board: Michel Lepert
Sitz/Registered Office: Tuebingen
Registergericht/Registration Court: Stuttgart
Registernummer/Commercial Register No.: HRB 382196

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
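Grepping the kernel headers, as above, is one way to decode the number; the C library carries the same errno-to-message mapping via strerror(3), so the translation can also be done in one line (shown here through Python's os.strerror, purely as a convenience):

```shell
# Translate an errno value to its glibc message via strerror(3)
# (accessed through Python's os.strerror for brevity):
python3 -c 'import os, sys; print(os.strerror(int(sys.argv[1])))' 113
# prints: No route to host
```

-113 (EHOSTUNREACH) surfacing from LNetPut() means LNet itself could not route the message to the peer NID, i.e. the failure is below the Lustre RPC layer, in the o2ib network path.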
