With 1.8.3 clients and 1.8.3 OSSs, a couple of my nodes seem to have lost connection to an OSS. If I do lfs df, I get the following:
lustre_0-OST0028_UUID: Resource temporarily unavailable
lustre_0-OST0029_UUID: Resource temporarily unavailable
lustre_0-OST002a_UUID: Resource temporarily unavailable
lustre_0-OST002b_UUID: Resource temporarily unavailable
lustre_0-OST002c_UUID 6486115712 3882764932 2603348732 59% /mnt/lustre_0[OST:44]
lustre_0-OST002d_UUID 6486115712 3797895540 2688209196 58% /mnt/lustre_0[OST:45]
lustre_0-OST002e_UUID 6486115712 3717364684 2768740788 57% /mnt/lustre_0[OST:46]
lustre_0-OST002f_UUID 6486115712 3535928996 2950180572 54% /mnt/lustre_0[OST:47]

This has happened on several machines. Rebooting them seems to cure it. There are a large number of error messages in the logs - eg:

Jul 7 18:22:14 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774596107 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4....@tcp 21s ago has timed out (21s prior to deadline).
Jul 7 18:22:14 cn458 kernel: r...@ffff8100841ed000 x1340150774596107/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1278523334 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 7 18:22:14 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 52 previous similar messages
Jul 7 18:23:06 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) lustre_0-OST0004-osc-ffff81021f55a400: tried all connections, increasing latency to 19s
Jul 7 18:23:06 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) Skipped 58 previous similar messages
Jul 7 18:26:48 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774596722 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4....@tcp 30s ago has timed out (30s prior to deadline).
Jul 7 18:26:48 cn458 kernel: r...@ffff8101e00d1800 x1340150774596722/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1278523608 ref 2 fl Rpc:N/0/0 rc 0/0
Jul 7 18:26:48 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) Skipped 95 previous similar messages
Jul 7 18:28:22 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) lustre_0-OST0028-osc-ffff81021f55a400: tried all connections, increasing latency to 25s
Jul 7 18:28:22 cn458 kernel: Lustre: 3816:0:(import.c:517:import_select_connection()) Skipped 84 previous similar messages
Jul 7 18:35:35 cn458 kernel: Lustre: 3815:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1340150774597865 sent from lustre_0-OST0028-osc-ffff81021f55a400 to NID 10.1.4....@tcp 30s ago has timed out (30s prior to deadline).
Jul 7 18:35:35 cn458 kernel: r...@ffff8101d66d6800 x1340150774597865/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1278524135 ref 2 fl Rpc:N/0/0 rc 0/0

Is there a known problem? What information would help debug this?

Chris

PS clients are on bonded 1GigE, servers 10GigE (if that makes a difference).
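
For anyone comparing affected nodes, here is a minimal sketch in Python of how the unavailable OSTs could be pulled out of saved lfs df output. It assumes the output has one OST per line in the form shown above; the function name unavailable_osts and the regular expression are illustrative only and are not part of Lustre. It converts the hexadecimal OST index to decimal (for example, OST002c is OST:44 in the listing above), which makes it easier to match targets against server-side logs.

```python
import re

# Hypothetical pattern: matches lines such as
# "lustre_0-OST0028_UUID: Resource temporarily unavailable".
# Healthy OST lines have no colon after _UUID, so they do not match.
UNAVAILABLE = re.compile(
    r"^(?P<fs>\S+)-OST(?P<idx>[0-9a-fA-F]{4})_UUID\s*:\s*(?P<reason>.+)$"
)


def unavailable_osts(lfs_df_output):
    """Return (filesystem, decimal OST index, reason) for each OST
    that the given lfs df text reports with an error instead of usage."""
    found = []
    for line in lfs_df_output.splitlines():
        match = UNAVAILABLE.match(line.strip())
        if match:
            index = int(match.group("idx"), 16)  # OST index is hexadecimal
            found.append((match.group("fs"), index, match.group("reason").strip()))
    return found


if __name__ == "__main__":
    import sys

    # Usage (assumed): lfs df | python this_script.py
    for fs, index, reason in unavailable_osts(sys.stdin.read()):
        print(f"{fs} OST:{index} {reason}")
```

Running this on each affected client and comparing the indices would show whether the same OSTs (and therefore the same OSS) are unreachable everywhere.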
