Hi,

We have a Lustre 1.6.6 installation using InfiniBand-attached DDN OST storage, with the OSSes connected to the network through 10GE adapters. When running iozone with ~40 1GE-attached clients, we see the following on the clients:

Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff8100a01c4000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff810050164000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff81031b920000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff81032192a000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff81001b20c000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff810128406000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff81018c6c2000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff810067fce000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5, desc ffff8102a7c62000
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 r...@ffff81037f08b000 x35161916/t0 o4->[email protected]@tcp:6/4 lens 384/480 e 0 to 100 dl 1236869066 ref 3 fl Rpc:/0/0 rc 0/0
Mar 12 14:42:46 com01-06 kernel: LustreError: 4193:0:(events.c:66:request_out_callback()) Skipped 11 previous similar messages
Mar 12 14:42:46 com01-06 kernel: Lustre: Request x35161916 sent from test1-OST0008-osc-ffff810324e8e000 to NID 172.23.125...@tcp 0s ago has timed out (limit 100s).
Mar 12 14:42:46 com01-06 kernel: Lustre: Skipped 8 previous similar messages
Mar 12 14:42:46 com01-06 kernel: Lustre: test1-OST0008-osc-ffff810324e8e000: Connection to service test1-OST0008 via nid 172.23.125...@tcp was lost; in progress operations using this service will wait for recovery to complete.

And this on the OSS:

Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) [ffff81001f6fc000] Error -14 on read from 12345-172.23.98....@tcp ip 172.23.98.133:1021
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) Skipped 5 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(socklnd.c:1631:ksocknal_destroy_conn()) Completing partial receive from 12345-172.23.98....@tcp, ip 172.23.98.133:1021, with error
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(socklnd.c:1631:ksocknal_destroy_conn()) Skipped 4 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff810049430000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 6699:0:(ost_handler.c:1153:ost_brw_write()) @@@ network error on bulk GET 0(1048576) r...@ffff81007792dc50 x35161902/t0 o4->8ec45cac-9f38-63c9-eb19-b4bad0242...@net_0x20000ac176285_uuid:0/0 lens 384/352 e 0 to 0 dl 1236869066 ref 1 fl Interpret:/0/0 rc 0/0
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 6699:0:(ost_handler.c:1153:ost_brw_write()) Skipped 4 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff8100528b2000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6680:0:(ost_handler.c:1284:ost_brw_write()) test1-OST0010: ignoring bulk IO comm error with bfb4f76d-1090-a175-89cd-7f51df10c...@net_0x20000ac17628d_uuid id 12345-172.23.98....@tcp - client will retry
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6680:0:(ost_handler.c:1284:ost_brw_write()) Skipped 85 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff8100633fa000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff81007ea56000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff8100690ea000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError: 5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5, desc ffff810044aa0000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6509:0:(ldlm_lib.c:538:target_handle_reconnect()) test1-OST0008: 8ec45cac-9f38-63c9-eb19-b4bad0242b73 reconnecting
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6509:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 8 previous similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6509:0:(ldlm_lib.c:773:target_handle_connect()) test1-OST0008: refuse reconnection from 8ec45cac-9f38-63c9 [email protected]@tcp to 0xffff810023258000; still busy with 12 active RPCs
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre: 6509:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 5 previous similar messages

What could explain this behaviour? What does "Error -14" on the OSS mean?

Gerd.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
