Hi,

We have a Lustre 1.6.6 installation using InfiniBand-attached DDN OST
storage, with the OSSes connected to the network through 10GE adapters.
When running iozone on ~40 1GE-attached clients, we see the following on
the clients:

Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff8100a01c4000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff810050164000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff81031b920000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff81032192a000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff81001b20c000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff810128406000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff81018c6c2000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff810067fce000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:194:client_bulk_callback()) event type 0, status -5,
desc ffff8102a7c62000
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:66:request_out_callback()) @@@ type 4, status -5 
r...@ffff81037f08b000 x35161916/t0
o4->[email protected]@tcp:6/4 lens 384/480 e 0 to 100 dl
1236869066 ref 3 fl Rpc:/0/0 rc 0/0
Mar 12 14:42:46 com01-06 kernel: LustreError:
4193:0:(events.c:66:request_out_callback()) Skipped 11 previous similar
messages
Mar 12 14:42:46 com01-06 kernel: Lustre: Request x35161916 sent from
test1-OST0008-osc-ffff810324e8e000 to NID 172.23.125...@tcp 0s ago has
timed out (limit 100s).
Mar 12 14:42:46 com01-06 kernel: Lustre: Skipped 8 previous similar messages
Mar 12 14:42:46 com01-06 kernel: Lustre:
test1-OST0008-osc-ffff810324e8e000: Connection to service test1-OST0008
via nid 172.23.125...@tcp was lost; in progress operations using this
service will wait for recovery to complete.




And this on the OSS:

Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) [ffff81001f6fc000]
Error -14 on read from 12345-172.23.98....@tcp ip 172.23.98.133:1021
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5469:0:(socklnd_cb.c:1291:ksocknal_process_receive()) Skipped 5 previous
similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(socklnd.c:1631:ksocknal_destroy_conn()) Completing partial
receive from 12345-172.23.98....@tcp, ip 172.23.98.133:1021, with error
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(socklnd.c:1631:ksocknal_destroy_conn()) Skipped 4 previous
similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff810049430000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
6699:0:(ost_handler.c:1153:ost_brw_write()) @@@ network error on bulk
GET 0(1048576)  r...@ffff81007792dc50 x35161902/t0
o4->8ec45cac-9f38-63c9-eb19-b4bad0242...@net_0x20000ac176285_uuid:0/0
lens 384/352 e 0 to 0 dl 1236869066 ref 1 fl Interpret:/0/0 rc 0/0
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
6699:0:(ost_handler.c:1153:ost_brw_write()) Skipped 4 previous similar
messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff8100528b2000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6680:0:(ost_handler.c:1284:ost_brw_write()) test1-OST0010: ignoring bulk
IO comm error with
bfb4f76d-1090-a175-89cd-7f51df10c...@net_0x20000ac17628d_uuid id
12345-172.23.98....@tcp - client will retry
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6680:0:(ost_handler.c:1284:ost_brw_write()) Skipped 85 previous similar
messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff8100633fa000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff81007ea56000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff8100690ea000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: LustreError:
5481:0:(events.c:372:server_bulk_callback()) event type 2, status -5,
desc ffff810044aa0000
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6509:0:(ldlm_lib.c:538:target_handle_reconnect()) test1-OST0008:
8ec45cac-9f38-63c9-eb19-b4bad0242b73 reconnecting
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6509:0:(ldlm_lib.c:538:target_handle_reconnect()) Skipped 8 previous
similar messages
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6509:0:(ldlm_lib.c:773:target_handle_connect()) test1-OST0008: refuse
reconnection from 8ec45cac-9f38-63c9
[email protected]@tcp to 0xffff810023258000; still busy
with 12 active RPCs
Mar 12 14:42:46 cs04r-sc-oss01-01 kernel: Lustre:
6509:0:(ldlm_lib.c:773:target_handle_connect()) Skipped 5 previous
similar messages



What could explain this behaviour?  What does "Error -14" in the
ksocknal_process_receive() messages on the OSS mean?
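
For reference, the status values above can be decoded on any Linux host if
they follow the usual kernel convention of negated errno codes. That is an
assumption on my part, not something the logs confirm. A minimal sketch
(describe_status is a hypothetical helper name, not part of Lustre):

```python
import errno
import os


def describe_status(status: int) -> str:
    """Map a kernel-style negative status (e.g. -5) to an errno name and text.

    Assumes the value is a negated errno code; this is a convention, not
    something the log lines themselves state.
    """
    code = abs(status)
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{status}: {name} ({os.strerror(code)})"


if __name__ == "__main__":
    # -5 appears in the bulk callbacks, -14 in ksocknal_process_receive().
    for status in (-5, -14):
        print(describe_status(status))
```

This only names the codes; it does not explain why the socket read on the
OSS fails in the first place.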

Gerd.

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
