Good Morning Folks,
We're (seemingly suddenly) getting some fairly odd I/O pauses of about
20-30 seconds during client writes into one of our file systems (specifically
an rsync from an NFS mount to a Lustre file system). On the client, we're
seeing blocks similar to the following when a pause occurs:
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff880080ec4000
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1819:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff880034c72000
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -113, desc
ffff8803c6658000
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff8805a283e000
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff8805b1b0e000
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff8805ca086000
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff88054b762000
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff8805ae49c000
Nov 8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff88045cb74000
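In case it helps with reading these: assuming the status field is a negated Linux errno (an assumption on my part, the log itself does not say so), -5 would be EIO and -113 would be EHOSTUNREACH. Here is a rough Python sketch for tallying the codes in a saved client log. The names `LINUX_ERRNO` and `tally_statuses` are made up for illustration, and the table only covers the two codes seen above.

```python
import re
from collections import Counter

# Assumption: Linux errno numbering. Only the two codes seen in the
# client log above are listed; anything else is reported by number.
LINUX_ERRNO = {5: "EIO", 113: "EHOSTUNREACH"}

# Matches the "status -5" / "status -113" field of a bulk callback message.
STATUS_RE = re.compile(r"status (-?\d+)")


def tally_statuses(log_text):
    """Count bulk-callback status codes, keyed by errno name."""
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        code = abs(int(match.group(1)))
        counts[LINUX_ERRNO.get(code, f"errno {code}")] += 1
    return counts


# Two of the client messages from above, with the syslog prefix trimmed.
sample = (
    "1809:0:(events.c:198:client_bulk_callback()) event type 0, "
    "status -5, desc ffff880080ec4000\n"
    "1809:0:(events.c:198:client_bulk_callback()) event type 0, "
    "status -113, desc ffff8803c6658000\n"
)

for name, count in tally_statuses(sample).items():
    print(name, count)
```

Run against the full client log, this would show whether the EHOSTUNREACH entries are a one-off or recur with each pause.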
On the OSS, we can see (note: 10.0.2.8 is the client in question):
Nov 8 09:21:18 n0002.lustre LustreError:
8731:0:(socklnd.c:1671:ksocknal_destroy_conn()) Completing partial receive from
12345-10.0.2.8@tcp[2], ip 10.0.2.8:1021, with error, wanted: 8192, left: 8192,
last alive is 1 secs ago
Nov 8 09:21:18 n0002.lustre kernel: LustreError:
8731:0:(socklnd.c:1671:ksocknal_destroy_conn()) Completing partial receive from
12345-10.0.2.8@tcp[2], ip 10.0.2.8:1021, with error, wanted: 8192, left: 8192,
last alive is 1 secs ago
Nov 8 09:21:18 n0002.lustre kernel: LustreError:
8731:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc
ffff8103be200000
Nov 8 09:21:18 n0002.lustre LustreError:
8731:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc
ffff8103be200000
Nov 8 09:21:18 n0002.lustre LustreError:
9141:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET
0(1048576) req@ffff8104178a6c00 x1412852387822649/t0
o4->81cf6d57-d07f-6bef-2fef-ca8a980c718e@:0/0 lens 448/416 e 1 to 0 dl
1352395330 ref 1 fl Interpret:/0/0 rc 0/0
Nov 8 09:21:18 n0002.lustre kernel: LustreError:
9141:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET
0(1048576) req@ffff8104178a6c00 x1412852387822649/t0
o4->81cf6d57-d07f-6bef-2fef-ca8a980c718e@:0/0 lens 448/416 e 1 to 0 dl
1352395330 ref 1 fl Interpret:/0/0 rc 0/0
Nov 8 09:21:18 n0002.lustre Lustre:
9141:0:(ost_handler.c:1224:ost_brw_write()) lrc-OST0009: ignoring bulk IO comm
error with 81cf6d57-d07f-6bef-2fef-ca8a980c718e@ id 12345-10.0.2.8@tcp - client
will retry
Nov 8 09:21:18 n0002.lustre kernel: Lustre:
9141:0:(ost_handler.c:1224:ost_brw_write()) lrc-OST0009: ignoring bulk IO comm
error with 81cf6d57-d07f-6bef-2fef-ca8a980c718e@ id 12345-10.0.2.8@tcp - client
will retry
Nov 8 09:21:24 n0002.lustre Lustre:
8978:0:(ldlm_lib.c:574:target_handle_reconnect()) lrc-OST0004:
81cf6d57-d07f-6bef-2fef-ca8a980c718e reconnecting
Nov 8 09:21:24 n0002.lustre Lustre:
8978:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar
messages
Nov 8 09:21:24 n0002.lustre kernel: Lustre:
8978:0:(ldlm_lib.c:574:target_handle_reconnect()) lrc-OST0004:
81cf6d57-d07f-6bef-2fef-ca8a980c718e reconnecting
Nov 8 09:21:24 n0002.lustre kernel: Lustre:
8978:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar
messages
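Note that each OSS message above shows up twice, once with a "kernel:" tag after the host name and once without, so the error count is half of what it looks like. A minimal sketch for collapsing the pairs, assuming each message has first been joined onto a single line and that the two copies differ only by that tag (`collapse_duplicates` is a hypothetical helper, not a Lustre tool):

```python
import re

# Assumption: the duplicate copies differ only by an optional "kernel: "
# tag that follows the "Mon D HH:MM:SS host" syslog prefix.
KERNEL_TAG_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d \S+) kernel: ")


def collapse_duplicates(lines):
    """Drop the "kernel:" tag and keep the first copy of each message."""
    seen = set()
    unique = []
    for line in lines:
        key = KERNEL_TAG_RE.sub(r"\1 ", line.strip())
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# One message pair from the OSS log above, each joined onto a single line.
sample = [
    "Nov 8 09:21:24 n0002.lustre Lustre: "
    "8978:0:(ldlm_lib.c:574:target_handle_reconnect()) lrc-OST0004: "
    "81cf6d57-d07f-6bef-2fef-ca8a980c718e reconnecting",
    "Nov 8 09:21:24 n0002.lustre kernel: Lustre: "
    "8978:0:(ldlm_lib.c:574:target_handle_reconnect()) lrc-OST0004: "
    "81cf6d57-d07f-6bef-2fef-ca8a980c718e reconnecting",
]

for line in collapse_duplicates(sample):
    print(line)
```

One caveat: this would also merge two genuinely identical messages logged in the same second, so it is only meant for getting a rough count.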
Any ideas as to a cause? Is this network loss?
----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720