Good morning folks,
        We're (seemingly suddenly) getting some fairly odd I/O pauses of about 
20-30 seconds during client writes into one of our file systems (specifically, 
an rsync from an NFS mount to a Lustre file system).  On the client, we see 
blocks of messages similar to the following when a pause occurs:
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc 
ffff880080ec4000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1819:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc 
ffff880034c72000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -113, desc 
ffff8803c6658000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc 
ffff8805a283e000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc 
ffff8805b1b0e000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc 
ffff8805ca086000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc 
ffff88054b762000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc 
ffff8805ae49c000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError: 
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc 
ffff88045cb74000
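
To make the status fields above easier to read, here is a minimal sketch that tallies them by error name. It assumes the status values are negated Linux errno numbers (5 is EIO and 113 is EHOSTUNREACH on Linux); that interpretation is an assumption on my part, and the names `LINUX_ERRNO`, `STATUS_RE` and `summarize_bulk_status` are made up for this illustration.

```python
import re
from collections import Counter

# Assumption: status values are negated Linux errno numbers.
# Only the two codes that appear in the client log are listed here.
LINUX_ERRNO = {
    5: "EIO (I/O error)",
    113: "EHOSTUNREACH (no route to host)",
}

# Matches the event type and status fields of client/server bulk callback messages.
STATUS_RE = re.compile(r"_bulk_callback\(\)\) event type (\d+), status (-?\d+)")


def summarize_bulk_status(log_text):
    """Count bulk callback messages by decoded status."""
    # The log lines are wrapped, so collapse all whitespace before matching.
    flat = " ".join(log_text.split())
    counts = Counter()
    for _event_type, status in STATUS_RE.findall(flat):
        code = abs(int(status))
        counts[LINUX_ERRNO.get(code, "errno %d" % code)] += 1
    return counts


if __name__ == "__main__":
    # Two entries copied from the client log above.
    sample = """Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -5, desc
ffff880080ec4000
Nov  8 09:19:50 lrc-xfer.scs00 lrc-xfer kernel: LustreError:
1809:0:(events.c:198:client_bulk_callback()) event type 0, status -113, desc
ffff8803c6658000
"""
    for name, count in summarize_bulk_status(sample).items():
        print(count, name)
```

If that reading of the status field is right, the single -113 among the -5 entries would suggest the client briefly had no route to the server, but I have not confirmed this.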

On the OSS, we can see (note: 10.0.2.8 is the client in question):

Nov  8 09:21:18 n0002.lustre LustreError: 
8731:0:(socklnd.c:1671:ksocknal_destroy_conn()) Completing partial receive from 
12345-10.0.2.8@tcp[2], ip 10.0.2.8:1021, with error, wanted: 8192, left: 8192, 
last alive is 1 secs ago 
Nov  8 09:21:18 n0002.lustre kernel: LustreError: 
8731:0:(socklnd.c:1671:ksocknal_destroy_conn()) Completing partial receive from 
12345-10.0.2.8@tcp[2], ip 10.0.2.8:1021, with error, wanted: 8192, left: 8192, 
last alive is 1 secs ago 
Nov  8 09:21:18 n0002.lustre kernel: LustreError: 
8731:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc 
ffff8103be200000 
Nov  8 09:21:18 n0002.lustre LustreError: 
8731:0:(events.c:381:server_bulk_callback()) event type 2, status -5, desc 
ffff8103be200000 
Nov  8 09:21:18 n0002.lustre LustreError: 
9141:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 
0(1048576)  req@ffff8104178a6c00 x1412852387822649/t0 
o4->81cf6d57-d07f-6bef-2fef-ca8a980c718e@:0/0 lens 448/416 e 1 to 0 dl 
1352395330 ref 1 fl Interpret:/0/0 rc 0/0 
Nov  8 09:21:18 n0002.lustre kernel: LustreError: 
9141:0:(ost_handler.c:1073:ost_brw_write()) @@@ network error on bulk GET 
0(1048576)  req@ffff8104178a6c00 x1412852387822649/t0 
o4->81cf6d57-d07f-6bef-2fef-ca8a980c718e@:0/0 lens 448/416 e 1 to 0 dl 
1352395330 ref 1 fl Interpret:/0/0 rc 0/0 
Nov  8 09:21:18 n0002.lustre Lustre: 
9141:0:(ost_handler.c:1224:ost_brw_write()) lrc-OST0009: ignoring bulk IO comm 
error with 81cf6d57-d07f-6bef-2fef-ca8a980c718e@ id 12345-10.0.2.8@tcp - client 
will retry 
Nov  8 09:21:18 n0002.lustre kernel: Lustre: 
9141:0:(ost_handler.c:1224:ost_brw_write()) lrc-OST0009: ignoring bulk IO comm 
error with 81cf6d57-d07f-6bef-2fef-ca8a980c718e@ id 12345-10.0.2.8@tcp - client 
will retry 
Nov  8 09:21:24 n0002.lustre Lustre: 
8978:0:(ldlm_lib.c:574:target_handle_reconnect()) lrc-OST0004: 
81cf6d57-d07f-6bef-2fef-ca8a980c718e reconnecting 
Nov  8 09:21:24 n0002.lustre Lustre: 
8978:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar 
messages 
Nov  8 09:21:24 n0002.lustre kernel: Lustre: 
8978:0:(ldlm_lib.c:574:target_handle_reconnect()) lrc-OST0004: 
81cf6d57-d07f-6bef-2fef-ca8a980c718e reconnecting 
Nov  8 09:21:24 n0002.lustre kernel: Lustre: 
8978:0:(ldlm_lib.c:574:target_handle_reconnect()) Skipped 5 previous similar 
messages 

Any ideas as to the cause?  Could this be packet loss on the network?
----------------
John White
HPC Systems Engineer
(510) 486-7307
One Cyclotron Rd, MS: 50C-3209C
Lawrence Berkeley National Lab
Berkeley, CA 94720

_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss
