Brian J. Murrell wrote:
> On Fri, 2009-03-06 at 10:45 +0100, Thomas Roth wrote:
>> Thanks Brian.
>
> NP.
>
>> What I meant: the average batch job that wants to read from or write to
>> Lustre will abort if a file cannot be accessed. The reason doesn't
>> matter to the jobs or the user.
>
> That may be so, but what I am saying is that when a Lustre client wants
> to perform an I/O operation on behalf of an application running on that
> machine, and the target it wants to do the I/O with is down, the Lustre
> client will wait and block the application's I/O indefinitely.
>
> That means that unless the application has some kind of timer in it so
> that it can abort the read(2)/write(2), it will wait forever, as the
> read(2) or write(2) system call that it issued will simply wait for the
> Lustre client to complete -- forever, if the target that the Lustre
> client wanted to do the I/O with never comes back.
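Right - just so we mean the same thing by "timer": here is a minimal sketch (my illustration only, nothing from our actual jobs; the function name and the timeout value are invented) of an application bounding a read(2) with alarm(2), so the blocked call returns EINTR instead of waiting forever:

-------------------------------------------
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void on_alarm(int sig) { (void)sig; }  /* only purpose: interrupt the syscall */

/* Read with a deadline: install a SIGALRM handler *without* SA_RESTART,
 * so a read(2) blocked inside the Lustre client fails with EINTR when
 * the alarm fires, instead of waiting forever. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, unsigned secs)
{
    struct sigaction sa;
    sa.sa_handler = on_alarm;
    sa.sa_flags = 0;                  /* no SA_RESTART */
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, NULL);

    alarm(secs);                      /* arm the deadline */
    ssize_t n = read(fd, buf, len);   /* may block until data or SIGALRM */
    int saved_errno = errno;
    alarm(0);                         /* disarm */
    errno = saved_errno;

    if (n < 0 && errno == EINTR)
        fprintf(stderr, "read(2) aborted after %us\n", secs);
    return n;
}
-------------------------------------------

Without the alarm, that read(2) just sits there, exactly as you describe.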
But this is not what our users observe. Even on an otherwise perfectly working system, they report I/O errors on access to some files. This is distributed randomly over time, servers, clients and jobs, and the files are always there and undamaged when checked afterwards.

I can usually see something happening in the logs of OST and client: the OST starts with "timeout on bulk PUT after 6+0s", after which it first reports "ignoring bulk IO comm error" in the hope that the "client will retry". But then the OST says "Request x49132126 took longer than estimated (7+5s); client may timeout." Of course I might be falsely connecting error messages that are independent of each other.

The client's log is easier to read: "Request ... has timed out (limit 7s)", then "Connection to service was lost; in progress operations using this service will fail", and finally "Connection restored to service". I have appended the full lines below. Anyhow, in these minutes the user's job reports "SysError: error reading from file..." - and aborts.

Granted, the network connection between client and OSS is known to be buggy, but interruptions should only be on the order of seconds. So if the client were just blocking the application's I/O, it should get over these smoothly, shouldn't it? The restoration of the connection happens one second after the first error message on the client, so both sides can see each other again immediately.
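If these really are only second-long hiccups, a job could in principle ride them out by retrying instead of aborting - again just a hypothetical sketch of mine (the error set and backoff numbers are invented), not what our users' jobs actually do:

-------------------------------------------
#include <errno.h>
#include <unistd.h>

/* Retry a read that fails with a (possibly transient) EIO or EINTR,
 * backing off 1s, 2s, 4s, ... between attempts.  A failed read(2) does
 * not advance the file offset, so simply calling it again is safe. */
ssize_t read_retrying(int fd, void *buf, size_t len, int attempts)
{
    for (int i = 0; i < attempts; i++) {
        ssize_t n = read(fd, buf, len);
        if (n >= 0)
            return n;                  /* success */
        if (errno != EIO && errno != EINTR)
            return -1;                 /* not a transient error */
        sleep(1u << i);                /* wait out the interruption */
    }
    errno = EIO;
    return -1;
}
-------------------------------------------

But the question is whether that should even be necessary if the client is supposed to block and retry on its own. Here are the full log lines: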
-------------------------------------------
-------------------------------------------
kern.log on the OSS:
-------------------------------------------
Mar 6 11:23:49 OSS100 kernel: [14066606.171264] LustreError: 15145:0:(ost_handler.c:868:ost_brw_read()) @@@ timeout on bulk PUT after 6+0s r...@ffff81021fa8d938 x93767754/t0 o3->a4affbd5-7a9c-218a-b01f-5d9a73e6e...@net_0x200008cb56e9e_uuid:0/0 lens 384/336 e 0 to 0 dl 1236335029 ref 1 fl Interpret:/0/0 rc 0/0
Mar 6 11:23:50 OSS100 kernel: [14066607.615287] Lustre: 14771:0:(ldlm_lib.c:525:target_handle_reconnect()) gsilust-OST0027: 1011add7-13d1-0efc-65e3-14f94eab88c4 reconnecting
Mar 6 11:23:50 OSS100 kernel: [14066607.616000] Lustre: 15112:0:(ost_handler.c:1270:ost_brw_write()) gsilust-OST0027: ignoring bulk IO comm error with 6bb22b4d-3509-6c5d-ab95-85377cf88...@net_0x200008cb56a8a_uuid id 12345-clien...@tcp - client will retry
Mar 6 11:23:50 OSS100 kernel: [14066607.616105] Lustre: 15112:0:(service.c:1064:ptlrpc_server_handle_request()) @@@ Request x98164002 took longer than estimated (7+4s); client may timeout. r...@ffff8101557cbad0 x98164002/t0 o4->6bb22b4d-3509-6c5d-ab95-85377cf88...@net_0x200008cb56a8a_uuid:0/0 lens 384/352 e 0 to 0 dl 1236335026 ref 1 fl Complete:/0/0 rc 0/0
Mar 6 11:23:50 OSS100 kernel: [14066607.616492] LustreError: 21161:0:(service.c:568:ptlrpc_check_req()) @@@ DROPPING req from old connection 1 < 2 r...@ffff81015fc066e0 x520348/t0 o13->gsilust-mdtlov_u...@net_0x200008cb57036_uuid:0/0 lens 128/0 e 0 to 0 dl 0 ref 1 fl New:/0/0 rc 0/0
Mar 6 11:23:50 OSS100 kernel: [14066607.616591] Lustre: 15145:0:(ost_handler.c:925:ost_brw_read()) gsilust-OST0027: ignoring bulk IO comm error with a4affbd5-7a9c-218a-b01f-5d9a73e6e...@net_0x200008cb56e9e_uuid id 12345-clien...@tcp - client will retry
Mar 6 11:23:50 OSS100 kernel: [14066607.623479] Lustre: 15094:0:(service.c:1064:ptlrpc_server_handle_request()) @@@ Request x49132126 took longer than estimated (7+5s); client may timeout. r...@ffff810152edd2b0 x49132126/t0 o3->fe639162-9537-9ab5-36f5-fba5c0a43...@net_0x200008cb572f6_uuid:0/0 lens 384/336 e 0 to 0 dl 1236335025 ref 1 fl Complete:/0/0 rc 0/0
Mar 6 11:24:24 OSS100 kernel: [14066641.871733] Lustre: 14753:0:(ldlm_lib.c:525:target_handle_reconnect()) gsilust-OST0027: a4affbd5-7a9c-218a-b01f-5d9a73e6e3c3 reconnecting
-------------------------------------------
-------------------------------------------
kern.log on CLIENT_A:
-------------------------------------------
Mar 6 11:24:24 CLIENT_A kernel: Lustre: Request x93767754 sent from gsilust-OST0027-osc-ffff810808e68c00 to NID oss...@tcp 41s ago has timed out (limit 7s).
Mar 6 11:24:24 CLIENT_A kernel: LustreError: 166-1: gsilust-OST0027-osc-ffff810808e68c00: Connection to service gsilust-OST0027 via nid oss...@tcp was lost; in progress operations using this service will fail.
Mar 6 11:24:24 CLIENT_A kernel: LustreError: 167-0: This client was evicted by gsilust-OST0027; in progress operations using this service will fail.
Mar 6 11:24:25 CLIENT_A kernel: Lustre: gsilust-OST0027-osc-ffff810808e68c00: Connection restored to service gsilust-OST0027 using nid oss...@tcp.
--------------------------------------------------------------------------

Regards,
Thomas
