Hi,

I have a small problem, which is probably down to my limited knowledge of the subject. I have a Lustre file system with one MGS/MDS node, two OSS nodes and one client.
I start copying a large file onto Lustre and, while the copy is in progress,
I restart the OSS node that is handling the writes to the file system.
The copy process goes into the -stalled- state. When the OSS node comes back up,
I expected the copy process to resume normally, but instead it crashes.
This is the log from the MGS node:

May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c: 1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230433 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 17s ago has timed out (17s prior to deadline).
May 19 13:43:43 mdt01prdpom kernel: r...@ffff81012e11e400 x1336168048230433/t0 o400->[email protected]@tcp: 28/4 lens 192/384 e 0 to 1 dl 1274269423 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 19 13:44:09 mdt01prdpom kernel: Lustre: 3828:0:(client.c: 1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230435 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 26s ago has timed out (26s prior to deadline).
May 19 13:44:09 mdt01prdpom kernel: r...@ffff81012e5f2000 x1336168048230435/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269449 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c: 517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 2s
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c: 2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(events.c: 66:request_out_callback()) @@@ type 4, status -113 r...@ffff81012d3e5800 x1336168048230437/t0 o8->[email protected] @tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3828:0:(client.c: 1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230437 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has failed due to network error (27s prior to deadline).
May 19 13:44:37 mdt01prdpom kernel: r...@ffff81012d3e5800 x1336168048230437/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3829:0:(import.c: 517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 3s
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c: 2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(events.c: 66:request_out_callback()) @@@ type 4, status -113 r...@ffff81012e11e400 x1336168048230441/t0 o8->[email protected] @tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3828:0:(client.c: 1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230441 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has failed due to network error (28s prior to deadline).
May 19 13:45:33 mdt01prdpom kernel: r...@ffff81012e11e400 x1336168048230441/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:46:31 mdt01prdpom kernel: Lustre: 3829:0:(import.c: 517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 4s
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 19 13:46:31 mdt01prdpom kernel: Lustre: 4099:0:(quota_master.c: 1716:mds_quota_recovery()) Only 0/2 OSTs are active, abort quota recovery
May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100....@tcp.
May 19 13:46:31 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans
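
To make the sequence in a log like this easier to follow, here is a minimal Python sketch that condenses syslog-style Lustre kernel messages into a short timeline. It is only an illustration: the names `summarize`, `PATTERNS` and `ERRNO_NAMES` are made up for this example, the patterns cover only the message shapes that appear in the excerpt above, and the errno table assumes Linux numbering (where 113 is EHOSTUNREACH).

```python
import re

# Linux errno values seen in this excerpt (assumption: Linux numbering).
ERRNO_NAMES = {113: "EHOSTUNREACH (no route to host)"}

# Syslog prefix: "May 19 13:43:43 <host> kernel: <message>"
PREFIX = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+kernel:\s+(.*)$")

# Hypothetical event categories, limited to messages present in the excerpt.
PATTERNS = [
    ("timeout", re.compile(r"has timed out")),
    ("network_error", re.compile(r"Error sending PUT .*: -(\d+)")),
    ("connection_lost", re.compile(r"Connection to service \S+ via nid \S+ was lost")),
    ("evicted", re.compile(r"This client was evicted by (\S+);")),
    ("restored", re.compile(r"Connection restored to service (\S+)")),
]


def summarize(lines):
    """Return a list of (timestamp, kind, detail) tuples for recognized lines."""
    events = []
    for line in lines:
        m = PREFIX.match(line.strip())
        if not m:
            continue  # not a kernel syslog line
        when, _host, msg = m.groups()
        for kind, pattern in PATTERNS:
            hit = pattern.search(msg)
            if not hit:
                continue
            detail = ""
            if kind == "network_error":
                code = int(hit.group(1))
                detail = ERRNO_NAMES.get(code, f"errno {code}")
            elif kind in ("evicted", "restored"):
                detail = hit.group(1)  # the OST service name
            events.append((when, kind, detail))
            break  # first matching category wins
    return events


if __name__ == "__main__":
    import sys

    # Usage: python summarize_lustre_log.py < /var/log/messages
    for when, kind, detail in summarize(sys.stdin):
        print(when, kind, detail)
```

Read this way, the excerpt shows the requests timing out, the reconnect attempts failing with -113 while the OSS is down, and then the eviction message at 13:46:31 in the same second as the connection being restored.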

Is this a timeout problem?
How can I change the timeout?

Thanks!



Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma

cell. 3466147165
tel.  0657060500
email:[email protected]

