More important is to include the crash message from the client and the
version of Lustre you are using.
Cheers, Andreas
On 2010-05-19, at 6:34, Stefano Elmopi <[email protected]>
wrote:
Hi,
I have a small problem, but it is certainly due to my limited
knowledge of the subject.
I have a Lustre file system with one MGS/MDS node, two OSS nodes and
one client.
I start copying a large file to Lustre and, while the copy is in progress,
I restart the OSS node that is handling the writes to the file
system.
The copy process goes into the -stalled- state, and when the OSS node
is back up
I expect the copy to resume normally, but instead it crashes.
This is the log from the MGS node:
May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230433 sent
from lustre01-OST0000-osc to NID 172.16.100....@tcp 17s ago has
timed out (17s prior to deadline).
May 19 13:43:43 mdt01prdpom kernel: r...@ffff81012e11e400
x1336168048230433/t0 o400->[email protected]@tcp:
28/4 lens 192/384 e 0 to 1 dl 1274269423 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection to service lustre01-OST0000 via nid 172.16.100....@tcp
was lost; in progress operations using this service will wait for
recovery to complete.
May 19 13:44:09 mdt01prdpom kernel: Lustre: 3828:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230435 sent
from lustre01-OST0000-osc to NID 172.16.100....@tcp 26s ago has
timed out (26s prior to deadline).
May 19 13:44:09 mdt01prdpom kernel: r...@ffff81012e5f2000
x1336168048230435/t0 o8->[email protected]@tcp:
28/4 lens 368/584 e 0 to 1 dl 1274269449 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc: tried all
connections, increasing latency to 2s
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
r...@ffff81012d3e5800 x1336168048230437/t0 o8->[email protected]
@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 2 fl Rpc:N/0/0 rc
0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3828:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230437 sent
from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has
failed due to network error (27s prior to deadline).
May 19 13:44:37 mdt01prdpom kernel: r...@ffff81012d3e5800
x1336168048230437/t0 o8->[email protected]@tcp:
28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3829:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc: tried all
connections, increasing latency to 3s
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:
2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(events.c:
66:request_out_callback()) @@@ type 4, status -113
r...@ffff81012e11e400 x1336168048230441/t0 o8->[email protected]
@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 2 fl Rpc:N/0/0 rc
0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3828:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230441 sent
from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has
failed due to network error (28s prior to deadline).
May 19 13:45:33 mdt01prdpom kernel: r...@ffff81012e11e400
x1336168048230441/t0 o8->[email protected]@tcp:
28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:46:31 mdt01prdpom kernel: Lustre: 3829:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc: tried all
connections, increasing latency to 4s
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
May 19 13:46:31 mdt01prdpom kernel: Lustre: 4099:0:(quota_master.c:
1716:mds_quota_recovery()) Only 0/2 OSTs are active, abort quota
recovery
May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc:
Connection restored to service lustre01-OST0000 using nid 172.16.100.121
@tcp.
May 19 13:46:31 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000:
lustre01-OST0000_UUID now active, resetting orphans
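To make the timeline in the log above easier to read, here is a minimal sketch that pulls out the three key events (connection lost, eviction, connection restored) and prints the interval between the first and the last. It is only an illustration: the three log lines are copied from the log above and unwrapped, the helper name parse_events and the marker strings are my own choices, and the year 2010 is assumed from the date of this message because syslog timestamps do not carry a year.

```python
import re
from datetime import datetime

# Three lines copied from the MGS log above, unwrapped to one message per line.
# The elided NID ("172.16.100....@tcp") is left exactly as it appears in the log.
LOG = """\
May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100.121@tcp.
"""

# Hypothetical event names mapped to substrings that identify them in the log.
EVENTS = {
    "lost": "was lost",
    "evicted": "was evicted by",
    "restored": "Connection restored",
}


def parse_events(log, year=2010):
    """Return the first timestamp seen for each event in EVENTS."""
    found = {}
    for line in log.splitlines():
        # Syslog prefix: month abbreviation, day, HH:MM:SS.
        m = re.match(r"(\w{3} +\d+ \d{2}:\d{2}:\d{2}) ", line)
        if not m:
            continue
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        for name, marker in EVENTS.items():
            if marker in line and name not in found:
                found[name] = stamp
    return found


events = parse_events(LOG)
outage = (events["restored"] - events["lost"]).total_seconds()
print(f"lost -> restored: {outage:.0f}s")
```

With the timestamps above, the eviction and the restored connection are logged in the same second, a little under three minutes after the connection was reported lost.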
Is this a timeout problem?
How can I change the timeout?
Thanks!
Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma
cell. 3466147165
tel. 0657060500
email:[email protected]
"Pursuant to the law on the protection of personal privacy
(Legislative Decree no. 196/2003), this e-mail is intended solely for the
persons indicated above, and the information it contains is to be
considered strictly confidential. Reading, copying, using or disclosing
the contents of this e-mail without authorization is prohibited. If you
have received this message in error, please return it to the sender.
Thank you."
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss