Hi Andreas,
My version of Lustre is 1.8.3.
Sorry for my bad English, I used the wrong word: "crash" is not the
right word.
Let me try to explain better. I start copying a large file onto the
file system and, while the copy is in progress, I reboot the OSS server,
and the copy process enters the "- stalled -" state.
I expected that, once the server was back online, the copy process would
resume normally and complete the copy of the file; instead, the copy
process fails.
So it is only the copy process that goes wrong; Lustre itself continues
to work well.
Is the failure of the copy process a timeout issue?
How can I change the timeout?
Thanks!
Cheers, Stefano
Ing. Stefano Elmopi
Gruppo Darco - Resp. ICT Sistemi
Via Ostiense 131/L Corpo B, 00154 Roma
cell. 3466147165
tel. 0657060500
email:[email protected]
"Pursuant to the law on the protection of personal privacy
(Legislative Decree no. 196/2003), this e-mail is intended solely for
the persons indicated above, and the information it contains is to be
considered strictly confidential. Reading, copying, using, or
disseminating the contents of this e-mail without authorization is
prohibited. If you have received this message in error, please return
it to the sender. Thank you."
On 19 May 2010, at 17:07, Andreas Dilger wrote:
More important is to include the crash message from the client and
the version of Lustre you are using.
Cheers, Andreas
On 2010-05-19, at 6:34, Stefano Elmopi <[email protected]>
wrote:
Hi,
I have a small problem, but it is surely due to my limited knowledge
of the subject.
I have a Lustre file system with one MGS/MDS node, two OSS nodes, and
one client.
I start a copy of a large file to Lustre and, while the copy is in
progress, I restart the OSS node that is handling the writes to the
file system.
The copy process goes into the -stalled- state, and when the OSS node
is back up, I expected the copy process to resume normally, but
instead it crashes.
This is the log on the MGS node:
May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230433 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 17s ago has timed out (17s prior to deadline).
May 19 13:43:43 mdt01prdpom kernel: r...@ffff81012e11e400 x1336168048230433/t0 o400->[email protected]@tcp:28/4 lens 192/384 e 0 to 1 dl 1274269423 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:43:43 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection to service lustre01-OST0000 via nid 172.16.100....@tcp was lost; in progress operations using this service will wait for recovery to complete.
May 19 13:44:09 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230435 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 26s ago has timed out (26s prior to deadline).
May 19 13:44:09 mdt01prdpom kernel: r...@ffff81012e5f2000 x1336168048230435/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269449 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 2s
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 19 13:44:37 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 r...@ffff81012d3e5800 x1336168048230437/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230437 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has failed due to network error (27s prior to deadline).
May 19 13:44:37 mdt01prdpom kernel: r...@ffff81012d3e5800 x1336168048230437/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269504 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 3s
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(lib-move.c:2441:LNetPut()) Error sending PUT to 12345-172.16.100....@tcp: -113
May 19 13:45:33 mdt01prdpom kernel: LustreError: 3828:0:(events.c:66:request_out_callback()) @@@ type 4, status -113 r...@ffff81012e11e400 x1336168048230441/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 2 fl Rpc:N/0/0 rc 0/0
May 19 13:45:33 mdt01prdpom kernel: Lustre: 3828:0:(client.c:1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230441 sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 0s ago has failed due to network error (28s prior to deadline).
May 19 13:45:33 mdt01prdpom kernel: r...@ffff81012e11e400 x1336168048230441/t0 o8->[email protected]@tcp:28/4 lens 368/584 e 0 to 1 dl 1274269561 ref 1 fl Rpc:N/0/0 rc 0/0
May 19 13:46:31 mdt01prdpom kernel: Lustre: 3829:0:(import.c:517:import_select_connection()) lustre01-OST0000-osc: tried all connections, increasing latency to 4s
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client was evicted by lustre01-OST0000; in progress operations using this service will fail.
May 19 13:46:31 mdt01prdpom kernel: Lustre: 4099:0:(quota_master.c:1716:mds_quota_recovery()) Only 0/2 OSTs are active, abort quota recovery
May 19 13:46:31 mdt01prdpom kernel: Lustre: lustre01-OST0000-osc: Connection restored to service lustre01-OST0000 using nid 172.16.100....@tcp.
May 19 13:46:31 mdt01prdpom kernel: Lustre: MDS lustre01-MDT0000: lustre01-OST0000_UUID now active, resetting orphans
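To make a log like the one above easier to read, a small script can pull out the expired requests, the latency increases, and the eviction. What follows is only a rough sketch in Python: the function name summarize and the regular expressions are assumptions written against the messages quoted here, they are not part of Lustre, and other Lustre versions may word these messages differently. The sample text is copied from three of the entries above, with the elided NID left as it appears.

```python
import re

# Hypothetical patterns, written only against the messages quoted in this thread.
EXPIRED = re.compile(
    r"Request (x\d+) sent from (\S+) to NID \S+ (\d+)s ago "
    r"has (timed out|failed due to network error)"
)
LATENCY = re.compile(r"increasing latency to (\d+)s")
EVICTED = re.compile(r"This client was evicted by ([^;\s]+)")


def summarize(log_text):
    """Return expired requests, latency steps, and evicting targets found in log_text."""
    # Mail clients wrap long syslog lines, so collapse all whitespace first.
    flat = re.sub(r"\s+", " ", log_text)
    expired = [
        (m.group(1), m.group(2), int(m.group(3)), m.group(4))
        for m in EXPIRED.finditer(flat)
    ]
    latency_steps = [int(m.group(1)) for m in LATENCY.finditer(flat)]
    evicted_by = [m.group(1) for m in EVICTED.finditer(flat)]
    return {
        "expired": expired,
        "latency_steps": latency_steps,
        "evicted_by": evicted_by,
    }


# Three entries copied from the log above, still wrapped as in the mail.
SAMPLE = """\
May 19 13:43:43 mdt01prdpom kernel: Lustre: 3827:0:(client.c:
1463:ptlrpc_expire_one_request()) @@@ Request x1336168048230433
sent from lustre01-OST0000-osc to NID 172.16.100....@tcp 17s ago
has timed out (17s prior to deadline).
May 19 13:44:37 mdt01prdpom kernel: Lustre: 3829:0:(import.c:
517:import_select_connection()) lustre01-OST0000-osc: tried all
connections, increasing latency to 2s
May 19 13:46:31 mdt01prdpom kernel: LustreError: 167-0: This client
was evicted by lustre01-OST0000; in progress operations using this
service will fail.
"""

print(summarize(SAMPLE))
```

Run against the full log, this would list each expired request with its age in seconds, which helps show how long the node waited before the eviction was reported.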
Is this a timeout problem?
How can I change the timeout?
Thanks!
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss