Re: [Lustre-discuss] How To change server recovery timeout

Wojciech Turek Wed, 07 Nov 2007 10:46:48 -0800

Hi Cliff,

On 7 Nov 2007, at 17:58, Cliff White wrote:

Wojciech Turek wrote:

Hi,
Our lustre environment is:
2.6.9-55.0.9.EL_lustre.1.6.3smp

I would like to change the recovery timeout from the default value of 250s to something longer.

I tried the example from the manual:
set_timeout <secs> Sets the timeout (obd_timeout) for a server
to wait before failing recovery.

We performed that experiment on our test Lustre installation with one OST.

storage02 is our OSS
[EMAIL PROTECTED] ~]# lctl dl
  0 UP mgc [EMAIL PROTECTED] 31259d9b-e655-cdc4-c760-45d3df426d86 5
  1 UP ost OSS OSS_uuid 3
  2 UP obdfilter home-md-OST0001 home-md-OST0001_UUID 7
[EMAIL PROTECTED] ~]# lctl --device 2 set_timeout 600
set_timeout has been deprecated. Use conf_param instead.
e.g. conf_param lustre-MDT0000 obd_timeout=50
usage: conf_param obd_timeout=<secs>
run <command> after connecting to device <devno>
--device <devno> <command [args ...]>
[EMAIL PROTECTED] ~]# lctl --device 1 conf_param obd_timeout=600
No device found for name MGS: Invalid argument
error: conf_param: No such device

It looks like I need to run this command from the MGS node, so I then moved to the MGS server, called storage03.

[EMAIL PROTECTED] ~]# lctl dl
  0 UP mgs MGS MGS 9
  1 UP mgc [EMAIL PROTECTED] f51a910b-a08e-4be6-5ada-b602a5ca9ab3 5
  2 UP mdt MDS MDS_uuid 3
  3 UP lov home-md-mdtlov home-md-mdtlov_UUID 4
  4 UP mds home-md-MDT0000 home-md-MDT0000_UUID 5
  5 UP osc home-md-OST0001-osc home-md-mdtlov_UUID 5
[EMAIL PROTECTED] ~]# lctl device 5
[EMAIL PROTECTED] ~]# lctl conf_param obd_timeout=600
error: conf_param: Function not implemented
[EMAIL PROTECTED] ~]# lctl --device 5 conf_param obd_timeout=600
error: conf_param: Function not implemented
[EMAIL PROTECTED] ~]# lctl help conf_param

conf_param: set a permanent config param. This command must be run on the MGS node

usage: conf_param <target.keyword=val> ...
[EMAIL PROTECTED] ~]# lctl conf_param home-md-MDT0000.obd_timeout=600
error: conf_param: Invalid argument
[EMAIL PROTECTED] ~]#
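
Since conf_param has to be run on the MGS node, it can help to check the `lctl dl` listing for a device of type mgs before trying it. The following is a minimal sketch, assuming the column layout seen in the listings above (index, state, type, name, UUID, reference count). The name has_mgs is a hypothetical helper, not part of lctl.

```shell
#!/bin/sh
# has_mgs: read `lctl dl` output on stdin and succeed (exit 0) if any
# listed device has type "mgs" in the third column.
# Hypothetical helper; assumes the column layout shown in this thread.
has_mgs() {
    awk '$3 == "mgs" { found = 1 } END { exit(found ? 0 : 1) }'
}
```

Usage would look like `lctl dl | has_mgs && echo "MGS is on this node"`. The storage02 listing above has no mgs line, so the check would fail there, while the storage03 listing would pass.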

I searched the whole of /proc/*/lustre for a file that could store this timeout value, but found nothing.

Could someone advise how to change the value of the recovery timeout?
Cheers,
Wojciech Turek


It looks like your file system is named 'home' - you can confirm with
tunefs.lustre --print <MDS device> | grep "Lustre FS"

The correct command (Run on the MGS) would be
# lctl conf_param home.sys.timeout=<val>

Example:
[EMAIL PROTECTED] ~]# tunefs.lustre --print /dev/sdb |grep "Lustre FS"
Lustre FS:  lustre
[EMAIL PROTECTED] ~]# cat /proc/sys/lustre/timeout
130
[EMAIL PROTECTED] ~]# lctl conf_param lustre.sys.timeout=150
[EMAIL PROTECTED] ~]# cat /proc/sys/lustre/timeout
150
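
The name lookup in this example can be wrapped in a small filter, so that the conf_param argument is built from the name stored on disk rather than guessed from device names. This is only a sketch: lustre_fsname is a hypothetical helper, and it assumes the "Lustre FS:" line format shown above. If the line is printed more than once, duplicates are collapsed.

```shell
#!/bin/sh
# lustre_fsname: read `tunefs.lustre --print` output on stdin and print
# the file system name(s) found on "Lustre FS:" lines, one per line,
# with duplicates removed.
# Hypothetical helper; assumes the "Lustre FS:  <name>" format above.
lustre_fsname() {
    awk -F': *' '/^Lustre FS:/ { print $2 }' | sort -u
}
```

For example, `tunefs.lustre --print /dev/sdb | lustre_fsname` would print `lustre` for the output shown above.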

Thanks for your email. I am afraid your tips aren't very helpful in this case. As stated in the subject, I am asking about the recovery timeout. You can find it, for example, in /proc/fs/lustre/obdfilter/<OST>/recovery_status while one of your OSTs is in the recovery state. By default this timeout is 250s. You, on the other hand, are talking about the system obd timeout (according to the CFS documentation, chapter 4.1.2), which is not my concern here.

Anyway, I tried your example just to see if it works, and again I am afraid it doesn't work for me; see below:

I have a combined MGS and MDS configuration.
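
To see the recovery information for every OST on an OSS at once, the recovery_status files under the path mentioned above can be printed in a loop. The following is a rough sketch: show_recovery_status is a hypothetical helper, and the base directory is a parameter only so that the loop can be tried on a machine without Lustre. It makes no assumption about the contents of the files and simply prints them.

```shell
#!/bin/sh
# show_recovery_status [BASE]: print each target's recovery_status file
# under BASE, preceded by a header naming the target directory.
# BASE defaults to /proc/fs/lustre/obdfilter (the path from this thread).
# Hypothetical helper, not a Lustre tool.
show_recovery_status() {
    base="${1:-/proc/fs/lustre/obdfilter}"
    for f in "$base"/*/recovery_status; do
        # Skip when the glob matched nothing or the file is unreadable.
        [ -r "$f" ] || continue
        ost=$(basename "$(dirname "$f")")
        printf '== %s ==\n' "$ost"
        cat "$f"
    done
}
```

Run without arguments on the OSS, this would print one section per OST, which makes it easy to spot the timeout value while a target is in recovery.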

[EMAIL PROTECTED] ~]# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             10317828   3452824   6340888  36% /
/dev/sda6              7605856     49788   7169708   1% /local
/dev/sda3              4127108     41000   3876460   2% /tmp
/dev/sda2              4127108    753668   3163792  20% /var
/dev/dm-2            1845747840 447502120 1398245720  25% /mnt/sdb
/dev/dm-1            6140723200 4632947344 1507775856  76% /mnt/sdc
/dev/dm-3            286696376   1461588 268850900   1% /mnt/home-md/mdt
[EMAIL PROTECTED] ~]# tunefs.lustre --print /dev/dm-3 |grep "Lustre FS"
Lustre FS:  home-md
Lustre FS:  home-md
[EMAIL PROTECTED] ~]# cat /proc/sys/lustre/timeout
100
[EMAIL PROTECTED] ~]# lctl conf_param home-md.sys.timeout=150
error: conf_param: Invalid argument
[EMAIL PROTECTED] ~]#

Cheers,

Wojciech Turek


cliffw

------------------------------------------------------------------------
_______________________________________________
Lustre-discuss mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-discuss


Mr Wojciech Turek
Assistant System Manager
University of Cambridge
High Performance Computing service
email: [EMAIL PROTECTED]
tel. +441223763517
