Hello,

In 2018 we deployed a Lustre 2.10.5 system with 20 OSTs on two OSSes, backed by four JBODs, each with 24 disks of 12 TB, for a total of nearly 1 PB. Since then we have had power failures and failed RAID controller cards, each of which forced us to adjust the configuration. After the last failure, the system keeps sending error messages about OSTs that are no longer in the system.

On the MDS I run:
  # lctl dl

and I get the 20 currently active OSTs:

  oss01.lanot.unam.mx - OST00 /dev/disk/by-label/lustre-OST0000
  oss01.lanot.unam.mx - OST01 /dev/disk/by-label/lustre-OST0001
  oss01.lanot.unam.mx - OST02 /dev/disk/by-label/lustre-OST0002
  oss01.lanot.unam.mx - OST03 /dev/disk/by-label/lustre-OST0003
  oss01.lanot.unam.mx - OST04 /dev/disk/by-label/lustre-OST0004
  oss01.lanot.unam.mx - OST05 /dev/disk/by-label/lustre-OST0005
  oss01.lanot.unam.mx - OST06 /dev/disk/by-label/lustre-OST0006
  oss01.lanot.unam.mx - OST07 /dev/disk/by-label/lustre-OST0007
  oss01.lanot.unam.mx - OST08 /dev/disk/by-label/lustre-OST0008
  oss01.lanot.unam.mx - OST09 /dev/disk/by-label/lustre-OST0009
  oss02.lanot.unam.mx - OST15 /dev/disk/by-label/lustre-OST000f
  oss02.lanot.unam.mx - OST16 /dev/disk/by-label/lustre-OST0010
  oss02.lanot.unam.mx - OST17 /dev/disk/by-label/lustre-OST0011
  oss02.lanot.unam.mx - OST18 /dev/disk/by-label/lustre-OST0012
  oss02.lanot.unam.mx - OST19 /dev/disk/by-label/lustre-OST0013
  oss02.lanot.unam.mx - OST25 /dev/disk/by-label/lustre-OST0019
  oss02.lanot.unam.mx - OST26 /dev/disk/by-label/lustre-OST001a
  oss02.lanot.unam.mx - OST27 /dev/disk/by-label/lustre-OST001b
  oss02.lanot.unam.mx - OST28 /dev/disk/by-label/lustre-OST001c
  oss02.lanot.unam.mx - OST29 /dev/disk/by-label/lustre-OST001d

But I also get 5 that are not active and in fact no longer exist:

   28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
   29 UP osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
   30 UP osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
   31 UP osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
   32 UP osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4

When I try to eliminate them with

  lctl conf_param -P osp.lustre-OST0015-osc-MDT0000.active=0

I get the error:

  conf_param: invalid option -- 'P'
  set a permanent config parameter.
  This command must be run on the MGS node
  usage: conf_param [-d] <target.keyword=val>
         -d  Remove the permanent setting.

If I run

  lctl --device 28 deactivate

I get no error, but nothing changes.

What can I do?
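Based on the usage message, I suspect the -P flag belongs to set_param rather than conf_param. Would one of the following, run on the MGS, be the right way to mark those OSTs permanently inactive? This is just my guess from the usage text, and the target syntax may well be wrong:

  # conf_param without -P, as the usage message suggests
  lctl conf_param osp.lustre-OST0015-osc-MDT0000.active=0

  # or set_param with -P, which does accept that flag for permanent settings
  lctl set_param -P osp.lustre-OST0015-osc-MDT0000.active=0

Or do stale entries like these only disappear after a full writeconf (tunefs.lustre --writeconf on the MDT and all OSTs)?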
Thank you in advance for any help.

--
Alejandro Aguilar Sierra
LANOT, ICAyCC, UNAM