Following https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.remove_ost, today I did the following, with apparently no difference:
# lctl set_param osp.lustre-OST0018-osc-MDT0000.max_create_count=0

but I also did

# lctl --device 30 deactivate

and now the 10 zombie OSTs appear as IN, not UP:

# lctl dl | grep OST | grep IN
 18 IN osp lustre-OST000a-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 19 IN osp lustre-OST000b-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 20 IN osp lustre-OST000c-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 21 IN osp lustre-OST000d-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 22 IN osp lustre-OST000e-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 29 IN osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 30 IN osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 31 IN osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
 32 IN osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4

I also deactivated the OSTs on the client with

# lctl set_param osc.lustre-OST000b-*.active=0
osc.lustre-OST000b-osc-ffff979fbcc8b800.active=0

but I still get errors for them on the client:

# lfs check osts | grep error
lfs check: error: check 'lustre-OST000a-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000b-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000c-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000d-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST000e-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST0014-osc-ffff979fbcc8b800': Cannot allocate memory (12)
lfs check: error: check 'lustre-OST0015-osc-ffff979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0016-osc-ffff979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0017-osc-ffff979fbcc8b800': Resource temporarily unavailable (11)
lfs check: error: check 'lustre-OST0018-osc-ffff979fbcc8b800': Resource temporarily unavailable (11)

I will keep reading the reference, but if you have any suggestions, I would appreciate them.
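For comparison with the steps above: in the manual section linked at the top, the deactivation that sticks is done on the MGS with conf_param (or, on newer releases, set_param -P), while lctl --device N deactivate on the MDS is a runtime-only change that does not survive a remount. A minimal sketch, assuming a combined MGS/MDT and using OST0018 as the example index (repeat for each stale OST); the mgs# prompt is illustrative:

mgs# lctl conf_param lustre-OST0018.osc.active=0
mgs# lctl set_param -P osp.lustre-OST0018-osc-MDT0000.active=0

Either form alone should be enough; note that -P is an option of set_param, not conf_param, which is exactly the usage error quoted further down.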
On Wed, Aug 9, 2023 at 11:08, Horn, Chris ([email protected]) wrote:
>
> The error message is stating that '-P' is not a valid option to the conf_param
> command. You may be thinking of lctl set_param -P …
>
> Did you follow the documented procedure for removing an OST from the
> filesystem when you "adjust[ed] the configuration"?
>
> https://build.whamcloud.com/job/lustre-manual/lastSuccessfulBuild/artifact/lustre_manual.xhtml#lustremaint.remove_ost
>
> Chris Horn
>
> From: lustre-discuss <[email protected]> on behalf of
> Alejandro Sierra via lustre-discuss <[email protected]>
> Date: Wednesday, August 9, 2023 at 11:55 AM
> To: Jeff Johnson <[email protected]>
> Cc: lustre-discuss <[email protected]>
> Subject: Re: [lustre-discuss] How to eliminate zombie OSTs
>
> Yes, it is.
>
> On Wed, Aug 9, 2023 at 10:49, Jeff Johnson ([email protected]) wrote:
> >
> > Alejandro,
> >
> > Is your MGS located on the same node as your primary MDT? (combined
> > MGS/MDT node)
> >
> > --Jeff
> >
> > On Wed, Aug 9, 2023 at 9:46 AM Alejandro Sierra via lustre-discuss
> > <[email protected]> wrote:
> >>
> >> Hello,
> >>
> >> In 2018 we implemented a Lustre 2.10.5 system with 20 OSTs on two OSSs
> >> with 4 JBODs, each box holding 24 disks of 12 TB, for a total of
> >> nearly 1 PB. Since then we have had power failures and failed RAID
> >> controller cards, all of which made us adjust the configuration. After
> >> the last failure, the system keeps sending error messages about OSTs
> >> that no longer exist in the system. On the MDS I run
> >>
> >> # lctl dl
> >>
> >> and I get the 20 currently active OSTs:
> >>
> >> oss01.lanot.unam.mx - OST00 /dev/disk/by-label/lustre-OST0000
> >> oss01.lanot.unam.mx - OST01 /dev/disk/by-label/lustre-OST0001
> >> oss01.lanot.unam.mx - OST02 /dev/disk/by-label/lustre-OST0002
> >> oss01.lanot.unam.mx - OST03 /dev/disk/by-label/lustre-OST0003
> >> oss01.lanot.unam.mx - OST04 /dev/disk/by-label/lustre-OST0004
> >> oss01.lanot.unam.mx - OST05 /dev/disk/by-label/lustre-OST0005
> >> oss01.lanot.unam.mx - OST06 /dev/disk/by-label/lustre-OST0006
> >> oss01.lanot.unam.mx - OST07 /dev/disk/by-label/lustre-OST0007
> >> oss01.lanot.unam.mx - OST08 /dev/disk/by-label/lustre-OST0008
> >> oss01.lanot.unam.mx - OST09 /dev/disk/by-label/lustre-OST0009
> >> oss02.lanot.unam.mx - OST15 /dev/disk/by-label/lustre-OST000f
> >> oss02.lanot.unam.mx - OST16 /dev/disk/by-label/lustre-OST0010
> >> oss02.lanot.unam.mx - OST17 /dev/disk/by-label/lustre-OST0011
> >> oss02.lanot.unam.mx - OST18 /dev/disk/by-label/lustre-OST0012
> >> oss02.lanot.unam.mx - OST19 /dev/disk/by-label/lustre-OST0013
> >> oss02.lanot.unam.mx - OST25 /dev/disk/by-label/lustre-OST0019
> >> oss02.lanot.unam.mx - OST26 /dev/disk/by-label/lustre-OST001a
> >> oss02.lanot.unam.mx - OST27 /dev/disk/by-label/lustre-OST001b
> >> oss02.lanot.unam.mx - OST28 /dev/disk/by-label/lustre-OST001c
> >> oss02.lanot.unam.mx - OST29 /dev/disk/by-label/lustre-OST001d
> >>
> >> but I also get 5 that are not currently active and in fact don't exist:
> >>
> >> 28 IN osp lustre-OST0014-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> >> 29 UP osp lustre-OST0015-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> >> 30 UP osp lustre-OST0016-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> >> 31 UP osp lustre-OST0017-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> >> 32 UP osp lustre-OST0018-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 4
> >>
> >> When I try to eliminate them with
> >>
> >> lctl conf_param -P osp.lustre-OST0015-osc-MDT0000.active=0
> >>
> >> I get the error
> >>
> >> conf_param: invalid option -- 'P'
> >> set a permanent config parameter.
> >> This command must be run on the MGS node
> >> usage: conf_param [-d] <target.keyword=val>
> >>   -d  Remove the permanent setting.
> >>
> >> If I do
> >>
> >> lctl --device 28 deactivate
> >>
> >> I don't get an error, but nothing changes.
> >>
> >> What can I do?
> >>
> >> Thank you in advance for any help.
> >>
> >> --
> >> Alejandro Aguilar Sierra
> >> LANOT, ICAyCC, UNAM
> >
> > --
> > ------------------------------
> > Jeff Johnson
> > Co-Founder
> > Aeon Computing
> >
> > [email protected]
> > http://www.aeoncomputing.com
> > t: 858-412-3810 x1001  f: 858-412-3845
> > m: 619-204-9061
> >
> > 4170 Morena Boulevard, Suite C - San Diego, CA 92117
> >
> > High-Performance Computing / Lustre Filesystems / Scale-out Storage
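A hedged aside on the lfs check errors above: once an OSC has been set inactive on a client, its state can be read directly, which is easier to interpret than the lfs check output. A minimal sketch; the OST index and the client# prompt are illustrative:

client# lctl get_param osc.lustre-OST000b-*.active
client# lctl get_param osc.lustre-OST000b-*.import

The import output includes a state: field showing whether the client is still trying to reconnect to the missing target.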

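Finally, if the removed OSTs are gone for good and the goal is to purge their records from the configuration entirely (so they stop reappearing in lctl dl), the manual describes regenerating the configuration logs with writeconf. A minimal sketch only, assuming the whole filesystem can be taken down first; the device paths are placeholders:

# stop all clients, then all OSTs, then the MDT
mds# tunefs.lustre --writeconf /dev/mdtdev
oss# tunefs.lustre --writeconf /dev/ostdev    # once per remaining OST
# remount the MDT first, then the OSTs, then the clients

writeconf erases and rebuilds the configuration logs on the next mount, but it also discards things like OST pool definitions and conf_param settings, so the caveats in the manual are worth reading before trying it.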