You might try to --writeconf your system. Is the data that 'tunefs.lustre' shows as 'Permanent disk data' now actually on disk? In other words, if you run 'tunefs.lustre --dryrun' on that OST, does it now show 'lustre-OST000f'?
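A quick aside on the ffff (my reading, not something from the tunefs output): the hex suffix in the target name is the OST index, and 0xffff is simply 65535, which matches the [OST:65535] that 'lfs df' reports. It looks like an "index not yet committed" placeholder rather than a real index:

```shell
# The hex in the target name is the OST index; 0xffff is the largest
# 16-bit value, which is why 'lfs df' reports /lustre[OST:65535].
printf 'OSTffff -> index %d\n' 0xffff   # 65535
printf 'OST000f -> index %d\n' 0x000f   # 15 (the index passed to mkfs.lustre)
```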
AFAIK, such a change in parameters can only be propagated to all servers by unmounting everything, running 'tunefs.lustre --writeconf ...' on all targets and then restarting. However, this is just a rough guess. For your clients to lose Lustre like this, something ultra-weird must be going on on the MDS with that ffff. Hmm, did you try to deactivate OST000f, too?

Regards,
Thomas

Btw, our 2.5.3 MDS also shows deactivated OSTs as 'UP' in the 'lctl dl' listing.

On 05/23/2018 03:04 PM, Torsten Harenberg wrote:
> Dear all,
>
> we are running a Lustre 2.5.3 installation for a couple of years
> already. The devices come from the 3PAR SAN appliance.
>
> Our users asked us to enlarge the available disk space, so we exported
> two new LUNs to the OST servers.
>
> File systems have been created with:
>
> mkfs.lustre --fsname=lustre --ost --index 15 --backfstype=ldiskfs
> --failnode=<IP>@tcp --mgsnode=<IP>@tcp
> --mgsnode=<IP>@tcp --verbose /dev/mapper/OST000F
>
> which went fine.
>
> However, after mounting, the file system appears as
>
> lustre-OSTffff_UUID 8585168804 35177704 8120481472 0% /lustre[OST:65535]
>
> in lfs df.
>
> And lfs df prints 65k+ lines with
>
> OSTfff5 : Resource temporarily unavailable
> OSTfff6 : Resource temporarily unavailable
> OSTfff7 : Resource temporarily unavailable
> OSTfff8 : Resource temporarily unavailable
> OSTfff9 : Resource temporarily unavailable
> OSTfffa : Resource temporarily unavailable
> OSTfffb : Resource temporarily unavailable
> OSTfffc : Resource temporarily unavailable
> OSTfffd : Resource temporarily unavailable
> OSTfffe : Resource temporarily unavailable
>
> in between.
>
> Searching for the root of this, we saw:
>
> ------
> [root@lustre4 ~]# tunefs.lustre /dev/mapper/OST000F
> checking for existing Lustre data: found
> Reading CONFIGS/mountdata
>
> Read previous values:
> Target: lustre-OSTffff
> Index: 15
> Lustre FS: lustre
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro
> Parameters: failover.node=<IP>@tcp
> mgsnode=<IP>@tcp mgsnode=<IP>@tcp
>
>
> Permanent disk data:
> Target: lustre-OST000f
> Index: 15
> Lustre FS: lustre
> Mount type: ldiskfs
> Flags: 0x2
> (OST )
> Persistent mount opts: errors=remount-ro
> Parameters: failover.node=<IP>@tcp
> mgsnode=<IP>@tcp mgsnode=<IP>@tcp
> ------
>
> No idea where the
>
> Read previous values:
> Target: lustre-OSTffff
>
> comes from.
>
> Now we were trying to free the OST immediately, which turns out to be
> more complicated than expected.
>
> We tried to follow the manual and issued on the MDS:
>
> [root@lustre1 ~]# lctl --device lustre-OSTffff-osc-MDT0000 deactivate
>
> But the device is still "UP":
>
> [root@lustre1 ~]# lctl dl
> 0 UP osd-ldiskfs lustre-MDT0000-osd lustre-MDT0000-osd_UUID 24
> 1 UP mgs MGS MGS 427
> 2 UP mgc MGC132.195.124.201@tcp 17eb290e-d0a6-2047-3250-84f893ebc47a 5
> 3 UP mds MDS MDS_uuid 3
> 4 UP lod lustre-MDT0000-mdtlov lustre-MDT0000-mdtlov_UUID 4
> 5 UP mdt lustre-MDT0000 lustre-MDT0000_UUID 455
> 6 UP mdd lustre-MDD0000 lustre-MDD0000_UUID 4
> 7 UP qmt lustre-QMT0000 lustre-QMT0000_UUID 4
> 8 UP osp lustre-OST0000-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 9 UP osp lustre-OST0001-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 10 UP osp lustre-OST0002-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 11 UP osp lustre-OST0003-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 12 UP osp lustre-OST0004-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 13 UP osp lustre-OST0005-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 14 UP osp lustre-OST0006-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 15 UP osp lustre-OST0007-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 16 UP osp lustre-OST0008-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 17 UP osp lustre-OST0009-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 18 UP osp lustre-OST000a-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 19 UP osp lustre-OST000b-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 20 UP osp lustre-OST000c-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 21 UP osp lustre-OST000d-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 22 UP osp lustre-OST000e-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 23 UP osp lustre-OSTffff-osc-MDT0000 lustre-MDT0000-mdtlov_UUID 5
> 24 UP lwp lustre-MDT0000-lwp-MDT0000 lustre-MDT0000-lwp-MDT0000_UUID 5
>
> We set it degraded on the OST:
>
> [root@lustre4 ~]# lctl get_param obdfilter.*.degraded
> obdfilter.lustre-OST0008.degraded=0
> obdfilter.lustre-OST0009.degraded=0
> obdfilter.lustre-OST000a.degraded=0
> obdfilter.lustre-OST000b.degraded=0
> obdfilter.lustre-OST000c.degraded=0
> obdfilter.lustre-OST000d.degraded=0
> obdfilter.lustre-OST000e.degraded=0
> obdfilter.lustre-OSTffff.degraded=1
>
> But still the file system usage grows:
>
> [root@wnfg001 ~]# lfs df /lustre | grep ffff
> lustre-OSTffff_UUID 8585168804 35159988 8120496592 0% /lustre[OST:65535]
> [root@wnfg001 ~]# lfs df /lustre | grep ffff
> lustre-OSTffff_UUID 8585168804 35177704 8120481472 0% /lustre[OST:65535]
>
> We could stop usage by setting it inactive on ALL (200+ in our case)
> clients with
>
> lctl set_param osc.lustre-OSTffff-*.active=0
>
> But then the file system becomes unusable for the users:
>
> -bash-4.1# touch /lustre/gridsoft/arc/session/LeENDmtfhfsnsBfJnpimw0EmABFKDmABFKDmxSGKDmABFKDmhbxd6n/qq2
> touch: setting times of `/lustre/gridsoft/arc/session/LeENDmtfhfsnsBfJnpimw0EmABFKDmABFKDmxSGKDmABFKDmhbxd6n/qq2': Cannot send after transport endpoint shutdown
>
> The same is true for "lctl --device XX deactivate".
>
> So we are looking for ways now to:
>
> 1.) set the OST read-only while keeping the file system usable
> 2.) then migrate what's on this OSTffff (we started an lfs find already,
> but it takes very long)
> 3.) remove the OST and start from scratch.
>
> And it would be really nice to understand where the OSTffff comes from
> and how one can avoid it.
>
> Any hint is really appreciated.
>
> Best regards
>
> Torsten

--
--------------------------------------------------------------------
Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453
Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt
www.gsi.de

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
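For reference, the --writeconf cycle suggested at the top would look roughly like this. This is only a sketch: the device paths and mount points below are placeholders, every client and every target must be unmounted before the tunefs step, and the Lustre manual should be consulted before running any of it.

```shell
# Rough sketch of regenerating the configuration logs (placeholders throughout).

# 1. Unmount everything: clients first, then OSTs, then the MDT/MGS.
umount /mnt/lustre            # on every client
umount /mnt/ost               # on every OSS, for every OST
umount /mnt/mdt               # on the MDS last

# 2. Run --writeconf on every target.
tunefs.lustre --writeconf /dev/mapper/MDT0000   # on the MDS
tunefs.lustre --writeconf /dev/mapper/OST000F   # on each OSS, for each OST

# 3. Remount in order: MGS/MDT first, then the OSTs, then the clients.
mount -t lustre /dev/mapper/MDT0000 /mnt/mdt
mount -t lustre /dev/mapper/OST000F /mnt/ost
```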
