Hi Kurt, probably too late if you already unlinked the files, but: did you
take a zfs snapshot of the MDT and of the damaged OST before removing files?
If so, it may be possible to mount the OST zfs dataset as a regular zfs
filesystem and pull out the objects corresponding to the lost files, using
an MDT zfs snapshot to get the fids.

Alex.
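
P.S. A rough sketch of that path, with made-up pool/dataset names
(mdtpool/mdt0, ostpool/ost3) and assuming the osd-zfs dataset will mount
through the regular ZPL at all:

  # before scrubbing anything, freeze both sides
  zfs snapshot mdtpool/mdt0@pre-cleanup
  zfs snapshot ostpool/ost3@pre-cleanup

  # while a file still exists, a client can record which objects are its
  lfs getstripe /lustre/path/to/file    # shows obdidx / objid per stripe
  lfs path2fid /lustre/path/to/file

  # later, expose the OST snapshot as a plain read-only zfs filesystem
  zfs clone -o canmount=on -o readonly=on -o mountpoint=/mnt/ost3-snap \
      ostpool/ost3@pre-cleanup ostpool/ost3-snap

and then copy out the objects matching the recorded objids/fids.
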
On Jan 22, 2016, at 7:39 AM, Kurt Strosahl <[email protected]> wrote:

> Good Morning,
>
> The real issue here is that the OST was decommissioned because the zpool on
> which it resided died, which left about 30TB of data (and possibly several
> million files) to be scrubbed.
>
> The steps I took were as follows... I set active=0 on the mds, and then set
> lazystatfs=1 on the mds and the clients so that df commands wouldn't hang.
>
> I don't see in the documentation where you have to set the ost to active=0
> on every client; did I miss that? Also, that is a marked change from 1.8,
> where deactivating an OST just required active=0 on the mds.
>
> w/r,
> Kurt
>
> ----- Original Message -----
> From: "Sean Brisbane" <[email protected]>
> To: "Kurt Strosahl" <[email protected]>, "Chris Hunter" <[email protected]>
> Cc: [email protected]
> Sent: Friday, January 22, 2016 4:33:41 AM
> Subject: RE: Inactivated ost still showing up on the mds
>
> Dear Kurt,
>
> I'm not sure if this is exactly what you were trying to do, but when I
> decommission an OST I also deactivate the OST on the client, which means
> that nothing on the OST will be accessible but the filesystem will carry on
> happily:
>
> lctl set_param osc.lustresystem-OST00NN-osc*.active=0
>
> Thanks,
> Sean
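>
> P.S. Spelled out for a filesystem named testL and OST index 3 (made-up
> values, substitute your own), this is run on every client and does not
> persist across a client remount:
>
>   lctl set_param osc.testL-OST0003-osc-*.active=0
>   lctl get_param osc.testL-OST0003-osc-*.active   # confirm it now reads 0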
> ________________________________________
> From: lustre-discuss [[email protected]] on behalf of
> Kurt Strosahl [[email protected]]
> Sent: 21 January 2016 18:09
> To: Chris Hunter
> Cc: [email protected]
> Subject: Re: [lustre-discuss] Inactivated ost still showing up on the mds
>
> Good Afternoon Chris,
>
> I have already run the active=0 command on the mds; is there another step?
> From my testing under 2.5.3 the clients will hang indefinitely without
> setting lazystatfs=1.
>
> Our major issue at present is that when the OST died it had a fair amount
> of data on it (closing in on 2M files lost), and it seems like the client
> gets into a bad state when calls are made repeatedly to files that are lost
> (but still have their ost index information). As the crawl has unlinked
> files, the number of errors has dropped, as have client crashes.
>
> w/r,
> Kurt
>
> ----- Original Message -----
> From: "Chris Hunter" <[email protected]>
> To: [email protected]
> Cc: "Kurt Strosahl" <[email protected]>
> Sent: Thursday, January 21, 2016 12:50:03 PM
> Subject: [lustre-discuss] Inactivated ost still showing up on the mds
>
> Hi Kurt,
> For reference, when an underlying OST object is missing, this is the
> error message generated on our MDS (lustre 2.5):
>
>> Lustre: 12752:0:(mdd_object.c:1983:mdd_dir_page_build()) build page
>> failed: -5!
>
> I suspect that until you update the MGS info, the MDS will still connect
> to the deactivated OST.
>
> My experience is that sometimes the recipe to deactivate an OST works
> flawlessly, while other times the clients hang on the "df" command and
> time out on file access. I guess the order in which you run the commands
> (ie. client vs server) is important.
>
> regards,
> chris hunter
>
>> From: Kurt Strosahl <[email protected]>
>> To: [email protected]
>> Subject: [lustre-discuss] Inactivated ost still showing up on the mds
>>
>> All,
>>
>> Continuing the issues that I reported yesterday... I found that by
>> unlinking lost files I was able to stop the below error from occurring;
>> this gives me hope that systems will stop crashing once all the lost
>> files are scrubbed.
>>
>> LustreError: 7676:0:(sec.c:379:import_sec_validate_get()) import
>> ffff880623098800 (NEW) with no sec
>> LustreError: 7971:0:(sec.c:379:import_sec_validate_get()) import
>> ffff880623098800 (NEW) with no sec
>>
>> I do note that the inactivated ost doesn't seem to ever REALLY go away.
>> After I removed an ost from my test system I noticed that the mds still
>> showed it...
>>
>> On a client hooked up to the test system:
>>
>> client: lfs df
>> UUID                 1K-blocks       Used    Available Use% Mounted on
>> testL-MDT0000_UUID  1819458432      10112   1819446272   0% /testlustre[MDT:0]
>> testL-OST0000_UUID 57914433152      12672  57914418432   0% /testlustre[OST:0]
>> testL-OST0001_UUID 57914433408      12672  57914418688   0% /testlustre[OST:1]
>> testL-OST0002_UUID 57914433408      12672  57914418688   0% /testlustre[OST:2]
>> OST0003            : inactive device
>> testL-OST0004_UUID 57914436992     144896  57914290048   0% /testlustre[OST:4]
>>
>> On the mds it still shows as up when I do lctl dl:
>>
>> mds: lctl dl | grep OST0003
>>  22 UP osp testL-OST0003-osc-MDT0000 testL-MDT0000-mdtlov_UUID 5
>>
>> So I stopped the test system, ran lctl dl again (getting no results), and
>> restarted it. Once the system was back up I still saw OST3 marked as UP
>> with lctl dl:
>>
>> mds: lctl dl | grep OST0003
>>  11 UP osp testL-OST0003-osc-MDT0000 testL-MDT0000-mdtlov_UUID 5
>>
>> Why does the mds still think that this OST is up?
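
P.P.S. On the lctl dl question at the bottom: as I understand it, active=0
only marks the import inactive; the osp device record is rebuilt from the
configuration logs on every mount, which is why it survives a restart. A
rough sketch of the permanent-removal path (testL-OST0003 is the example
name from the test system above; check the manual for your Lustre version,
and note that writeconf requires stopping the whole filesystem):

  # on the MGS: permanently deactivate the dead OST in the config logs
  lctl conf_param testL-OST0003.osc.active=0

  # to make the record disappear from lctl dl entirely: unmount clients
  # and all targets, then regenerate the logs on each remaining target
  tunefs.lustre --writeconf <target-device-or-zfs-dataset>

  # remount in order: MGS/MDT first, then OSTs, then clients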
