Hi Kurt,
Probably too late if you already unlinked the files:
Did you take a zfs snapshot of the MDT and the damaged OST before removing
the files? If so, it may be possible to mount the OST's ZFS dataset as a
regular ZFS filesystem and pull out the objects corresponding to the lost
files, using the MDT zfs snapshot to get their FIDs.
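
A rough sketch of that recovery path, assuming a snapshot named @pre-scrub
exists (the pool/dataset names below are hypothetical):

# clone the snapshot so the damaged dataset itself is never written to
zfs clone ostpool/ost0@pre-scrub ostpool/ost0_recover
# Lustre OST datasets usually carry canmount=off, so enable it on the clone
zfs set canmount=on ostpool/ost0_recover
zfs set mountpoint=/mnt/ost0_recover ostpool/ost0_recover
zfs mount ostpool/ost0_recover
# object files should then appear under the O/ tree, named by object id;
# match them back to user files via the FIDs from the MDT snapshot
ls /mnt/ost0_recover/O/0/d0
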
Alex.

On Jan 22, 2016, at 7:39 AM, Kurt Strosahl <[email protected]> wrote:

> Good Morning,
> 
>   The real issue here is that the OST was decommissioned because the zpool on 
> which it resided died, which left about 30TB of data (and possibly several 
> million files) to be scrubbed.
> 
>   The steps I took were as follows... I set active=0 on the mds, and then set 
> lazystatfs=1 on the mds and the clients so that df commands wouldn't hang.
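> 
>   For reference, the commands were roughly as follows (fsname and OST index 
> are placeholders; lazystatfs is the client-side llite parameter, set 
> wherever a client is mounted):
> 
> mds#    lctl set_param osp.lustrefs-OST00NN-osc-MDT0000.active=0
> client# lctl set_param llite.*.lazystatfs=1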
> 
>   I don't see in the documentation where you have to set the ost to active=0 
> on every client; did I miss that?  Also, that is a marked change from 1.8, 
> where deactivating an OST just required active=0 on the mds.
> 
> w/r,
> Kurt
> 
> ----- Original Message -----
> From: "Sean Brisbane" <[email protected]>
> To: "Kurt Strosahl" <[email protected]>, "Chris Hunter" <[email protected]>
> Cc: [email protected]
> Sent: Friday, January 22, 2016 4:33:41 AM
> Subject: RE: Inactivated ost still showing up on the mds
> 
> Dear Kurt,
> 
> I'm not sure if this is exactly what you were trying to do, but when I 
> decommission an OST I also deactivate the OST on the client, which means that 
> nothing on the OST will be accessible but the filesystem will carry on 
> happily.  
> 
> lctl set_param osc.lustresystem-OST00NN-osc*.active=0
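> 
> To confirm it took effect, something like the following should report 0 
> (same placeholder target name as above):
> 
> lctl get_param osc.lustresystem-OST00NN-osc*.active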
> 
> Thanks,
> Sean
> ________________________________________
> From: lustre-discuss [[email protected]] on behalf of 
> Kurt Strosahl [[email protected]]
> Sent: 21 January 2016 18:09
> To: Chris Hunter
> Cc: [email protected]
> Subject: Re: [lustre-discuss] Inactivated ost still showing up on the mds
> 
> Good Afternoon Chris,
> 
>   I have already run the active=0 command on the mds; is there another step?  
> From my testing under 2.5.3, the clients will hang indefinitely without 
> lazystatfs=1.
> 
>   Our major issue at present is that when the OST died it had a fair amount 
> of data on it (closing in on 2M files lost), and it seems like the client 
> gets into a bad state when calls are made repeatedly to files that are lost 
> (but still have their ost index information).  As the crawl has unlinked 
> files, the number of errors has dropped, as have the client crashes.
> 
> w/r,
> Kurt
> 
> ----- Original Message -----
> From: "Chris Hunter" <[email protected]>
> To: [email protected]
> Cc: "Kurt Strosahl" <[email protected]>
> Sent: Thursday, January 21, 2016 12:50:03 PM
> Subject: [lustre-discuss] Inactivated ost still showing up on the mds
> 
> Hi Kurt,
> For reference when an underlying OST object is missing, this is the
> error message generated on our MDS (lustre 2.5):
>> Lustre: 12752:0:(mdd_object.c:1983:mdd_dir_page_build()) build page failed: -5!
> 
> I suspect that until you update the MGS configuration, the MDS will still 
> connect to the deactivated OST.
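> 
> If memory serves, the permanent, server-side deactivation goes through the 
> MGS, something like this (fsname and index are placeholders):
> 
> mgs# lctl conf_param <fsname>-OST00NN.osc.active=0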
> 
> My experience is that sometimes the recipe to deactivate an OST works 
> flawlessly; other times the clients hang on the "df" command and time out 
> on file access. I guess the order in which you run the commands 
> (i.e. client vs. server) is important.
> 
> regards,
> chris hunter
> 
>> From: Kurt Strosahl <[email protected]>
>> To: [email protected]
>> Subject: [lustre-discuss] Inactivated ost still showing up on the mds
>> 
>> All,
>> 
>>   Continuing the issues that I reported yesterday... I found that by 
>> unlinking lost files I was able to stop the error below from occurring; 
>> this gives me hope that the systems will stop crashing once all the lost 
>> files are scrubbed.
>> 
>> LustreError: 7676:0:(sec.c:379:import_sec_validate_get()) import 
>> ffff880623098800 (NEW) with no sec
>> LustreError: 7971:0:(sec.c:379:import_sec_validate_get()) import 
>> ffff880623098800 (NEW) with no sec
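>> 
>>   The crawl is doing roughly the following (the mount point is an example, 
>> and the UUID shown is from my test system); it finds every file with an 
>> object on the dead OST and unlinks it:
>> 
>> lfs find /lustre --obd testL-OST0003_UUID -type f | while read f; do rm -f "$f"; done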
>> 
>>   I do note that the inactivated ost doesn't seem to ever REALLY go away.  
>> After I removed an ost from my test system I noticed that the mds still 
>> showed it...
>> 
>> On a client hooked up to the test system...
>> client: lfs df
>> UUID                   1K-blocks        Used   Available Use% Mounted on
>> testL-MDT0000_UUID    1819458432       10112  1819446272   0% /testlustre[MDT:0]
>> testL-OST0000_UUID   57914433152       12672 57914418432   0% /testlustre[OST:0]
>> testL-OST0001_UUID   57914433408       12672 57914418688   0% /testlustre[OST:1]
>> testL-OST0002_UUID   57914433408       12672 57914418688   0% /testlustre[OST:2]
>> OST0003             : inactive device
>> testL-OST0004_UUID   57914436992      144896 57914290048   0% /testlustre[OST:4]
>> 
>> on the mds it still shows as up when I do lctl dl:
>> mds: lctl dl | grep OST0003
>> 22 UP osp testL-OST0003-osc-MDT0000 testL-MDT0000-mdtlov_UUID 5
>> 
>> So I stopped the test system, ran lctl dl again (getting no results), and 
>> restarted it.  Once the system was back up I still saw OST3 marked as UP 
>> with lctl dl:
>> mds: lctl dl | grep OST0003
>> 11 UP osp testL-OST0003-osc-MDT0000 testL-MDT0000-mdtlov_UUID 5
>> 
>> Why does the mds still think that this OST is up?
>> 

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
