Any other advice for recovering this directory? 

-----Original Message-----
From: Darby Vicker <darby.vicke...@nasa.gov>
Date: Tuesday, September 13, 2016 at 9:44 AM
To: "Dilger, Andreas" <andreas.dil...@intel.com>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Inaccessible directory

>I did the actual e2fsck last night but that particular directory is still 
>inaccessible:
>
># ls -l /lustre2/mirrors/cpas 
>ls: cannot access /lustre2/mirrors/cpas: Stale file handle
># 
>
>Any advice on what to do next?  
>
>
>-----Original Message-----
>From: Darby Vicker <darby.vicke...@nasa.gov>
>Date: Monday, September 12, 2016 at 4:12 PM
>To: "Dilger, Andreas" <andreas.dil...@intel.com>
>Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
>Subject: Re: [lustre-discuss] Inaccessible directory
>
>>I finally had a chance to do the e2fsck (ready only) – see below.  Unless you 
>>say otherwise, we’ll schedule downtime for the filesystem soon and do this on 
>>the actual MDT.  
>>
>>It would be nice to figure out why this happened in the first place.  Any 
>>advice on how to track this down so we can potentially prevent this in the 
>>future?  
>>
>>
>>
>>[root@hpfs2-eg3-mds0 ~]# e2fsck -fn /dev/Storage/mdt-testsnap
>>e2fsck 1.42.13.wc5 (15-Apr-2016)
>>Pass 1: Checking inodes, blocks, and sizes
>>Deleted inode 14772127 has zero dtime.  Fix? no
>>
>>Deleted inode 85060890 has zero dtime.  Fix? no
>>
>>Pass 2: Checking directory structure
>>Pass 3: Checking directory connectivity
>>Pass 4: Checking reference counts
>>Inode 23596587 ref count is 1, should be 2.  Fix? no
>>
>>Inode 23596588 ref count is 1, should be 2.  Fix? no
>>
>>Inode 25254104 ref count is 1, should be 2.  Fix? no
>>
>>Inode 25254105 ref count is 1, should be 2.  Fix? no
>>
>>Inode 25254138 ref count is 1, should be 2.  Fix? no
>>
>>Inode 25254139 ref count is 1, should be 2.  Fix? no
>>
>>Inode 25776685 ref count is 1, should be 2.  Fix? no
>>
>>Inode 25776686 ref count is 1, should be 2.  Fix? no
>>
>>Inode 25776719 ref count is 1, should be 2.  Fix? no
>>
>>Inode 26294667 ref count is 1, should be 2.  Fix? no
>>
>>Inode 192124340 ref count is 1, should be 3.  Fix? no
>>
>>Pass 5: Checking group summary information
>>Inode bitmap differences:  -14772127 -85060890
>>Fix? no
>>
>>[QUOTA WARNING] Usage inconsistent for ID 0:actual (6880575488, 2980) != 
>>expected (11033616384, 3960)
>>[QUOTA WARNING] Usage inconsistent for ID 33152:actual (323584, 468) != 
>>expected (323584, 469)
>>[QUOTA WARNING] Usage inconsistent for ID 3260:actual (16705683456, 8380739) 
>>!= expected (16705683456, 8380748)
>>Update quota info for quota type 0? no
>>
>>[QUOTA WARNING] Usage inconsistent for ID 0:actual (6880399360, 2022) != 
>>expected (11033440256, 3002)
>>[QUOTA WARNING] Usage inconsistent for ID 3000:actual (31232651264, 41964064) 
>>!= expected (31232663552, 41963909)
>>Update quota info for quota type 1? no
>>
>>
>>hpfs2eg3-MDT0000: ********** WARNING: Filesystem still has errors **********
>>
>>hpfs2eg3-MDT0000: 45080951/213319680 files (0.2% non-contiguous), 
>>36292306/106658816 blocks
>>
>>[root@hpfs2-eg3-mds0 ~]# 
>>
>>
>>
>>From: "Dilger, Andreas" <andreas.dil...@intel.com>
>>Date: Thursday, September 1, 2016 at 4:25 PM
>>To: Darby Vicker <darby.vicke...@nasa.gov>
>>Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
>>Subject: Re: [lustre-discuss] Inaccessible directory
>>
>>Doing a file-level backup of the MDT is mostly useful when migrating from one 
>>backing storage to another backing storage device.  It isn't really useful to 
>>restore individual "files" from an MDT-only backup since they only contain 
>>the file metadata.  Restoring all of the inodes from the MDT file backup 
>>means that all of the inode numbers are changed and a full LFSCK is needed to 
>>rebuild the Object Index (OI) files so that the FID->inode mappings are 
>>correct.  Depending on the size of the filesystem this may take several hours.
>> 
>>For disaster recovery purposes, a device-level backup (dd) is more "plug and 
>>play" in that the whole image is restored from the backup and the LFSCK phase 
>>only needs to handle files that have been modified since the time the backup 
>>was created.
>> 
>>Cheers, Andreas
>>-- 
>>Andreas Dilger
>>Lustre Principal Architect
>>Intel High Performance Data Division
>> 
>>On 2016/09/01, 14:49, "Vicker, Darby (JSC-EG311)" <darby.vicke...@nasa.gov> 
>>wrote:
>> 
>>Thanks.  This is happening on all the clients so its not a DLM lock problem.  
>> 
>>We’ll try the fsck soon. If that come back clean is this likely a hardware 
>>corruption?
>> 
>>Could this have anything to do with our corruption?  
>> 
>>Aug 29 12:13:34 hpfs2-eg3-mds0 kernel: LustreError: 0-0: hpfs2eg3-MDT0000: 
>>trigger OI scrub by RPC for [0x2000079cd:0x3988:0x0], rc = 0 [1]
>> 
>>We make a file level backup of the MDT using the procedure in the lustre 
>>manual on a daily basis and keep a history of those.  We’ve never had a 
>>problem so we’ve never had to restore anything from the backups.  Is the 
>>device level backup (dd) necessary or is file level sufficient?
>> 
>> 
>>From: "Dilger, Andreas" <andreas.dil...@intel.com>
>>Date: Thursday, September 1, 2016 at 11:48 AM
>>To: Darby Vicker <darby.vicke...@nasa.gov>
>>Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
>>Subject: Re: [lustre-discuss] Inaccessible directory
>> 
>>The first thing to try is just cancel the DLM locks on this client, to see if 
>>the problem is just a stale lock cached there:
>> 
>>    lctl set_param ldlm.namespaces.*.lru_size=clear
>> 
>>That likely won't help if this same problem is being hit on all of your 
>>clients. 
>>
>>Next to try is running e2fsck on the MDT. It is recommended to use the most 
>>recent e2fsprogs-1.42.12-wc5, since this fixes a number of bugs in older 
>>versions. 
>> 
>>It is also strongly recommended to make a "dd" backup of the whole MDT before 
>>the e2fsck run, and to do this on a regular basis. This is convenient in case 
>>of MDT corruption (now or in the future), and you can restore only the MDT 
>>instead of having to restore 400TB of data as well.  I was just updating the 
>>Lustre User Manual about this http://review.whamcloud.com/21726 .
>> 
>>Running "e2fsck -fn" first can also give you an idea of what kind of 
>>corruption is present before Making the fix. 
>> 
>>Cheers, Andreas
>>
>>On Aug 31, 2016, at 10:54, Vicker, Darby (JSC-EG311) 
>><darby.vicke...@nasa.gov> wrote:
>>Hello,
>>
>>We’ve run into a problem where an entire directory on our lustre file system 
>>has become inaccessible.  
>>
>># mount | grep lustre2
>>192.52.98.142@tcp:/hpfs2eg3 on /lustre2 type lustre (rw,flock)
>># ls -l /lustre2/mirrors/cpas
>>ls: cannot access /lustre2/mirrors/cpas: Stale file handle
>># ls -l /lustre2/mirrors/
>>ls: cannot access /lustre2/mirrors/cpas: Stale file handle
>>4 drwxr-s--- 5 root     g27122      4096 Dec 23  2014 cev-repo/
>>? d????????? ? ?        ?              ?            ? cpas/
>>4 drwxrwxr-x 5 root     eg3         4096 Aug 21  2014 sls/
>>
>>
>>
>>Fortunately, we have a backup of this directory from about a week ago.  
>>However, I would like to figure out how this happened to prevent any further 
>>damage.  I’m not sure if we’re dealing with corruption in the LFS, damage to 
>>the underlying RAID or something else and I’d appreciate some help figuring 
>>this out.  Here’s some info on our lustre servers:
>>
>>CentOS 6.4 2.6.32-358.23.2.el6_lustre.x86_64
>>Lustre 2.4.3 (I know - we need to upgrade...)
>>Hardware RAID 10 MDT (Adaptec 6805Q – SSD’s)
>>(19x) OSS’s - Hardware RAID 6 OST’s (Adaptec 6445)
>>1 27TB OST per OSS, ldiskfs
>>Dual homed via Ethernet and IB
>>
>>Most Ethernet clients (~50 total) are CentOS 7 using 
>>lustre-client-2.8.0-3.10.0_327.13.1.el7.x86_64.x86_64.  Our compute nodes 
>>(~400 total) connect over IB and are still CentOS 6 using 
>>lustre-client-2.7.0-2.6.32_358.14.1.el6.x86_64.x86_64.
>>
>>
>>The lustre server hardware is about 4 years old now.  All RAID arrays are 
>>reported as healthy.  Searched JIRA and the mailing lists and couldn’t find 
>>anything related.  This sounded close at first:
>>
>>https://jira.hpdd.intel.com/browse/LU-3550
>>
>>But, as shown above, the issue shows up on a native lustre client, not an NFS 
>>export.  We are exporting this LFS via SMB but I don’t think that’s related.  
>>
>>I think the next step is to run an e2fsck but we haven’t done that yet and 
>>would appreciate advise on stepping through this.  
>>
>>Thanks,
>>Darby
>>
>>
>>
>>
>>
>>_______________________________________________
>>lustre-discuss mailing list
>>lustre-discuss@lists.lustre.org
>>http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

Reply via email to