I finally had a chance to do the e2fsck (read-only) – see below.  Unless you 
say otherwise, we’ll schedule downtime for the filesystem soon and do this on 
the actual MDT.  
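
For the record, here is roughly what we plan to run during that downtime, after 
taking a fresh dd backup as you suggested.  The mount point and LV name below 
are placeholders (our MDT is an LV in the same "Storage" VG as the snapshot), so 
treat this as a sketch:

    umount /mnt/mdt                            # stop the MDT target on the MDS (placeholder mount point)
    e2fsck -fy /dev/Storage/mdt                # repair pass on the real MDT (placeholder LV name)
    mount -t lustre /dev/Storage/mdt /mnt/mdt  # bring the MDT back up

Please let us know if that sequence looks wrong.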

It would be nice to figure out why this happened in the first place.  Any 
advice on how to track this down so we can potentially prevent this in the 
future?  
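
In the meantime we were going to look for earlier ldiskfs/Lustre errors on the 
MDS and check the Adaptec controller, roughly as below (assuming the arcconf 
utility is installed for the controller):

    grep -iE 'ldiskfs|lustreerror' /var/log/messages*   # earlier filesystem/Lustre errors on the MDS
    arcconf getconfig 1 ld                               # logical-device status on the Adaptec controller

Is there anything else worth checking?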



[root@hpfs2-eg3-mds0 ~]# e2fsck -fn /dev/Storage/mdt-testsnap
e2fsck 1.42.13.wc5 (15-Apr-2016)
Pass 1: Checking inodes, blocks, and sizes
Deleted inode 14772127 has zero dtime.  Fix? no

Deleted inode 85060890 has zero dtime.  Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Inode 23596587 ref count is 1, should be 2.  Fix? no

Inode 23596588 ref count is 1, should be 2.  Fix? no

Inode 25254104 ref count is 1, should be 2.  Fix? no

Inode 25254105 ref count is 1, should be 2.  Fix? no

Inode 25254138 ref count is 1, should be 2.  Fix? no

Inode 25254139 ref count is 1, should be 2.  Fix? no

Inode 25776685 ref count is 1, should be 2.  Fix? no

Inode 25776686 ref count is 1, should be 2.  Fix? no

Inode 25776719 ref count is 1, should be 2.  Fix? no

Inode 26294667 ref count is 1, should be 2.  Fix? no

Inode 192124340 ref count is 1, should be 3.  Fix? no

Pass 5: Checking group summary information
Inode bitmap differences:  -14772127 -85060890
Fix? no

[QUOTA WARNING] Usage inconsistent for ID 0:actual (6880575488, 2980) != expected (11033616384, 3960)
[QUOTA WARNING] Usage inconsistent for ID 33152:actual (323584, 468) != expected (323584, 469)
[QUOTA WARNING] Usage inconsistent for ID 3260:actual (16705683456, 8380739) != expected (16705683456, 8380748)
Update quota info for quota type 0? no

[QUOTA WARNING] Usage inconsistent for ID 0:actual (6880399360, 2022) != expected (11033440256, 3002)
[QUOTA WARNING] Usage inconsistent for ID 3000:actual (31232651264, 41964064) != expected (31232663552, 41963909)
Update quota info for quota type 1? no


hpfs2eg3-MDT0000: ********** WARNING: Filesystem still has errors **********

hpfs2eg3-MDT0000: 45080951/213319680 files (0.2% non-contiguous), 36292306/106658816 blocks

[root@hpfs2-eg3-mds0 ~]# 



From: "Dilger, Andreas" <andreas.dil...@intel.com>
Date: Thursday, September 1, 2016 at 4:25 PM
To: Darby Vicker <darby.vicke...@nasa.gov>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Inaccessible directory

Doing a file-level backup of the MDT is mostly useful when migrating from one 
backing storage to another backing storage device.  It isn't really useful to 
restore individual "files" from an MDT-only backup since they only contain the 
file metadata.  Restoring all of the inodes from the MDT file backup means that 
all of the inode numbers are changed and a full LFSCK is needed to rebuild the 
Object Index (OI) files so that the FID->inode mappings are correct.  Depending 
on the size of the filesystem this may take several hours.
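
If you ever do need to restore from the file-level backup, the OI rebuild would 
be started with something along the lines of the following (the exact parameter 
names vary between Lustre versions, so this is only a sketch):

    lctl lfsck_start -M hpfs2eg3-MDT0000                   # rebuild OI files / run LFSCK on the restored MDT
    lctl get_param osd-ldiskfs.hpfs2eg3-MDT0000.oi_scrub   # check scrub progress and status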
 
For disaster recovery purposes, a device-level backup (dd) is more "plug and 
play" in that the whole image is restored from the backup and the LFSCK phase 
only needs to handle files that have been modified since the time the backup 
was created.
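
For example, with the MDT stopped (or working from an LVM snapshot of it), the 
backup and restore are simply (device and file names here are illustrative):

    dd if=/dev/Storage/mdt of=/backup/mdt-image.dd bs=4M   # device-level backup of the MDT
    dd if=/backup/mdt-image.dd of=/dev/Storage/mdt bs=4M   # restore the image over the MDT device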
 
Cheers, Andreas
-- 
Andreas Dilger
Lustre Principal Architect
Intel High Performance Data Division
 
On 2016/09/01, 14:49, "Vicker, Darby (JSC-EG311)" <darby.vicke...@nasa.gov> 
wrote:
 
Thanks.  This is happening on all the clients, so it's not a DLM lock problem.  
 
We’ll try the fsck soon.  If that comes back clean, is this likely hardware 
corruption?
 
Could this have anything to do with our corruption?  
 
Aug 29 12:13:34 hpfs2-eg3-mds0 kernel: LustreError: 0-0: hpfs2eg3-MDT0000: 
trigger OI scrub by RPC for [0x2000079cd:0x3988:0x0], rc = 0 [1]
 
We make a file-level backup of the MDT daily using the procedure in the Lustre 
manual and keep a history of those.  We’ve never had a problem, so we’ve never 
had to restore anything from the backups.  Is the device-level backup (dd) 
necessary, or is a file-level backup sufficient?
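
For reference, our daily backup follows the manual fairly closely and does 
roughly the following against an LVM snapshot of the MDT (paths and names below 
are illustrative):

    mount -t ldiskfs /dev/Storage/mdt-snap /mnt/mdt-ldiskfs                # mount the snapshot as ldiskfs
    cd /mnt/mdt-ldiskfs
    getfattr -R -d -m '.*' -e hex -P . > /backup/mdt-ea-$(date +%F).bak    # save the extended attributes
    tar czf /backup/mdt-files-$(date +%F).tgz --sparse .                   # archive the metadata files
    cd /; umount /mnt/mdt-ldiskfs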
 
 
From: "Dilger, Andreas" <andreas.dil...@intel.com>
Date: Thursday, September 1, 2016 at 11:48 AM
To: Darby Vicker <darby.vicke...@nasa.gov>
Cc: "lustre-discuss@lists.lustre.org" <lustre-discuss@lists.lustre.org>
Subject: Re: [lustre-discuss] Inaccessible directory
 
The first thing to try is just cancel the DLM locks on this client, to see if 
the problem is just a stale lock cached there:
 
    lctl set_param ldlm.namespaces.*.lru_size=clear
 
That likely won't help if this same problem is being hit on all of your 
clients. 

Next to try is running e2fsck on the MDT. It is recommended to use the most 
recent e2fsprogs-1.42.12-wc5, since this fixes a number of bugs in older 
versions. 
 
It is also strongly recommended to make a "dd" backup of the whole MDT before 
the e2fsck run, and to do this on a regular basis. This is convenient in case 
of MDT corruption (now or in the future), and you can restore only the MDT 
instead of having to restore 400TB of data as well.  I was just updating the 
Lustre User Manual about this http://review.whamcloud.com/21726 .
 
Running "e2fsck -fn" first can also give you an idea of what kind of corruption 
is present before Making the fix. 
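
If you don't want to take the MDT down just for the read-only pass, one option 
is to run it against an LVM snapshot, e.g. (LV names and snapshot size are 
illustrative):

    lvcreate -s -L 20G -n mdt-testsnap /dev/Storage/mdt   # snapshot of the MDT LV
    e2fsck -fn /dev/Storage/mdt-testsnap                  # read-only check on the snapshot
    lvremove /dev/Storage/mdt-testsnap                    # drop the snapshot afterwards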
 
Cheers, Andreas

On Aug 31, 2016, at 10:54, Vicker, Darby (JSC-EG311) <darby.vicke...@nasa.gov> 
wrote:
Hello,

We’ve run into a problem where an entire directory on our lustre file system 
has become inaccessible.  

# mount | grep lustre2
192.52.98.142@tcp:/hpfs2eg3 on /lustre2 type lustre (rw,flock)
# ls -l /lustre2/mirrors/cpas
ls: cannot access /lustre2/mirrors/cpas: Stale file handle
# ls -l /lustre2/mirrors/
ls: cannot access /lustre2/mirrors/cpas: Stale file handle
4 drwxr-s--- 5 root     g27122      4096 Dec 23  2014 cev-repo/
? d????????? ? ?        ?              ?            ? cpas/
4 drwxrwxr-x 5 root     eg3         4096 Aug 21  2014 sls/



Fortunately, we have a backup of this directory from about a week ago.  
However, I would like to figure out how this happened to prevent any further 
damage.  I’m not sure whether we’re dealing with corruption in the LFS, damage 
to the underlying RAID, or something else, and I’d appreciate some help figuring 
this out.  Here’s some info on our Lustre servers:

CentOS 6.4 2.6.32-358.23.2.el6_lustre.x86_64
Lustre 2.4.3 (I know - we need to upgrade...)
Hardware RAID 10 MDT (Adaptec 6805Q – SSD’s)
(19x) OSS’s - Hardware RAID 6 OST’s (Adaptec 6445)
1 27TB OST per OSS, ldiskfs
Dual homed via Ethernet and IB

Most Ethernet clients (~50 total) are CentOS 7 using 
lustre-client-2.8.0-3.10.0_327.13.1.el7.x86_64.x86_64.  Our compute nodes (~400 
total) connect over IB and are still CentOS 6 using 
lustre-client-2.7.0-2.6.32_358.14.1.el6.x86_64.x86_64.


The Lustre server hardware is about 4 years old now.  All RAID arrays are 
reported as healthy.  We searched JIRA and the mailing lists and couldn’t find 
anything related.  This sounded close at first:

https://jira.hpdd.intel.com/browse/LU-3550

But, as shown above, the issue shows up on a native Lustre client, not an NFS 
export.  We are exporting this LFS via SMB, but I don’t think that’s related.  

I think the next step is to run an e2fsck, but we haven’t done that yet and 
would appreciate advice on stepping through this.  

Thanks,
Darby




