Mmm, unfortunately, still not quite right - Disabling dirdata will not
save you in the conversion to HTree case either. It will just prevent
*more* directories from getting a misplaced ".." dentry to begin with.
As to size... I figured it out once - But it depends on file name
length in the directory, since the dentry includes the file name. Once
the total size of dentries in a directory exceeds 4096 bytes (one
block), then it will be converted to an HTree, I believe.
So, at something like 32 bytes a dentry, which is like a 10-16 or so
character file name (exact dentry length here requires more checking
than I've got time for, but it's close), then you've got 32=2^5, 4096 =
2^12, so 2^12/2^5 = 2^7 or 128 dentries.
But of course, longer file names --> bigger dentries --> fewer dentries
before conversion to HTree.
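The back-of-the-envelope estimate above can be sketched in a few lines of Python. This is only a rough illustration, not a statement of what ldiskfs actually does: it assumes the stock ext4 entry layout (an 8-byte fixed header plus the name, padded to a multiple of 4 bytes) and a 4096-byte block. The `extra_bytes` parameter is a hypothetical placeholder for any per-entry payload such as a dirdata FID; its real size is not given in this thread. The "." and ".." entries and any slack in the block are ignored.

```python
def dentry_size(name_len, extra_bytes=0):
    """Approximate on-disk size of one directory entry, in bytes.

    Assumes an 8-byte fixed header followed by the name, with the whole
    record rounded up to a multiple of 4 bytes (stock ext4 layout).
    extra_bytes is a hypothetical allowance for per-entry data such as
    a FID; it is an assumption, not a value taken from the thread.
    """
    header = 8
    unpadded = header + name_len + extra_bytes
    return (unpadded + 3) & ~3  # round up to the next multiple of 4


def dentries_per_block(block_size, entry_size):
    """How many entries of entry_size fit into one block of block_size."""
    return block_size // entry_size


# The figure from the message: 32-byte dentries in a 4096-byte block.
print(dentries_per_block(4096, 32))  # 128

# Longer names give bigger dentries, so fewer fit before conversion.
for name_len in (8, 16, 24, 48):
    size = dentry_size(name_len)
    print(name_len, size, dentries_per_block(4096, size))
```

Treat the result as an upper bound on the number of files a directory can hold before it is likely to be converted, since real directories mix name lengths and carry the "." and ".." entries as well.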
As far as "easy way to scan", well, fsck set to not make changes will
find all the directories with misplaced ".." dentries, and also any
already damaged-by-conversion-to-HTree directories.
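A minimal sketch of such a "no changes" scan, driven from Python, might look like the following. It relies on the standard e2fsck options `-n` (open the file system read-only and answer "no" to every question) and `-f` (force a check even if the file system looks clean). The device path is a hypothetical placeholder, and the sketch assumes the Lustre-patched e2fsprogs is the `e2fsck` found on the PATH. Results from a mounted, active device can include spurious errors, so a quiescent target is assumed.

```python
import subprocess


def readonly_fsck_command(device):
    """Build an e2fsck command line that checks but never modifies.

    -f forces a full check; -n opens the device read-only and answers
    "no" to all repair prompts.
    """
    return ["e2fsck", "-f", "-n", device]


def run_readonly_fsck(device):
    """Run the read-only check and return (exit_code, combined_output).

    A non-zero exit code is expected when problems are found, so the
    call deliberately does not raise on failure.
    """
    result = subprocess.run(
        readonly_fsck_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout + result.stderr


# Example use (hypothetical MDT device path, not from the thread):
#   code, report = run_readonly_fsck("/dev/mapper/mdt0")
#   print(code)
#   print(report)
```

The report then has to be read by hand (or grepped) for directory complaints; the exact wording of those messages is not reproduced here.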
- Patrick
On 11/03/2015 01:12 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
Patrick,
Thanks for the clarification. I think I understand now. Disabling dirdata
would not help any directories which have already had their “..” entry
relocated. The next time fsck runs, those directories will potentially get
corrupted. The bigger reason to disable dirdata is to prevent more serious
corruption if a non-HTree directory with an incorrectly placed “..” gets
converted to an HTree directory.
How large does a directory need to be before the conversion to HTree happens?
I don’t suppose there is an easy way to scan the file system to look for
directories that might be subject to corruption…
—Rick
On Nov 3, 2015, at 12:30 PM, Patrick Farrell wrote:
Hm. That's almost, but not quite, right. Disabling dirdata during the fsck
run has no positive effect - fsck will still get upset about the incorrectly
placed entry. (And whether or not dirdata is enabled, fsck will do the same
thing. It doesn't know or care about the dirdata setting as such.)
Steps #1 and #2 will not cause any problems until you run fsck, but there's no
way around the issue once you do run fsck. The .. dentry must go back to the
correct location to make fsck happy. If I remember right, fsck creates the ..
dentry and doesn't include the fid (regardless of dirdata setting). This can
overwrite another dentry if one has been placed in the location normally
reserved for the .. dentry (which can happen if the dentry which was after the
.. dentry is deleted, thereby making a space large enough for a dentry+FID).
Furthermore, if you have a non-HTree directory where the .. dentry is incorrectly
placed (your steps 1 & 2), and you then add files until it is converted to an
HTree directory, THAT directory becomes corrupted in a more severe manner, one that
will cause your MDT to remount read-only and/or LBUG. (LU-2638 only fixes the ..
dentry bug for HTree directories themselves. It does not help with a corrupted
directory that then becomes an HTree directory.)
- Patrick
On 11/03/2015 11:17 AM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
On Oct 27, 2015, at 1:46 PM, Patrick Farrell wrote:
That's something of a time bomb - If one of those directories fsck wishes it
could correct is small and grows in number of files, you'll get the MDT going
read only (and a few odd LBUGs if you try to put it back).
I was looking back over the incident where I thought I had hit this bug, but
based on the lack of side effects that you mentioned, I am now starting to
think that I was mistaken. Nevertheless, I am trying to understand the bug a
little better in case I am still susceptible to it. I tried to summarize my
understanding below, and maybe you can tell me if I am correct.
For HTree directories, the problem is described in LU-2638. But since I am
running Lustre >2.4, I should not be affected by this bug.
For non-HTree directories, the problem is described in LU-5626. In order to
trigger the bug, the following steps must happen:
1) A non-HTree directory created under Lustre 1.8 (which does not have a FID
for its “..” entry) gets moved to a different parent directory.
2) Lustre tries to update the “..” entry in the directory, and if there is not
enough space in the existing entry, it creates a new “..” entry and adds the
FID.
3) Something happens to the MDT, and fsck needs to be run. When it runs, it
notices that “..” is no longer the second entry in the directory.
4) fsck tries to “fix” the problem by moving the “..” entry back to its
original position. With the FID in place, there is not enough space in the
original position, but fsck moves it anyway which causes the “..” entry to
overwrite part of the third entry in the directory.
If that is correct, then steps #1 and #2 can happen without causing any
problems. It is only at steps #3 and #4 that the corruption occurs, and as
long as dirdata is disabled before fsck is run, then there should not be any
problems.
Is that explanation accurate?
--
Rick Mohr
Senior HPC System Administrator
National Institute for Computational Sciences
http://www.nics.tennessee.edu