[lustre-discuss] OST not recognized as lustre volume. - group descriptors corrupted

Scott Wood via lustre-discuss Sat, 07 May 2022 02:05:09 -0700

Hey folks,

We had a power "incident".  Not sure if it was the cause of our issues or if it 
just brought previous issues to light.  We're a CentOS 7, lustre 2.10.6-1.el7 
(from provided binaries) site, SAS direct connect HA paired OSSs running 
pacemaker to manage failover.  Standard stuff.  After the power incident, some 
OSTs were dropped and remounted but one did not come back.  At this point, that 
OST does not seem to be recognized as a Lustre volume.


First step I took was to disable the pacemaker resource and try to mount it 
manually to see how it was doing:

[root@hpcoss02 ~]# mount -t lustre /dev/mapper/mpathg /mnt/OST78
mount.lustre: /dev/mapper/mpathg has not been formatted with mkfs.lustre or the backend filesystem type is not supported by this tool

The syslog shows the following at that time (syslog is from a subsequent 
attempt but logs match):
May 07 12:04:15 hpcoss02.adqimr.ad.lan kernel: LDISKFS-fs (dm-6): ldiskfs_check_descriptors: Checksum for group 192 failed (39981!=25867)
May 07 12:04:15 hpcoss02.adqimr.ad.lan kernel: LDISKFS-fs (dm-6): group descriptors corrupted!

No fun.  Next attempt was to try mounting ldiskfs in case a journal replay 
would help:

[root@hpcoss02 ~]# mount -t ldiskfs /dev/mapper/mpathg /mnt/OST78
mount: wrong fs type, bad option, bad superblock on /dev/mapper/mpathg,
       missing codepage or helper program, or other error

       In some cases useful info is found in syslog - try
       dmesg | tail or so.

Unhappy chappie.  This OST has been up and running happily for months so it 
knows it's part of the stack.  I ran a "tune2fs -l" against it and the 
"Filesystem magic number:  0xEF53" shows the superblock is still readable as 
an ext-family (ldiskfs) filesystem.  I also ran an "e2fsck -n" against it, 
with stdout going to "out" and stderr going to "err".  It does not look good.  
I'll spare you the full output, but "out" shows the following, and I'm happy 
to dig deeper or answer specifics if you have ideas about what to look for:

[root@hpcoss01 fsck]# grep "Group descriptor.*checksum is.*, should be" out | head -n1
Group descriptor 192 checksum is 0x650b, should be 0x9c2d.  IGNORED.
[root@hpcoss01 fsck]# grep "Group descriptor.*checksum is.*, should be" out | wc -l
435299
[root@hpcoss01 fsck]# grep "Free blocks count wrong for group" out |head -n1
Free blocks count wrong for group #192 (32768, counted=0).
[root@hpcoss01 fsck]# grep "Free blocks count wrong for group" out |wc -l
412325
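
For anyone comparing the two sets of numbers: the kernel log prints the group 192 checksums in decimal while e2fsck prints them in hex, so a quick conversion confirms they are the same pair of values.  This is only a small sketch; the function names (to_hex, count_matches) are made up for illustration, and "out" is simply the file e2fsck's stdout was redirected to.

```shell
#!/bin/sh
# The read-only check was run roughly like this (stdout to "out", stderr to "err"):
#   e2fsck -n /dev/mapper/mpathg >out 2>err

# Convert a decimal checksum from the kernel log to the 4-digit hex form e2fsck prints.
to_hex() {
    printf '0x%04x\n' "$1"
}

# Count the lines in an e2fsck log that match a pattern.
#   $1 = grep pattern, $2 = log file (hypothetical helper)
count_matches() {
    grep -c -- "$1" "$2"
}

# The two values from "Checksum for group 192 failed (39981!=25867)":
to_hex 39981    # prints 0x9c2d, the value e2fsck says the checksum "should be"
to_hex 25867    # prints 0x650b, the value e2fsck says the checksum "is"
```

So the kernel and e2fsck agree on what is wrong with group 192.  The same counting as the grep/wc pipelines above would be, for example, count_matches "Free blocks count wrong for group" out.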

"err" showed quota issues but we're not too concerned about them as we don't 
enforce quotas.

We are currently replicating the block device of the OST to a logical volume so 
we can run non-destructive tests against an LVM snapshot to see what we get 
(thanks @stu for the suggestion).  We're also running an "lfs find mountpoint 
-obd lustre-OST004e" to get a list of the files that could be lost.  Once we 
have a usable copy of the OST, we intend to "e2fsck -fy" the snapshot to see if 
the objects come back or go to lost+found.  If they go to lost+found, we're 
considering replicating the MDT and MGT in a sandbox, mounting them and the OST 
and "lfsck"ing the OST to see if the MDT knows how to move the lost objects out 
of the lost+found to their happy places.
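
To make the plan above concrete, here is a rough dry-run sketch of the snapshot test.  It only prints the commands rather than running them.  The volume group, LV, snapshot name and snapshot size are all placeholders I've assumed for illustration, not our real layout, and the helper names (ost_name, plan) are hypothetical too.

```shell
#!/bin/sh
# All of these names and sizes are assumptions; adjust to the real setup.
VG=vg_recover        # assumed volume group holding the copy of the OST
LV=ost78_copy        # assumed LV that received the block-level copy
SNAP=ost78_snap      # assumed name for the throwaway snapshot
SNAP_SIZE=200G       # assumed copy-on-write space reserved for the snapshot

# Lustre OST names carry the index as 4 hex digits, so index 78 is OST004e.
#   $1 = filesystem name, $2 = decimal OST index
ost_name() {
    printf '%s-OST%04x\n' "$1" "$2"
}

# Print the planned commands instead of executing them (dry run).
plan() {
    echo "lvcreate --snapshot --size $SNAP_SIZE --name $SNAP /dev/$VG/$LV"
    echo "e2fsck -fy /dev/$VG/$SNAP"
    echo "lvremove /dev/$VG/$SNAP"
}

ost_name lustre 78   # prints lustre-OST004e, the name used with "lfs find"
plan
```

The point of the snapshot is that "e2fsck -fy" only ever writes to the snapshot's copy-on-write space, so if the result is ugly the snapshot can be removed and the copy underneath is untouched for another attempt.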

Are there any other troubleshooting steps we can take while we wait for the OST 
block device to be copied (that'll take a bit) for our test e2fsck?  Is there 
any output from the "tune2fs -l" or "e2fsck" that we can provide that could 
shed any light on the issue and provide possible solutions? Any other tips and 
tricks?  Thanks in advance for any insight.

Cheers
Scott

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
