Thanks for the reply, Andreas.
Andreas Dilger wrote:
> I would start by simply trying to mount the OST filesystem with ldiskfs
> directly (mount options "-o ro" to avoid any further corruption or
> errors, and possibly also "noload" to avoid recovering the journal), and
> seeing if you can copy out the data from the filesystem into a backup
> filesystem, and then just reformat the OST.
Unfortunately, this did not work:
[r...@tebow2 ~]# mount -t ldiskfs -o ro /dev/F3P1L0/T2-F3P1L0 /mnt
mount: wrong fs type, bad option, bad superblock on /dev/F3P1L0/T2-F3P1L0,
missing codepage or other error
In some cases useful info is found in syslog - try
dmesg | tail or so
In dmesg I see this:
LDISKFS-fs error (device dm-7): ldiskfs_check_descriptors: Checksum for group 256 failed (18306!=0)
LDISKFS-fs: group descriptors corrupted!
Adding "noload" to the options list did not change anything.
> You should copy out the files with a tool that has xattr support, like
> rsync v3, or the RHEL tar using the --xattr option.
>
> Failing that, you may be able to e2fsck using a backup superblock and
> group descriptor with the "-B 4096 -b {blocknr}", where:
>
> blocknr = 32768 * {3,5,7}^n
>
> I don't think the first backup group descriptor is valid (that would be
> n=0 above, or 32768), so you could try (at random) 32768 * 3^2 = 294912.
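For reference, the candidate values of blocknr in the formula above can be enumerated with a small sketch. It assumes 4 KiB blocks (32768 blocks per group), a first data block of 0, and the sparse_super layout, where backups sit in group 1 and in groups whose number is a power of 3, 5 or 7. The function name backup_superblocks is only illustrative. If dumpe2fs can still read the device, its "Backup superblock at ..." lines should give the real locations.

```python
# Sketch only. Assumes 4 KiB blocks (32768 blocks per group), first data
# block 0, and the sparse_super layout: backup superblocks live in group 1
# and in every group whose number is a power of 3, 5 or 7.
BLOCKS_PER_GROUP = 32768


def backup_superblocks(group_count):
    """Return the starting block numbers of groups that hold a backup superblock."""
    groups = {1}  # 3**0 == 5**0 == 7**0 == 1, i.e. n = 0
    for base in (3, 5, 7):
        g = base
        while g < group_count:
            groups.add(g)
            g *= base
    # Each result is a value of 32768 * {3,5,7}^n, usable as "-b" with "-B 4096".
    return [g * BLOCKS_PER_GROUP for g in sorted(groups) if g < group_count]


if __name__ == "__main__":
    print(backup_superblocks(50))
    # → [32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632]
```

Under these assumptions, 294912 is group 9 (3^2) and 229376 is group 7 (7^1).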
I tried fsck from the 1.41.6 Lustre package with the '-p' option,
using several values of n with each of the three bases {3,5,7}. Nearly
all attempts look like this one; the same block is complained about
*almost* every time:
[r...@tebow2 ~]# fsck -b 294912 -B 4096 -f -p /dev/F3P1L0/T2-F3P1L0
fsck 1.41.6.sun1 (30-May-2009)
crn-OST0011: Block bitmap for group 6016 is not in group. (block 484237063)
FWIW, it seems that particular groups get complained about: 6016 and 10112.
However, with n=1 and 7 as the base (i.e. -b 229376), the fsck -p output
was a bit different (a different block, and it zeroed some group
descriptor checksums). I am now trying an fsck with that superblock and "-y".
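The "Block bitmap for group 6016 is not in group" complaint can be checked by hand. A minimal sketch, assuming 4 KiB blocks (32768 blocks per group), a first data block of 0, and no flex_bg feature (with flex_bg, bitmaps may legitimately live outside their own group). The helpers group_of and block_in_group are hypothetical names for illustration:

```python
# Sketch only. Assumes 4 KiB blocks (32768 blocks per group), first data
# block 0, and no flex_bg feature, so each group's block bitmap is expected
# to lie inside that group's own block range.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0


def group_of(block):
    """Block group that contains the given block number."""
    return (block - FIRST_DATA_BLOCK) // BLOCKS_PER_GROUP


def block_in_group(block, group):
    """True if block lies within the given group's block range."""
    first = FIRST_DATA_BLOCK + group * BLOCKS_PER_GROUP
    return first <= block < first + BLOCKS_PER_GROUP


if __name__ == "__main__":
    print(group_of(484237063))              # → 14777
    print(block_in_group(484237063, 6016))  # → False
```

If those assumptions hold, block 484237063 sits in group 14777, nowhere near group 6016, which would fit a garbage block-bitmap pointer in that group's descriptor rather than a small offset error.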
> If you can get it mounted at all you should copy the data out. If you
> have a very new kernel you may be able to mount the filesystem with ext4
> (so that you don't need to re-create the journal) to copy the data out.
>
> For the objects in the lost+found directory ll_recover_lost_found_objs
> will "rescue" all of these objects and put them back into the right
> directory structure for Lustre to find them again.
Hopefully we can get it mounted and rescue the data.
We appreciate your help.
Thanks,
Craig Prescott
UF HPC Center
_______________________________________________
Lustre-discuss mailing list
http://lists.lustre.org/mailman/listinfo/lustre-discuss