On 2010-12-29, at 20:22, "Mervini, Joseph A" <[email protected]> wrote:
>
> And examining the LUN with tunefs.lustre produces the following:
>
> [r...@rio37 ~]# tunefs.lustre /dev/sdf
> checking for existing Lustre data: found last_rcvd
> tunefs.lustre: Unable to read 1.6 config /tmp/dirUvdBcz/mountdata.
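(A sketch, not from the original thread: the unreadable-config error above can be cross-checked at the ldiskfs level with debugfs from e2fsprogs, to see whether the mountdata file is missing entirely or present but unreadable. The dump path is an example; adjust to taste.)

```shell
# Read-only inspection of the Lustre config file on the raw OST device.
# -c opens the filesystem in catastrophic (read-only, no journal) mode;
# -R runs a single debugfs request.
debugfs -c -R "stat /CONFIGS/mountdata" /dev/sdf
debugfs -c -R "dump /CONFIGS/mountdata /tmp/mountdata.sdf" /dev/sdf
```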
That means the mountdata file is likely either missing or corrupted somehow.

> Read previous values:
> Target:
> Index:      54
> UUID:       ostr)o37sdf_UID
> Lustre FS:  lustre
> Mount type: ldiskfs
> Flags:      0x202
>             (OST upgrade1.4 )
> Persistent mount opts:
> Parameters:
>
> I suspected that there were file system inconsistencies, so I ran fsck on
> one of the targets and got a large number of errors, primarily
> "Multiply-claimed blocks", running e2fsck -fp. When it completed, the OS
> told me I needed to run fsck manually, which I did with the "-fy" options.
> This dumped a ton of inodes to lost+found. In addition, when it started it
> converted the file system from ext3 to ext2 during the fsck, and then
> recreated the journal when it completed.

There was some sort of device-level corruption in this case. The e2fsck fixed it as much as possible, and you should run ll_recover_lost_found_objs on the mounted filesystem.

> However, I was still unable to mount the LUN, and tunefs.lustre still had
> the FATAL condition shown above.
>
> I AM able to mount all of the LUNs as ldiskfs devices, so I suspect that
> the lustre config for those OSTs just got clobbered somehow. Also, looking
> at the inodes that were dumped to lost+found, most of them have timestamps
> that are more than a year old and that by policy should have been purged,
> so I'm wondering if it is just an artifact of the file system not being
> checked for a very long time.

That depends on atime, which is normally only updated on the MDS on disk.

> Other things to note: the OSS is Fibre Channel attached to a DDN 9500, and
> the OSTs that are having problems are associated with one controller of
> the couplet. That is suspicious, but because neither controller is showing
> any faults, I suspect that whatever has occurred did not happen recently.

It does seem to be the smoking gun.
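(For the lost+found recovery step, a minimal sketch: the OST is mounted as plain ldiskfs and ll_recover_lost_found_objs, which ships with Lustre, is pointed at its lost+found directory. The mount point here is my assumption.)

```shell
# Mount the OST directly as ldiskfs (mount point is an example).
mount -t ldiskfs /dev/sdf /mnt/ost
# Move recovered objects from lost+found back into the OST object
# hierarchy, using the object IDs recorded in the files themselves.
ll_recover_lost_found_objs -d /mnt/ost/lost+found
umount /mnt/ost
```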
> In addition, the /CONFIG/mountdata on all the targets originally had a
> timestamp of Aug 3 14:05 (and still does for the targets that can't be
> mounted).
>
> So I have two questions:
>
> How can I restore the config data on the OSTs that are having problems?

I think there was a thread on rebuilding the mountdata file recently.

> What does "Multiply-claimed blocks" mean and does it indicate corruption?

Disk-level corruption - it means two or more inodes claim the same block.

> I am afraid that running e2fsck may have compounded my problems and am
> holding off on doing any file system checks on the other 2 targets.

Well, it is needed at some point...

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
