Hi,

Over the past weekend we encountered several hardware problems on one of 
our scratch file systems. One fallout from the failures was that one of our 
clusters was no longer able to access the file system through the routers. On 
closer examination I got I/O errors when using lctl ping to one of the 
OSS servers. Looking at the OSS dmesg showed problems with the routers being 
unavailable (they had been restarted earlier in this troubleshooting exercise). I 
shut down Lustre but encountered problems when trying to restart it: three of 
the OSTs would not mount. I rebooted the system and hit the same 
problems.
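For reference, the router checks were roughly along these lines (the NID shown is a placeholder, not our actual address):

```shell
# Ping the OSS's NID over the routed fabric (NID below is hypothetical).
lctl ping 10.196.135.1@o2ib1

# On the OSS: list local NIDs and dump the LNET routing table.
lctl list_nids
lctl show_route
```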

So when I tried to mount the OST I got the following:

[r...@rio37 ~]# mount -t lustre /dev/sdf /mnt
mount.lustre: mount /dev/sdf at /mnt failed: No such file or directory
Is the MGS specification correct?
Is the filesystem name correct?
If upgrading, is the copied client log valid? (see upgrade docs)

And examining the LUN with tunefs.lustre produces the following:

[r...@rio37 ~]# tunefs.lustre /dev/sdf
checking for existing Lustre data: found last_rcvd
tunefs.lustre: Unable to read 1.6 config /tmp/dirUvdBcz/mountdata.
Trying 1.4 config from last_rcvd
Reading last_rcvd
Feature compat=2, incompat=0

   Read previous values:
Target:     
Index:      54
UUID:       ostr)o37sdf_UID
Lustre FS:  lustre
Mount type: ldiskfs
Flags:      0x202
              (OST upgrade1.4 )
Persistent mount opts: 
Parameters:


tunefs.lustre FATAL: Must specify --mgsnode=
tunefs.lustre: exiting with 22 (Invalid argument)

When compared with a valid target on the same node, it is obvious that the 
config is screwed up:

[r...@rio37 ~]# tunefs.lustre /dev/sdd
checking for existing Lustre data: found CONFIGS/mountdata
Reading CONFIGS/mountdata

   Read previous values:
Target:     scratch1-OST0074
Index:      116
UUID:       ostrio37sdd_UUID
Lustre FS:  scratch1
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.196.135...@o2ib1


   Permanent disk data:
Target:     scratch1-OST0074
Index:      116
UUID:       ostrio37sdd_UUID
Lustre FS:  scratch1
Mount type: ldiskfs
Flags:      0x2
              (OST )
Persistent mount opts: errors=remount-ro,extents,mballoc
Parameters: mgsnode=10.196.135...@o2ib1

Writing CONFIGS/mountdata

I suspected that there were file system inconsistencies, so I ran fsck on one of 
the targets and got a large number of errors, primarily "Multiply-claimed 
blocks", when running e2fsck -fp. When it completed, the OS told me I needed to 
run fsck manually, which I did with the "-fy" option. This dumped a ton of 
inodes to lost+found. In addition, when it started, the fsck converted the file 
system from ext3 to ext2 and then recreated the journal when it completed. 
However, I was still unable to mount the LUN, and tunefs.lustre still hit the 
FATAL condition shown above.
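In case it helps anyone reproduce the sequence, here is the same two-pass check run against a small throwaway ext3 image rather than the real LUN (the paths are arbitrary; this is just an illustration, not what I ran against /dev/sdf):

```shell
# Build a small ext3 image as a stand-in for the OST device
# (mke2fs -j creates the journal, i.e. ext3).
dd if=/dev/zero of=/tmp/ost.img bs=1M count=8 2>/dev/null
mke2fs -q -F -j /tmp/ost.img

# Pass 1: preen mode -- fixes only problems that are safe to repair
# automatically and bails out on anything serious (such as
# multiply-claimed blocks), telling you to run fsck manually.
e2fsck -fp /tmp/ost.img

# Pass 2: forced full check, answering "yes" to every repair prompt.
e2fsck -fy /tmp/ost.img
```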

I AM able to mount all of the LUNs as ldiskfs devices, so I suspect that the 
Lustre config for those OSTs just got clobbered somehow. Also, looking at the 
inodes that were dumped to lost+found, most of them have timestamps more 
than a year old; by policy those files should have been purged, so I'm wondering 
if this is just an artifact of the file system not having been checked for a 
very long time.
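For what it's worth, this is roughly how I mounted a target as ldiskfs to poke around (the mount point name is arbitrary):

```shell
# Mount the OST's backing ldiskfs directly, read-only to be safe,
# and look at the Lustre config files it holds.
mkdir -p /mnt/ost_sdf
mount -t ldiskfs -o ro /dev/sdf /mnt/ost_sdf
ls -l /mnt/ost_sdf/CONFIGS /mnt/ost_sdf/last_rcvd
umount /mnt/ost_sdf
```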

Other things to note: the OSS is fibre channel attached to a DDN 9500, and the 
OSTs that are having problems are all associated with one controller of the 
couplet. That is suspicious, but because neither controller is showing any 
faults, I suspect that whatever occurred did not happen recently. In 
addition, the CONFIGS/mountdata on all the targets originally had a timestamp 
of Aug 3 14:05 (and still does for the targets that can't be mounted).

So I have two questions:

How can I restore the config data on the OSTs that are having problems?

What does "Multiply-claimed blocks" mean, and does it indicate corruption? I am 
afraid that running e2fsck may have compounded my problems, so I am holding off 
on doing any file system checks on the other two targets.

Thanks very much in advance for your help.

 
==
 
Joe Mervini
Sandia National Laboratories
Dept 09326
PO Box 5800 MS-0823
Albuquerque NM 87185-0823
 


_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss