Yes, the clients appear to have recovered. I didn't want to risk an fsck until a new file-level backup was completed; this will take time given the size of our system.
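While the backup runs, a read-only fsck pass is a low-risk way to gauge the extent of the damage, since `-n` answers "no" to every prompt and writes nothing. A minimal sketch; the scratch ext2 image below is only a stand-in for the OST's block device (or, safer still, a dd copy of it):

```shell
# Build a small scratch ext2 image so the commands are runnable anywhere;
# in practice you would point e2fsck at the OST device or a dd copy of it.
img=$(mktemp)
dd if=/dev/zero of="$img" bs=1M count=8 status=none
mke2fs -q -F "$img"   # stand-in for the OST's ldiskfs filesystem

# -f forces a full check even if the filesystem is marked clean;
# -n opens it read-only and declines every repair, so nothing changes on disk.
e2fsck -fn "$img"
```

On a healthy filesystem this exits 0; a nonzero exit reports problems without touching the disk.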
I've done at least 5 or 6 raid rebuilds in the past without issue using these raid cards. We will try to isolate the cause of this problem further, e.g. a bad batch of spare drives, a buggy raid driver (I think this is a newer Lustre version), etc. Many thanks for your help.

Zach

> It sounds like it is working better. Did the clients recover? I would have
> re-run fsck before mounting it again, and moving the data off may still be
> the best plan. Since dropping the rebuilt drive reduced the corruption,
> certainly contact your raid vendor over this issue.
>
> Kevin
>
>
> Zachary Beebleson wrote:
>> Kevin,
>>
>> I just failed the drive and remounted. A basic 'df' hangs when it gets to
>> the mount point, but /proc/fs/lustre/health_check reports everything is
>> healthy. 'lfs df' on a client reports the OST is active, where it was
>> inactive before. However, I am now working with a degraded volume, though
>> it is raid 6. Should I try another rebuild, or just proceed with the
>> migration off of this OST asap?
>>
>> Thanks,
>> Zach
>>
>> PS. Sorry for the repeat message.
>>
>> On Fri, 13 May 2011, Kevin Van Maren wrote:
>>
>> > See bug 24264 -- certainly possible that the raid controller corrupted
>> > your filesystem.
>> >
>> > If you remove the new drive and reboot, does the file system look
>> > cleaner?
>> >
>> > Kevin
>> >
>> >
>> > On May 13, 2011, at 11:39 AM, Zachary Beebleson
>> > <[email protected]> wrote:
>> >
>> > > We recently had two raid rebuilds on a couple of storage targets that
>> > > did not go according to plan. The cards reported a successful rebuild
>> > > in each case, but ldiskfs errors started showing up on the associated
>> > > OSSs and the affected OSTs were remounted read-only. We are planning
>> > > to migrate the data off, but we've noticed that some clients are
>> > > getting I/O errors, while others are not.
>> > > As an example, a file with a stripe on at least one affected OST
>> > > could not be read on one client, i.e. I received a read error trying
>> > > to access it, while it was perfectly readable and apparently
>> > > uncorrupted on another (I am able to migrate the file to healthy OSTs
>> > > by copying it to a new file name). The clients with the I/O problem
>> > > see inactive devices corresponding to the read-only OSTs when I issue
>> > > 'lfs df', while the others without the I/O problems report the
>> > > targets as normal. Is it just that many clients are not yet aware of
>> > > an OST problem? I need clients with minimal I/O disruption in order
>> > > to migrate as much data off as possible.
>> > >
>> > > A client reboot appears to awaken them to the fact that there are
>> > > problems with the OSTs. However, I need them to be able to read the
>> > > data in order to migrate it off. Is there a way to reconnect the
>> > > clients to the problematic OSTs?
>> > >
>> > > We have dd-ed copies of the OSTs to try e2fsck against them, but the
>> > > results were not promising. The check aborted with:
>> > >
>> > > ------
>> > > Resize inode (re)creation failed: A block group is missing an inode
>> > > table.  Continue? yes
>> > >
>> > > ext2fs_read_inode: A block group is missing an inode table while
>> > > reading inode 7 in recreate inode
>> > > e2fsck: aborted
>> > > ------
>> > >
>> > > Any advice would be greatly appreciated.
>> > > Zach
>> > >
>> > > _______________________________________________
>> > > Lustre-discuss mailing list
>> > > [email protected]
>> > > http://lists.lustre.org/mailman/listinfo/lustre-discuss
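The copy-and-rename migration described in the thread can be scripted. A sketch only: the helper below is hypothetical, as is the `lfs find` invocation in the comment (the filesystem root `/lustre` and the OST UUID `lustre-OST0004_UUID` are placeholders, not values from the thread). Note that the copy is only restriped onto healthy OSTs if the bad OST has first been deactivated on the MDS, so no new objects are allocated there.

```shell
# migrate_file: copy the file to a temporary name (the copy gets fresh
# stripes, which land on healthy OSTs once the bad OST is deactivated on
# the MDS), then rename the copy over the original.
migrate_file() {
    f=$1
    cp -p "$f" "$f.migrate~" && mv "$f.migrate~" "$f"
}

# Hypothetical usage: feed it the files that stripe over the bad OST, e.g.
#   lfs find /lustre -type f --obd lustre-OST0004_UUID |
#       while read -r f; do migrate_file "$f"; done
```

The temporary-name-then-rename order means a failed copy leaves the original file untouched rather than half-replaced.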
