On Jun 03, 2008 16:37 -0400, Charles Taylor wrote:
> I'm sorry, I should have updated you.  You are right, it was
> misleading.  The MDS/MDT was fine, and after about twenty minutes or
> so everything became active.  We now have a working file system with
> data that we can access, so we can't *thank you* enough.
You're welcome.

> BTW, that's a pretty obscure "fix".  I was going to ask for an
> explanation but we've been pretty busy doing fsck's and lfsck's
> (which we are still working up to, since it takes a while to generate
> the db's).  It is a pretty slow process but things are looking
> relatively good.  Of course, when you go from thinking you just lost
> all your data to having almost all of it, anything looks pretty
> good. :)
>
> PS - we now refer to your commands to truncate the last_rcvd file as
> the "Dilger Procedure" (with great reverence). :)

Well, by no means should this be a normal process.  If you can spare
the time after your system is back in shape, copying the last_rcvd.sav
file to a test MDT and mounting it with a serial console enabled would
help track down the root cause of this.  The fewer people who have to
perform the "Dilger Procedure" the better.

> On Jun 3, 2008, at 4:20 PM, Andreas Dilger wrote:
> > On Jun 02, 2008 19:51 -0400, Charles Taylor wrote:
> >> Wow, you are one powerful witch doctor.  So we rebuilt our system
> >> disk (just to be sure) and that made no difference; we still
> >> panicked as soon as we mounted the MDT.  The "-o abort_recov" did
> >> not help either.  However, your recipe below worked wonders...
> >> almost.  Now we can mount the MDT but it does not go into
> >> recovery.  It just shows as "inactive".  We are so close, I can
> >> taste it, but what are we doing wrong now?
> >>
> >> [EMAIL PROTECTED] lustre]# cat /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status
> >> status: INACTIVE
> >>
> >> Which tire do we kick now? :)
> >
> > Well, deleting the tail of the last_rcvd file is the "hard" way to
> > tell the MDT/OST it is no longer in recovery...  The deleted part
> > of the file is where the per-client state is kept, so when it is
> > removed the MDT decides no recovery is needed.
> >
> > The "recovery_status" being "INACTIVE" is somewhat misleading.  It
> > means "no recovery is currently active", but the MDT is up and you
> > should be able to use it, with the caveat that clients previously
> > doing operations will get an IO error for in-flight operations
> > before they start afresh...  However, you said the clients are
> > powered off, so they probably aren't busy doing anything...
> >
> > If you had a more complete stack trace it would be useful to
> > determine what is actually going wrong with the mount.
> >
> >> On Jun 2, 2008, at 3:36 PM, Andreas Dilger wrote:
> >>> If mounting with "-o abort_recov" doesn't solve the problem,
> >>> are you able to mount the MDT filesystem as "-t ldiskfs" instead
> >>> of "-t lustre"?  Try that, then copy and truncate the last_rcvd
> >>> file:
> >>>
> >>> mount -t ldiskfs /dev/MDSDEV /mnt/mds
> >>> cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
> >>> cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
> >>> dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
> >>> umount /mnt/mds
> >>>
> >>> mount -t lustre /dev/MDSDEV /mnt/mds
> >>>
> >>> Cheers, Andreas
> >>> --
> >>> Andreas Dilger
> >>> Sr. Staff Engineer, Lustre Group
> >>> Sun Microsystems of Canada, Inc.
> >
> > Cheers, Andreas
> > --
> > Andreas Dilger
> > Sr. Staff Engineer, Lustre Group
> > Sun Microsystems of Canada, Inc.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
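PS: for anyone wondering why "bs=8k count=1" is the magic number in
the truncation step above: the per-server state lives at the start of
last_rcvd and, if memory serves, the per-client records begin at an
8KB offset, so copying only the first 8KB keeps the server data and
discards the client slots.  A quick sanity check on the saved copy is
to dump its first bytes; the filesystem/server UUID should be visible
right at the start:

    # print the beginning of the saved last_rcvd; the UUID string
    # appears at the very start of the per-server data
    od -c /tmp/last_rcvd.sav | head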
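PPS: if you do find time for the test-MDT reproduction, a minimal
sketch would be something like the following.  Note that /dev/TESTDEV
and /mnt/test are placeholders for a scratch MDT device and mount
point on your test system, not real names:

    # mount the scratch MDT as plain ldiskfs and drop in the saved file
    mount -t ldiskfs /dev/TESTDEV /mnt/test
    cp /tmp/last_rcvd.sav /mnt/test/last_rcvd
    umount /mnt/test

    # with a serial console (or netconsole) capturing kernel messages,
    # mount as lustre and save the full stack trace if it panics
    mount -t lustre /dev/TESTDEV /mnt/test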
_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
