I'm sorry, I should have updated you. You are right, it was misleading. The MDS/MDT was fine, and after about twenty minutes or so everything became active. We now have a working file system with data we can access, so we can't *thank you* enough.
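In case it helps anyone else sweating out a long recovery, this is what we kept an eye on while we waited. The MDT path is the one from our system; the obdfilter pattern for the OSTs is my best guess at the matching path and may differ between versions:

# MDT recovery state; shows RECOVERING while clients reconnect,
# then COMPLETE (or, as in the quoted output below, INACTIVE)
cat /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status

# same check for the OSTs (the obdfilter path is my assumption)
cat /proc/fs/lustre/obdfilter/*/recovery_status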
BTW, that's a pretty obscure "fix". I was going to ask for an explanation, but we've been pretty busy doing fsck's and lfsck's (which we are still working up to, since it takes a while to generate the db's; a rough sketch of that sequence is at the end of this message). It is a pretty slow process, but things are looking relatively good. Of course, when you go from thinking you just lost all your data to having almost all of it, anything looks pretty good. :)

Thanks again for your help,

Charlie Taylor
UF HPC Center

PS - we now refer to your commands to truncate the last_rcvd file as the "Dilger Procedure" (with great reverence; a note on why its 8k works is also at the end of this message). :)

ct

On Jun 3, 2008, at 4:20 PM, Andreas Dilger wrote:

> On Jun 02, 2008 19:51 -0400, Charles Taylor wrote:
>> Wow, you are one powerful witch doctor. So we rebuilt our system disk
>> (just to be sure) and that made no difference; we still panicked as
>> soon as we mounted the MDT. The "-o abort_recov" did not help either.
>> However, your recipe below worked wonders... almost. Now we can mount
>> the MDT, but it does not go into recovery. It just shows as
>> "inactive". We are so close, I can taste it, but what are we doing
>> wrong now?
>>
>> [EMAIL PROTECTED] lustre]# cat /proc/fs/lustre/mds/ufhpc-MDT0000/recovery_status
>> status: INACTIVE
>>
>> Which tire do we kick now? :)
>
> Well, deleting the tail of the last_rcvd file is the "hard" way to tell
> the MDT/OST it is no longer in recovery... The deleted part of the file
> is where the per-client state is kept, so when it is removed the MDT
> decides no recovery is needed.
>
> The "recovery_status" being "INACTIVE" is somewhat misleading. It means
> "no recovery is currently active", but the MDT is up and you should be
> able to use it, with the caveat that clients previously doing operations
> will get an IO error for in-flight operations before they start afresh...
> However, you said the clients are powered off, so they probably aren't
> busy doing anything...
>
> If you had a more complete stack trace, it would be useful to determine
> what is actually going wrong with the mount.
>
>> On Jun 2, 2008, at 3:36 PM, Andreas Dilger wrote:
>>> If mounting with "-o abort_recov" doesn't solve the problem,
>>> are you able to mount the MDT filesystem as "-t ldiskfs" instead of
>>> "-t lustre"? Try that, then copy and truncate the last_rcvd file:
>>>
>>> mount -t ldiskfs /dev/MDSDEV /mnt/mds
>>> cp /mnt/mds/last_rcvd /mnt/mds/last_rcvd.sav
>>> cp /mnt/mds/last_rcvd /tmp/last_rcvd.sav
>>> dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1
>>> umount /mnt/mds
>>>
>>> mount -t lustre /dev/MDSDEV /mnt/mds
>>>
>>> Cheers, Andreas
>>> --
>>> Andreas Dilger
>>> Sr. Staff Engineer, Lustre Group
>>> Sun Microsystems of Canada, Inc.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
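PPS - for our own notes on why the magic number in the recipe is 8k: as we read your explanation, last_rcvd keeps the server's own state at the front of the file and the per-client records in the tail, and in the 1.6-era sources the client records appear to start at the 8 KB mark (our reading; worth double-checking against your version). So the key line

# keep the first 8 KB (server state) and drop the per-client
# records, so the MDT decides no recovery is needed
dd if=/mnt/mds/last_rcvd.sav of=/mnt/mds/last_rcvd bs=8k count=1

preserves the server's identity while wiping the client table clean.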
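PPPS - and for anyone wondering what "generating the db's" involves, it is roughly the standard lfsck sequence. A sketch of what we are running, assuming the Lustre-patched e2fsprogs; the device names, db paths, and client mount point are placeholders for your own:

# 1. build the MDS database in a read-only (-n) pass over the unmounted MDT
e2fsck -n -v --mdsdb /tmp/mdsdb /dev/MDSDEV

# 2. build a database for each OST, feeding the MDS db back in
e2fsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /dev/OSTDEV

# 3. from a client with the filesystem mounted, dry-run lfsck first
lfsck -n -v --mdsdb /tmp/mdsdb --ostdb /tmp/ostdb-0 /mnt/lustre

The e2fsck passes walk every inode to build those databases, which is why the db generation is the slow part.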
