On 2010-11-09, at 03:07, Aurelien Degremont wrote:
> Andreas Dilger wrote:
>>> Cold replace:
>>> 1 - Empty your OST
>>> 2 - Stop your filesystem
>>> 3 - Replace/reformat using the same index
>>> 4 - Restart using --writeconf
>>> 5 - Remount the clients
>> 6 - fix up the MDS's idea of the OST's last-allocated object.
>>> Did I miss something ?
>> Other than #6, it looks correct.
>
> How do you fix #6? What are the actions needed for that?

That is what is described in the rest of this email...

>>> What is currently preventing a freshly formatted OST with the same index
>>> from registering itself properly (using the first_time flag) with the MGS
>>> and MDT when remounting, and:
>>> - refreshing its CONFIG from the MGS internal cache
>>> - telling the MDT to reset the last_rcvd/LAST_ID it knows for this OST.
>>> That way, we could have an easy way to hot replace an OST.
>>> How do you think this can be achieved ?
>> It probably wouldn't be impossible to have a new OST gracefully replace an
>> old one, if that is what the administrator wanted. Some "special" action
>> would need to be taken on the OST and/or MDT to ensure that this is what the
>> admin wanted, instead of e.g. accidentally inserting some other OST with the
>> same index and corrupting the filesystem because of duplicate object IDs, or
>> not being able to access existing objects on the "real" OST at that index.
>> - the new OST would be best off to start allocating objects at the LAST_ID
>>   of the old OST, so that there is no risk of confusion between objects
>> - the MDT contains the old LAST_ID in its lov_objids file, and it sends this
>>   to the OST at connection time, this is no problem
>> - currently the new OST will refuse to allow the MDT to connect, because it
>>   detects that the old LAST_ID value from the MDT is inconsistent with its
>>   own value
>> - it would be relatively straightforward to have the OST detect if the local
>>   LAST_ID value was "new" and use the MDT value instead
>
> Can we base this check on the 'first_time' flag?
> I mean, the OST updates its LAST_ID based on what the MDT tells it only if it
> has the 'first_time' flag set.

The problem is that if the 'first_time' flag is always set on a new OST, then
any OST accidentally claiming the same index (e.g. from a test filesystem of
the same name, or from user error) could replace the valid OST. This
'first_time' flag could not be the default.

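To make the point about the flag concrete, here is a minimal sketch of the decision the OST would have to make at MDT connect time. It is a simplified model of the behaviour described in this thread, not actual Lustre code: the names `Action`, `Decision`, `reconcile_last_id`, `replace_confirmed` and `fresh_threshold` are all hypothetical, and it treats any mismatch between the two LAST_ID values as "inconsistent", which is an assumption made for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"            # both sides agree, nothing to fix
    ADOPT_MDT_VALUE = "adopt"    # new OST takes over the old OST's LAST_ID
    REFUSE = "refuse"            # inconsistent: reject the MDT connection


@dataclass
class Decision:
    action: Action
    last_id: int  # LAST_ID the OST would use after the decision


def reconcile_last_id(local_last_id, mdt_last_id, replace_confirmed,
                      fresh_threshold=1):
    """Decide what an OST should do with the LAST_ID sent by the MDT.

    local_last_id     -- value found in the OST's own LAST_ID file
    mdt_last_id       -- value the MDT holds for this OST index (lov_objids)
    replace_confirmed -- hypothetical flag: the admin explicitly asked for this
                         OST to replace the old one; NOT set by default
    fresh_threshold   -- hypothetical cutoff below which the local value is
                         considered "new" (freshly formatted)
    """
    if local_last_id == mdt_last_id:
        return Decision(Action.ACCEPT, local_last_id)

    looks_new = local_last_id <= fresh_threshold
    if looks_new and replace_confirmed:
        # Continue allocating where the old OST stopped, so that object IDs
        # from the old and new OST can never be confused.
        return Decision(Action.ADOPT_MDT_VALUE, mdt_last_id)

    # Either the OST is not fresh, or nobody confirmed the replacement:
    # behave as today and refuse the connection.
    return Decision(Action.REFUSE, local_last_id)
```

With this shape, a freshly formatted OST that merely happens to claim the same index (`replace_confirmed=False`) is still refused, which is the safety property argued for above.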
>> - the danger is if the LAST_ID file was lost for some reason (e.g. corruption
>>   causes e2fsck to erase it). In that case, the OST startup code should be
>>   smart enough to regenerate LAST_ID based on walking the object directories,
>>   which would also avoid the need to do this in e2fsck/lfsck (which can only
>>   run offline)
>> - in cases where the on-disk LAST_ID is much lower than the MDT-supplied
>>   value, the OST should just skip precreation of all the intermediate objects
>>   and just start using the new MDT value
>
> This seems a different feature, even if related, which is "Better handling of
> LAST_ID corruption".

Partly, yes.

>> - the only other thing is to avoid the case where a "new" OST is accidentally
>>   assigned the same index, when that isn't what is wanted. There needs to be
>>   some way to "prime" the new OST (that is NOT the default for a newly
>>   formatted OST), or conversely tell the MDT that it should signal the new
>>   OST to take the place of the old one, so that there are not any mistakes
>
> Indeed, this is important, especially if we want this to support online
> replace. Another option when formatting the OST?
> --replace ? Which is only accepted when --index is set?

Yes, that would probably be a good way to handle it from the user interface.
The other question is how to handle this internally. Probably a flag stored
in the mountinfo or last_rcvd file.

>> Since this is something that has come up on this list a number of times in
>> the last year, I guess it means that a Lustre filesystem is now outliving
>> the hardware on which it runs, so it would definitely be worthwhile for
>> someone to look at this. I filed bug 24128 on this, in case anyone wants to
>> work on it.
>
> Can you also add it to Community project list?

Done.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
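
The two LAST_ID recovery ideas discussed above (regenerating a lost LAST_ID by walking the object directories, and skipping precreation when the MDT-supplied value is much higher) can be sketched roughly as follows. This is only an illustration: it assumes objects are stored as files named by their decimal object ID somewhere under a directory tree, which may not match the real on-disk layout, and the function names `regenerate_last_id` and `resume_allocation_from` are hypothetical.

```python
import os


def regenerate_last_id(object_root):
    """Return the highest object ID found under object_root.

    Assumption: every object is a file whose name is its decimal object ID.
    Files with non-numeric names (e.g. a LAST_ID file itself) are ignored.
    Returns 0 if no objects are found.
    """
    highest = 0
    for _dirpath, _dirnames, filenames in os.walk(object_root):
        for name in filenames:
            if name.isascii() and name.isdigit():
                highest = max(highest, int(name))
    return highest


def resume_allocation_from(on_disk_last_id, mdt_last_id):
    """Return the first object ID to allocate next.

    If the MDT-supplied value is higher than the on-disk one, the gap is
    skipped outright instead of precreating every intermediate object.
    """
    return max(on_disk_last_id, mdt_last_id) + 1
```

For example, an OST whose regenerated LAST_ID is 17 and whose MDT reports 5000 would resume allocation at 5001 without creating objects 18 through 5000.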
