On 2010-11-09, at 03:07, Aurelien Degremont wrote:
> Andreas Dilger wrote:
>>> Cold replace:
>>> 1 - Empty your OST
>>> 2 - Stop your filesystem
>>> 3 - Replace/reformat using the same index
>>> 4 - Restart using --writeconf
>>> 5 - Remount the clients
>> 6 - fix up the MDS's idea of the OST's last-allocated object.
>>> Did I miss something?
>> Other than #6, it looks correct.
> 
> How do you fix #6?  What are the actions needed for that?

That is what is described in the rest of this email...

>>> What currently prevents a freshly formatted OST with the same index from 
>>> registering itself properly (using the first_time flag) with the MGS and MDT 
>>> when remounting, and:
>>> - refreshing its CONFIG from MGS internal cache
>>> - telling MDT to reset last_rcvd/LAST_ID it knows for this OST.
>>> That way, we could have an easy way to hot replace an OST.
>>> How do you think this can be achieved?
>> It probably wouldn't be impossible to have a new OST gracefully replace an 
>> old one, if that is what the administrator wanted.  Some "special" action 
>> would need to be taken on the OST and/or MDT to ensure that this is what the 
>> admin wanted, instead of e.g. accidentally inserting some other OST with the 
>> same index and corrupting the filesystem because of duplicate object IDs, or 
>> not being able to access existing objects on the "real" OST at that index.
>> - the new OST would be best off to start allocating objects at the LAST_ID
>>  of the old OST, so that there is no risk of confusion between objects
>> - the MDT contains the old LAST_ID in its lov_objids file, and it sends this
>>  to the OST at connection time, so this is no problem
>> - currently the new OST will refuse to allow the MDT to connect, because it
>>  detects that the old LAST_ID value from the MDT is inconsistent with its
>>  own value
>> - it would be relatively straightforward to have the OST detect if the local
>>  LAST_ID value was "new" and use the MDT value instead
> 
> Can we base this check on the 'first_time' flag?
> I mean, the OST updates its LAST_ID based on what the MDT tells it only if it 
> has the 'first_time' flag set.

The problem is that if the 'first_time' flag is always set on a new OST, then 
any OST accidentally claiming the same index (e.g. from a test filesystem of 
the same name, or from user error) could replace the valid OST.  This 
'first_time' flag could not be the default.
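
To make the connection-time decision concrete, here is a rough sketch in Python. It is only an illustration of the logic discussed in this thread, not Lustre code: the function name, the exception, and the `ost_is_new` and `replace_requested` parameters are all hypothetical, and the real consistency check is more involved than a simple comparison.

```python
class InconsistentLastId(Exception):
    """Raised when the OST should refuse the MDT connection (hypothetical)."""


def reconcile_last_id(ost_last_id, mdt_last_id, ost_is_new, replace_requested):
    """Return the object ID the OST should continue allocating from.

    ost_is_new        -- the OST was just formatted (the 'first_time' case)
    replace_requested -- the admin explicitly asked for this OST to take over
                         the index; assumed to be off by default
    """
    if ost_is_new and replace_requested:
        # Explicit replacement: continue from the old OST's LAST_ID so that
        # new objects can never be confused with old ones.
        return mdt_last_id
    if ost_is_new and mdt_last_id > ost_last_id:
        # A new OST claiming an index the MDT already knows, without the
        # explicit request: refuse rather than risk duplicate object IDs.
        raise InconsistentLastId(
            "MDT LAST_ID %d does not match local LAST_ID %d"
            % (mdt_last_id, ost_last_id)
        )
    # Otherwise keep the local value.
    return ost_last_id
```

The point of the second branch is that 'first_time' alone is never enough to adopt the MDT value; only the explicit request is.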

>> - the danger is if the LAST_ID file was lost for some reason (e.g. corruption
>>  causes e2fsck to erase it).  In that case, the OST startup code should be
>>  smart enough to regenerate LAST_ID based on walking the object directories,
>>  which would also avoid the need to do this in e2fsck/lfsck (which can only
>>  run offline)
>> - in cases where the on-disk LAST_ID is much lower than the MDT-supplied
>>  value, the OST should just skip precreation of all the intermediate objects
>>  and just start using the new MDT value
> 
> This seems a different feature, even if related, which is "Better handling of 
> LAST_ID corruption".

Partly, yes.
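
As an illustration of regenerating LAST_ID by walking the object directories, a minimal Python sketch follows. It assumes, for the sake of the example, that each object is stored as a file whose name is its decimal object ID somewhere under one object directory tree; the function name and the `object_root` parameter are hypothetical, and the real startup code would do this inside the OST rather than from user space.

```python
import os


def regenerate_last_id(object_root):
    """Return the highest object ID found under object_root, or 0 if none.

    Files whose names are not plain decimal numbers (for example the
    LAST_ID file itself) are ignored.
    """
    highest = 0
    for _dirpath, _dirnames, filenames in os.walk(object_root):
        for name in filenames:
            if name.isascii() and name.isdigit():
                highest = max(highest, int(name))
    return highest
```

The value found this way would then be compared with the one supplied by the MDT, as described above.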

>> - the only other thing is to avoid the case where a "new" OST is accidentally
>>  assigned the same index, when that isn't what is wanted.  There needs to be
>>  some way to "prime" the new OST (that is NOT the default for a newly
>>  formatted OST), or conversely tell the MDT that it should signal the new
>>  OST to take the place of the old one, so that there are not any mistakes
> 
> Indeed, this is important, especially if we want this to support online 
> replacement. Another option when formatting the OST?
> --replace, which is only accepted when --index is set?

Yes, that would probably be a good way to handle it from the user interface.  
The other question is how to handle this internally.  Probably a flag stored in 
the mountinfo or last_rcvd file.
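
For the user-interface side, the rule "--replace is only accepted when --index is set" could look roughly like the Python sketch below. This is a stand-alone illustration using argparse, not the mkfs.lustre option parser; the program name and function name are made up for the example.

```python
import argparse


def parse_format_options(argv):
    """Parse a hypothetical subset of OST format options."""
    parser = argparse.ArgumentParser(prog="format-ost-sketch")
    parser.add_argument("--index", type=int, default=None,
                        help="OST index to claim")
    parser.add_argument("--replace", action="store_true",
                        help="take the place of the old OST at --index")
    opts = parser.parse_args(argv)
    # --replace is not the default, and is meaningless without an explicit
    # index, so reject it unless --index was given.
    if opts.replace and opts.index is None:
        parser.error("--replace requires --index")
    return opts
```

For example, `parse_format_options(["--index", "3", "--replace"])` is accepted, while `parse_format_options(["--replace"])` exits with a usage error.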

>> Since this is something that has come up on this list a number of times in 
>> the last year, I guess it means that a Lustre filesystem is now outliving 
>> the hardware on which it runs, so it would definitely be worthwhile for 
>> someone to look at this.  I filed bug 24128 on this, in case anyone wants to 
>> work on it.
> 
> Can you also add it to the Community project list?

Done.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
