[lustre-discuss] Draining and replacing OSTs with larger volumes

Scott Wood Wed, 27 Feb 2019 18:09:49 -0800

Hi folks,

We have a big upgrade in the works and I have some questions.  Our current 
infrastructure has 5 HA pairs of OSSs and arrays, plus an HA pair of management 
and metadata servers that also share an array, all running Lustre 2.10.3.  
Pretty standard stuff.  Our upgrade plan is as follows:


1) Deploy a new HA pair of OSSs with arrays populated with OSTs that are twice 
the size of our originals.
2) Follow the process in section 14.9 of the Lustre docs to drain all OSTs in 
one of the existing HA pairs' arrays
3) Repopulate the first old pair of deactivated and drained arrays with new 
larger drives
4) Upgrade the offline OSSs from 2.10.3 to 2.10.latest?
5) Return them to service
6) Repeat steps 2-5 for the other 4 old HA pairs of OSSs and OSTs
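
For reference, here is a rough sketch of what step 2 might look like for a 
single OST.  It is a dry run that only prints the commands.  The file system 
name (testfs), mount point (/mnt/testfs) and OST index are made-up 
placeholders, and the commands are my reading of section 14.9.3, so please 
check them against the manual for your version before running anything.

```shell
#!/bin/sh
# Dry run: prints the per-OST commands for step 2 instead of executing them.
# FSNAME and MOUNT are hypothetical placeholders, not values from our site.
FSNAME=testfs
MOUNT=/mnt/testfs

# Format a decimal OST index as a target name, e.g. 4 -> testfs-OST0004.
ost_name() {
    printf '%s-OST%04x' "$FSNAME" "$1"
}

# Print the drain sequence for one OST index.
drain_steps() {
    ost=$(ost_name "$1")
    # On the MDS: stop allocating new objects on this OST.
    printf 'mds# lctl set_param osp.%s*.max_create_count=0\n' "$ost"
    # On a client: migrate existing file objects to the other OSTs.
    printf 'client# lfs find --ost %s_UUID %s | lfs_migrate -y\n' "$ost" "$MOUNT"
    # On a client: count the files still on the OST (should reach 0).
    printf 'client# lfs find --ost %s_UUID %s | wc -l\n' "$ost" "$MOUNT"
    # On a client: check per-OST usage before pulling the drives.
    printf 'client# lfs df -h %s\n' "$MOUNT"
}

drain_steps 4
```

Printing the commands rather than running them makes it easy to review the 
exact target names with the team before touching the MDS.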

I'd expect this would be doable without downtime as we'd only be taking arrays 
offline that have no objects on them, and we've added new arrays and OSSs 
before with no issues.  I have a few questions before we begin the process:

1) My interpretation of the docs is that we are OK to install them with 2.10.6 
(or 2.10.7, if it's out), as rolling upgrades within X.Y are supported.  Is 
that correct?
2) Until the whole process is complete, we'll have imbalanced OSTs.  I know 
that's not ideal, but is it all that big an issue?
3) When draining the OSTs of files, section 14.9.3, point 2.a. states that the 
lfs find | lfs_migrate pipeline can take multiple OSTs as args, but I thought 
it would be better to run one instance of that per OST and distribute them 
across multiple clients.  Is that reasonable (and faster)?
4) When the drives are replaced with bigger ones, can the original OST 
configuration files be restored to them as described in docs section 14.9.5, or 
will that be bad due to the size mismatch?
5) What questions should I be asking that I haven't thought of?
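
To make question 3 concrete, this is roughly how I would spread the per-OST 
drains across clients.  It is only a sketch: the client names, file system 
name and mount point are hypothetical, the assignment is plain round-robin, 
and it prints the plan rather than running anything.

```shell
#!/bin/sh
# Dry run: assigns one "lfs find | lfs_migrate" instance per OST to a client.
# FSNAME, MOUNT and the client names are hypothetical placeholders.
FSNAME=testfs
MOUNT=/mnt/testfs

# Print the drain command for one decimal OST index.
drain_cmd() {
    printf 'lfs find --ost %s-OST%04x_UUID %s | lfs_migrate -y\n' \
        "$FSNAME" "$1" "$MOUNT"
}

# plan_drain "client1 client2 ..." idx...
# Hands out OST indices to the listed clients round-robin.
plan_drain() {
    clients=$1
    shift
    # Count the clients in the space-separated list.
    count=0
    for c in $clients; do
        count=$((count + 1))
    done
    i=0
    for idx in "$@"; do
        # Pick the client at position (i mod count), 1-based for sed.
        pos=$((i % count + 1))
        client=$(printf '%s\n' $clients | sed -n "${pos}p")
        printf '%s: %s\n' "$client" "$(drain_cmd "$idx")"
        i=$((i + 1))
    done
}

plan_drain "client1 client2" 0 1 2 3
```

With two clients and four OSTs, each client ends up with two independent 
migrate streams.  Whether that is actually faster than one multi-OST run 
presumably depends on where the bottleneck is, which is part of what I am 
asking.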

If that all goes well, and we did upgrade the OSSs to a newer 2.10.x, we'd 
follow it up with a migration of the MGT and MDT to one of the management 
servers, upgrade the other, fail them back, upgrade the second, and rebalance 
the MDT and MGT services back across the two.  We'd expect the usual pause in 
services as those migrate but other than that, fingers crossed, should all be 
good.  Are we missing anything?

Cheers
Scott
