Re: [lustre-discuss] Rebooting storage nodes while jobs are running?

Harr, Cameron Fri, 01 Mar 2019 10:58:12 -0800

We have multiple compute clusters that mount each of our Lustre file
systems, and we do OS/kernel updates on them without regard to each
other. Sometimes a client cluster is updated at the same time as one
of the Lustre clusters, but often it is not. This approach generally
works fine: jobs and file accesses simply hang until recovery on the
file system is finished.
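
For anyone who wants to watch for that recovery to finish rather than
guess, here is a rough Python sketch. It assumes the servers expose the
usual recovery_status parameter through lctl and that its output contains
a "status:" line per target (COMPLETE, RECOVERING, ...). The function
names and the sample output format are illustrative assumptions, not
something taken from our site scripts, so check them against your Lustre
version.

```python
import subprocess

SUFFIX = ".recovery_status="


def parse_recovery_status(text):
    """Map target name -> status from `lctl get_param` style output.

    Assumed input shape (verify on your own servers):
        obdfilter.lustre-OST0000.recovery_status=
        status: COMPLETE
    """
    statuses = {}
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith(SUFFIX):
            # Drop the suffix, then drop the leading "obdfilter." / "mdt." part.
            name = line[: -len(SUFFIX)]
            target = name.split(".", 1)[1] if "." in name else name
        elif line.startswith("status:") and target is not None:
            statuses[target] = line.split(":", 1)[1].strip()
    return statuses


def recovery_finished(statuses):
    """True only if at least one target was seen and none is still recovering."""
    return bool(statuses) and all(s != "RECOVERING" for s in statuses.values())


def read_recovery_status(pattern="obdfilter.*.recovery_status"):
    """Run lctl on a server node; the default pattern is an assumption for an OSS."""
    result = subprocess.run(
        ["lctl", "get_param", pattern],
        capture_output=True, text=True, check=True,
    )
    return parse_recovery_status(result.stdout)
```

On an OSS one could poll recovery_finished(read_recovery_status()) every
few seconds after the reboot; on an MDS the pattern would presumably be
mdt.*.recovery_status instead.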


On 2/27/19 8:05 AM, Carlson, Timothy S wrote:
> I will say YMMV.  I've rebooted storage nodes and have had mixed results 
> where we land in one of three buckets:
>
> 1) Codes breeze through, having just been stuck in D state while the OSSs reboot
> 2) RPCs get stuck somewhere and when the OSS comes back I eventually have to 
> force an abort_recovery
> 3) A code dies by not handling the timeout (not sure if this is due to the 
> code itself or the client improperly handling the timeout)
>
> On our current setup with around 1000 clients, 50ish OSSs, and 2.5.x vintage 
> Lustre servers, I would say option 1 is by far the most common outcome (>95%). 
> Options 2 and 3 happen from time to time, with likelihood greater than 0.
>
> It's always a best practice to take a scheduled outage for a kernel/version 
> upgrade. You never know what oddity your particular setup might encounter.
>
> Tim
>
> -----Original Message-----
> From: lustre-discuss <[email protected]> On Behalf Of 
> Paul Edmon
> Sent: Wednesday, February 27, 2019 7:54 AM
> To: [email protected]
> Subject: Re: [lustre-discuss] Rebooting storage nodes while jobs are running?
>
> From experience, rebooting the storage nodes is fine; the processes 
> accessing them will just hang until the nodes are restored.  I've done this 
> many times on our cluster with no ill effect.
>
> That said, I have not tried it with kernel upgrades or Lustre release changes. 
> Those may do something different and unexpected. Someone else on the list 
> may have insight on these.
>
> -Paul Edmon-
>
> On 2/27/19 10:17 AM, Bernd Melchers wrote:
>> Hi all,
>> our environment: CentOS-7.6, [email protected], 2 mds, 7 ods, 180 
>> clients.
>>
>> Is it possible to reboot the mds and ods server (e.g. for new kernel
>> or new lustre releases) without affecting running jobs on the client nodes?
>> The reboot can take up to 15 minutes. Will the clients wait for
>> the storage nodes to reappear, or will I/O operations get errors?
>> Is the behaviour of a client influenced by the timeout parameter
>> ("lctl get_param timeout") or by other parameters?
>>
>> Kind regards
>> Bernd Melchers
>>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
