We have multiple compute clusters that mount each of our Lustre file systems, and we do OS/kernel updates on them without regard to one another. Sometimes a client cluster is updated at the same time as one of the Lustre clusters, but often it is not. This approach generally works fine: jobs and file accesses simply hang until recovery on the file system has finished.
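For anyone who wants to watch this from the client side: accesses that hang like this usually show up as processes in uninterruptible sleep (the "D state" mentioned in the quoted message below). Here is a minimal sketch, assuming a Linux client with /proc mounted; the function names are my own, it is not a Lustre tool, and a process can sit in D state for reasons unrelated to Lustre.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left].strip())
    comm = line[left + 1:right]
    state = line[right + 1:].split()[0]
    return pid, comm, state


def d_state_processes(proc_root="/proc"):
    """List (pid, comm) for processes currently in uninterruptible sleep."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to report
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as /proc/meminfo
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or stat was unreadable mid-scan
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in d_state_processes():
        print(pid, comm)
```

Running it on a client while a server is rebooting and again after recovery should give a rough picture of which jobs were blocked and whether they resumed.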

On 2/27/19 8:05 AM, Carlson, Timothy S wrote:
> I will say YMMV. I've rebooted storage nodes and have had mixed results,
> where we land in one of three buckets:
>
> 1) Codes breeze through, having just been stuck in D state while the OSSs
>    reboot
> 2) RPCs get stuck somewhere and when the OSS comes back I eventually have to
>    force an abort_recovery
> 3) A code dies by not handling the timeout (not sure if this is due to the
>    code itself or the client improperly handling the timeout)
>
> On our current setup with around 1000 clients, 50ish OSS, and 2.5.x vintage
> Lustre servers I would say option 1 is by far the largest percentage (>95%).
> 2 and 3 happen from time to time, with likelihood greater than 0.
>
> It's always a best practice to take a scheduled outage for a kernel/version
> upgrade. You never know what oddity your particular setup might encounter.
>
> Tim
>
> -----Original Message-----
> From: lustre-discuss <[email protected]> On Behalf Of
> Paul Edmon
> Sent: Wednesday, February 27, 2019 7:54 AM
> To: [email protected]
> Subject: Re: [lustre-discuss] Rebooting storage nodes while jobs are running?
>
> From experience, rebooting the storage nodes is fine; the processes
> accessing them will just hang until the nodes are restored. I've done this
> many times on our cluster with no ill effect.
>
> That said, I have not tried it with kernel upgrades or Lustre release changes.
> That may do something different and unexpected. Someone else on the list
> may have insight on these.
>
> -Paul Edmon-
>
> On 2/27/19 10:17 AM, Bernd Melchers wrote:
>> Hi all,
>> our environment: CentOS-7.6, [email protected], 2 mds, 7 ods, 180
>> clients.
>>
>> Is it possible to reboot the mds and ods servers (e.g. for a new kernel
>> or a new Lustre release) without affecting running jobs on the client nodes?
>> The reboot can take up to 15 minutes. Do the clients wait for
>> the storage nodes to reappear, or will I/O operations get errors?
>> Is the behaviour of a client influenced by the timeout parameter
>> ("lctl get_param timeout") or by other parameters?
>>
>> Kind regards
>> Bernd Melchers
>>
> _______________________________________________
> lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
