Not now, but in a previous role, we would specifically increase the OOM score on compute processes on our cluster that could consume a large amount of RAM, trying to protect system processes. Once we did this, we had zero system processes die.
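A minimal sketch of that approach (hypothetical: the process name "my_solver" and the score of 500 are placeholders for your own compute jobs, and the proc root is parameterized only so the example can run without touching the real /proc):

```shell
#!/bin/bash
# Sketch: raise oom_score_adj for memory-hungry compute jobs so the
# OOM killer prefers them over system daemons. "my_solver" and the
# score of 500 are hypothetical; tune both for your workload.
set_oom_score_adj() {
    local pid="$1" adj="$2" proc_root="${3:-/proc}"
    echo "$adj" > "$proc_root/$pid/oom_score_adj"
}

# Real use would be something like:
#   for pid in $(pgrep -x my_solver); do set_oom_score_adj "$pid" 500; done
# Demo against a fake proc tree so this sketch is runnable anywhere:
fake_proc=$(mktemp -d)
mkdir -p "$fake_proc/4242"
set_oom_score_adj 4242 500 "$fake_proc"
cat "$fake_proc/4242/oom_score_adj"
```

A positive oom_score_adj adds directly to the badness score, so flagged jobs are chosen well before root daemons.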
On 25 May 2016 at 17:00, Sanchez, Paul <[email protected]> wrote:

> I'm sure that Yuri is right about the corner-case complexity across all Linux and Spectrum/GPFS versions.
>
> In situations where lots of outstanding tokens exist, and there are few token managers, we have seen the assassination of a large-footprint mmfsd in GPFS 4.1 seem to impact entire clusters, potentially due to serialization in recovery of so many tokens, and overlapping access among nodes. We're looking forward to fixes in 4.2.1 to address some of this too.
>
> But for what it's worth, on RH6/7 with 4.1, we have seen the end of OOM impacting GPFS since implementing the callback. One item I forgot is that we don't set it to -500, but to OOM_SCORE_ADJ_MIN, which on our systems is -1000. That causes the heuristic oom_badness() to return the lowest possible score, more thoroughly immunizing it against selection.
>
> Thx
> Paul
>
> Sent with Good Work (www.good.com)
>
> *From: *Yuri L Volobuev <[email protected]>
> *Date: *Tuesday, May 24, 2016, 12:17 PM
> *To: *gpfsug main discussion list <[email protected]>
> *Subject: *Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
>
> This problem is more complex than it may seem. The thing is, mmfsd runs as root, and thus already possesses a certain amount of natural immunity to the OOM killer. So adjusting mmfsd's oom_score_adj doesn't radically change the ranking of OOM killer victims, it only tweaks it. The way things are supposed to work is: a user process eats up a lot of memory, and once a threshold is hit, the OOM killer picks off the memory hog, and the memory is released. Unprivileged processes inherently have a higher OOM score, and should be killed off first. If that doesn't work, for some reason, the OOM killer gets desperate and starts going after root processes. Once things get to this point, it's tough. If you somehow manage to spare mmfsd per se, what's going to happen next? The OOM killer still needs a victim.
> What we've seen happen in such a situation is semi-random privileged process killing. mmfsd stays alive, but various other system processes are picked off, and pretty quickly the node is a basket case. A Linux node is not very resilient to random process killing. And it doesn't help that those other privileged processes usually don't use much memory, so killing them doesn't release much, and the carnage keeps on going. The real problem is: why wasn't the non-privileged memory hog process killed off first, before root processes became fair game? This is where things get pretty complicated, and depend heavily on the Linux version. There's one specific issue that did get diagnosed. If a process is using mmap and has page faults going that result in GPFS IO, on older versions of GPFS the process would fail to error out after a SIGKILL, due to locking complications spanning the Linux kernel VMM and GPFS mmap code. This means the OOM killer would attempt to kill a process, but that wouldn't produce the desired result (the process would still be around), and the OOM killer would keep moving down the list. This problem has been fixed in the current GPFS service levels. It is possible that a similar problem may exist that prevents a memory hog process from erroring out. I strongly encourage opening a PMR to investigate such a situation, instead of trying to work around it without understanding why mmfsd was targeted in the first place.
>
> This is a case of prevention being the best cure. Where we've seen success is customers using cgroups to prevent user processes from running a node out of memory in the first place. This has been shown to work well. Dealing with the fallout from running out of memory is a much harder task.
>
> The post-mmfsd-kill symptoms that are described in the original note are not normal.
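For context, the cgroup prevention Yuri mentions can be sketched roughly as follows (a hypothetical cgroup-v1 example; the "userjobs" group name and 48 GiB limit are made up, and only the size arithmetic actually runs here, since the real commands need root and a mounted memory controller):

```shell
#!/bin/bash
# Sketch: cap user job memory with the cgroup v1 memory controller so
# jobs hit their own limit before the node-wide OOM path is reached.
# The group name and the 48 GiB figure are illustrative only.
gib_to_bytes() { echo $(( $1 * 1024 * 1024 * 1024 )); }
limit=$(gib_to_bytes 48)
echo "memory.limit_in_bytes would be set to $limit"

# Real use (requires root and the v1 memory controller mounted):
#   mkdir -p /sys/fs/cgroup/memory/userjobs
#   echo "$limit" > /sys/fs/cgroup/memory/userjobs/memory.limit_in_bytes
#   echo "$job_pid" > /sys/fs/cgroup/memory/userjobs/tasks
```

With a cap like this in place, allocations fail (or the OOM killer acts inside the cgroup) before mmfsd or other system daemons are ever endangered.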
> If an mmfsd process is killed, other nodes will become aware of this fact fairly quickly, and the node is going to be expelled from the cluster (yes, expels *can* be a good thing). In the normal case, TCP/IP sockets are closed as soon as mmfsd is killed, and other nodes immediately receive TCP RST packets and close their connection endpoints. In the worst case, if a node just becomes catatonic but RST is not sent out, the troubled node is going to be expelled from the cluster after about 2 minutes of pinging (in a default configuration). There should definitely not be a permanent hang that necessitates a manual intervention. Again, older versions of GPFS had no protection against surprise OOM thread kills, but in the current code some self-monitoring capabilities have been added, and a single troubled node won't have a lasting impact on the cluster. If you aren't running with a reasonably current level of GPFS 3.5 service, I strongly recommend upgrading. If you see the symptoms originally described with the current code, that's a bug that we need to fix, so please open a PMR to address the issue.
>
> yuri
>
> From: "Sanchez, Paul" <[email protected]>
> To: gpfsug main discussion list <[email protected]>
> Date: 05/24/2016 07:33 AM
> Subject: Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
> Sent by: [email protected]
> ------------------------------
>
> Hi Peter,
>
> This is mentioned explicitly in the Spectrum Scale docs (http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.pdg.doc/bl1pdg_kerncfg.htm?lang=en) as a problem for the admin to consider, and many of us have been bitten by this.
> There are references going back at least to GPFS 3.1 in 2008 on developerWorks complaining about this situation.
>
> While the answer you described below is essentially what we do as well, I would argue that this is a problem which IBM should just own and fix for everyone. I cannot think of a situation in which you would want GPFS to be sacrificed on a node due to out-of-memory conditions, and I have seen several terrible consequences of this, including loss of cached, user-acknowledged writes.
>
> I don't think there are any real gotchas. But in addition, our own implementation also:
>
> * uses "--event preStartup" instead of "startup", since it runs earlier and reduces the risk of a race
>
> * reads the score back out and complains if it hasn't been set
>
> * includes "set -e" to ensure that errors will terminate the script and return a non-zero exit code to the callback parent
>
> Thx
> Paul
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Peter Childs
> Sent: Tuesday, May 24, 2016 10:01 AM
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
>
> Hi All,
>
> We have an issue where Linux kills off GPFS first when a computer runs out of memory. We are running GPFS 3.5.
>
> We believe this happens when user processes have exhausted memory and swap, and the out-of-memory killer in Linux chooses to kill the GPFS daemon as the largest user of memory, due to its large pinned memory footprint.
> This means that GPFS is killed and the whole cluster blocks for a minute before it resumes operation. This is not ideal, and causes issues with most of the cluster.
>
> What we see is users unable to log in elsewhere on the cluster until we have powered off the node. We believe this is because, while the node is still pingable, GPFS doesn't expel it from the cluster.
>
> This issue mainly occurs on the login nodes of our HPC cluster but can affect the rest of the cluster when it occurs.
>
> I've seen others on the list with this issue.
>
> We've come up with a solution to adjust the OOM score of GPFS, so that it is unlikely to be the first thing to be killed, and hopefully the OOM killer picks a user process instead.
>
> We've tested this and it seems to work. I'm asking here firstly to share our knowledge and secondly to ask if there is anything we've missed with this solution.
>
> It's short, which is part of its beauty.
>
> /usr/local/sbin/gpfs-oom_score_adj
>
> <pre>
> #!/bin/bash
>
> for proc in $(pgrep mmfs); do
>     echo -500 > "/proc/$proc/oom_score_adj"
> done
> </pre>
>
> This can then be called automatically on GPFS startup with the following:
>
> <pre>
> mmaddcallback startupoomkiller --command /usr/local/sbin/gpfs-oom_score_adj --event startup
> </pre>
>
> and either restart GPFS or just run the script on all nodes.
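Combining that script with Paul's suggestions earlier in the thread (preStartup, read-back verification, set -e) might look something like this sketch; the -1000 value assumes OOM_SCORE_ADJ_MIN on your kernel, and the fake proc tree exists only so the example runs anywhere:

```shell
#!/bin/bash
# Hypothetical hardened variant of /usr/local/sbin/gpfs-oom_score_adj:
# set -e propagates any failure as a non-zero exit to the callback
# parent, and each write is read back to confirm it took effect.
set -e
ADJ=-1000   # OOM_SCORE_ADJ_MIN on most Linux systems

adjust_and_verify() {
    local pid="$1" proc_root="${2:-/proc}"
    echo "$ADJ" > "$proc_root/$pid/oom_score_adj"
    if [ "$(cat "$proc_root/$pid/oom_score_adj")" != "$ADJ" ]; then
        echo "failed to set oom_score_adj for pid $pid" >&2
        exit 1
    fi
}

# Real use: for pid in $(pgrep mmfs); do adjust_and_verify "$pid"; done
# Demo against a fake proc tree so the sketch runs anywhere:
fake_proc=$(mktemp -d)
mkdir -p "$fake_proc/999"
adjust_and_verify 999 "$fake_proc"
echo "verified: $(cat "$fake_proc/999/oom_score_adj")"
```

It would then be registered with `mmaddcallback startupoomkiller --command /usr/local/sbin/gpfs-oom_score_adj --event preStartup` rather than `--event startup`, per Paul's note about the race.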
> Peter Childs
> ITS Research Infrastructure
> Queen Mary, University of London
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
