I'm sure that Yuri is right about the corner-case complexity across all Linux 
and Spectrum/GPFS versions.

In situations with many outstanding tokens and few token managers, we have 
seen the assassination of a large-footprint mmfsd in GPFS 4.1 appear to impact 
entire clusters, potentially due to serialization in the recovery of so many 
tokens, and overlapping access among nodes. We're looking forward to fixes in 
4.2.1 to address some of this too.

But for what it's worth, on RH6/7 with 4.1, we have seen the end of OOM 
impacting GPFS since implementing the callback. One item I forgot is that we 
don't set it to -500, but to OOM_SCORE_ADJ_MIN, which on our systems is -1000. 
That causes the heuristic oom_badness to return the lowest possible score, more 
thoroughly immunizing it against selection.
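
In concrete terms, that looks something like the following (a minimal sketch, 
not our verbatim script; writing a negative oom_score_adj requires root):

```shell
#!/bin/bash
# Minimal sketch: pin every mmfs* process at the lowest possible OOM score.
# OOM_SCORE_ADJ_MIN is -1000 on current Linux kernels.
for pid in $(pgrep mmfs); do
    echo -1000 > "/proc/$pid/oom_score_adj"
done
```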

Thx
Paul

Sent with Good Work (www.good.com)


From: Yuri L Volobuev <[email protected]<mailto:[email protected]>>
Date: Tuesday, May 24, 2016, 12:17 PM
To: gpfsug main discussion list 
<[email protected]<mailto:[email protected]>>
Subject: Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5


This problem is more complex than it may seem. The thing is, mmfsd runs as 
root, and thus already possesses a certain amount of natural immunity to the 
OOM killer. So adjusting mmfsd oom_score_adj doesn't radically change the 
ranking of OOM killer victims, only tweaks it.

The way things are supposed to work is: a user process eats up a lot of 
memory, and once a threshold is hit, the OOM killer picks off the memory hog, 
and the memory is released. Unprivileged processes inherently have a higher 
OOM score, and should be killed off first. If that doesn't work, for some 
reason, the OOM killer gets desperate and starts going after root processes. 
Once things get to this point, it's tough. If you somehow manage to spare 
mmfsd per se, what's going to happen next? The OOM killer still needs a 
victim. What we've seen happen in such a situation is semi-random privileged 
process killing. mmfsd stays alive, but various other system processes are 
picked off, and pretty quickly the node is a basket case. A Linux node is not 
very resilient to random process killing. And it doesn't help that those other 
privileged processes usually don't use much memory, so killing them doesn't 
release much, and the carnage keeps on going.

The real problem is: why wasn't the non-privileged memory hog process killed 
off first, before root processes became fair game? This is where things get 
pretty complicated, and depend heavily on the Linux version. There's one 
specific issue that did get diagnosed. If a process is using mmap and has page 
faults going that result in GPFS IO, on older versions of GPFS the process 
would fail to error out after a SIGKILL, due to locking complications spanning 
the Linux kernel VMM and GPFS mmap code. This means the OOM killer would 
attempt to kill a process, but that wouldn't produce the desired result (the 
process is still around), and the OOM killer keeps moving down the list. This 
problem has been fixed in the current GPFS service levels. It is possible that 
a similar problem may exist that prevents a memory hog process from erroring 
out. I strongly encourage opening a PMR to investigate such a situation, 
instead of trying to work around it without understanding why mmfsd was 
targeted in the first place.

This is the case of prevention being the best cure. Where we've seen success is 
customers using cgroups to prevent user processes from running a node out of 
memory in the first place. This has been shown to work well. Dealing with the 
fallout from running out of memory is a much harder task.
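
As an illustration only (the 4G figure is an arbitrary example, not a 
recommendation), on a systemd-based distribution a user job can be capped with 
something like:

```shell
# Run a user job in its own transient cgroup with a hard memory cap, so the
# OOM killer fires inside the job's cgroup rather than hunting system-wide.
# "memory_hungry_job" is a placeholder for the actual workload.
systemd-run --scope -p MemoryLimit=4G ./memory_hungry_job
```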

The post-mmfsd-kill symptoms that are described in the original note are not 
normal. If an mmfsd process is killed, other nodes will become aware of this 
fact fairly quickly, and the node is going to be expelled from the cluster 
(yes, expels *can* be a good thing). In the normal case, TCP/IP sockets are 
closed as soon as mmfsd is killed, and other nodes immediately receive TCP RST 
packets, and close their connection endpoints. In the worst case, if a node 
just becomes catatonic, but RST is not sent out, the troubled node is going to 
be expelled from the cluster after about 2 minutes of pinging (in a default 
configuration). There should definitely not be a permanent hang that 
necessitates a manual intervention.

Again, older versions of GPFS had no protection against surprise OOM thread 
kills, but in the current code some self-monitoring capabilities have been 
added, and a single troubled node won't have a lasting impact on the cluster. 
If you aren't running with a reasonably current level of GPFS 3.5 service, I 
strongly recommend upgrading. If you see the symptoms originally described 
with the current code, that's a bug that we need to fix, so please open a PMR 
to address the issue.

yuri


From: "Sanchez, Paul" <[email protected]>
To: gpfsug main discussion list <[email protected]>,
Date: 05/24/2016 07:33 AM
Subject: Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
Sent by: [email protected]

________________________________



Hi Peter,

This is mentioned explicitly in the Spectrum Scale docs 
(http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.pdg.doc/bl1pdg_kerncfg.htm?lang=en)
 as a problem for the admin to consider, and many of us have been bitten by 
this. There are references going back at least to GPFS 3.1 in 2008 on 
developerworks complaining about this situation.

While the answer you described below is essentially what we do as well, I would 
argue that this is a problem which IBM should just own and fix for everyone. I 
cannot think of a situation in which you would want GPFS to be sacrificed on a 
node due to out-of-memory conditions, and I have seen several terrible 
consequences of this, including loss of cached, user-acknowledged writes.

I don't think there are any real gotchas. But in addition, our own 
implementation also:

* uses "--event preStartup" instead of "startup", since it runs earlier and 
reduces the risk of a race

* reads the score back out and complains if it hasn't been set

* includes "set -e" to ensure that errors will terminate the script and return 
a non-zero exit code to the callback parent
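
Put together, a sketch of that hardened version might look like this 
(hypothetical script assembled from the bullets above, not our verbatim code):

```shell
#!/bin/bash
# Any failure aborts the script and returns non-zero to the callback parent.
set -e

ADJ=-500

for pid in $(pgrep mmfs); do
    echo "$ADJ" > "/proc/$pid/oom_score_adj"
    # Read the score back and complain if it didn't stick.
    if [ "$(cat "/proc/$pid/oom_score_adj")" != "$ADJ" ]; then
        echo "oom_score_adj not set for pid $pid" >&2
        exit 1
    fi
done
```

registered with "--event preStartup" rather than "--event startup".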

Thx
Paul

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Peter Childs
Sent: Tuesday, May 24, 2016 10:01 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] OOM Killer killing off GPFS 3.5

Hi All,

We have an issue where Linux kills off GPFS first when a computer runs out of 
memory. We are running GPFS 3.5.

We believe this happens when user processes have exhausted memory and swap and 
the out of memory killer in Linux chooses to kill the GPFS daemon as the 
largest user of memory, due to its large pinned memory footprint.

This means that when GPFS is killed, the whole cluster blocks for a minute or 
so before resuming operation. This is not ideal, and it causes issues across 
most of the cluster.

What we see is users unable to login elsewhere on the cluster until we have 
powered off the node. We believe this is because while the node is still 
pingable, GPFS doesn't expel it from the cluster.

This issue mainly occurs on the login nodes of our HPC cluster but can affect 
the rest of the cluster when it occurs.

I've seen others on list with this issue.

We've come up with a solution to adjust the OOM score of GPFS, so that it is 
unlikely to be the first thing to be killed, and hopefully the OOM killer picks 
a user process instead.

We've tested this and it seems to work. I'm asking here firstly to share our 
knowledge and secondly to ask if there is anything we've missed with this 
solution.

It's short, which is part of its beauty.

/usr/local/sbin/gpfs-oom_score_adj

<pre>
#!/bin/bash

for proc in $(pgrep mmfs); do
    echo -500 > /proc/$proc/oom_score_adj
done
</pre>

This can then be called automatically on GPFS startup with the following:

<pre>
mmaddcallback startupoomkiller --command /usr/local/sbin/gpfs-oom_score_adj --event startup
</pre>

and either restart GPFS or just run the script on all nodes.
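
For the latter, assuming mmdsh is available (it ships with GPFS), something 
like this should push it out cluster-wide:

```shell
# Run the adjustment script on every node; assumes the script is already
# installed at the same path everywhere.
mmdsh -N all /usr/local/sbin/gpfs-oom_score_adj
```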

Peter Childs
ITS Research Infrastructure
Queen Mary, University of London
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

