On Wed, 11 Mar 2015 09:46:37 -0500 Andy Wettstein wrote:
> Hi,

Hi Andy,

> I've seen a similar problem with Slurm on various kernels:
> http://bugs.schedmd.com/show_bug.cgi?id=1242

This is not the same issue as the one we are seeing:

- In our case the system reboots.
- I see it when many jobs finish at the same time, not when jobs finish one by one.

The cgroups setup had been working until the last kernel upgrade.

> This is likely a kernel bug that has existed for a long time. I found
> a mailing list message from November of 2011 with similar problems:
> https://lists.linux-foundation.org/pipermail/containers/2011-November/028382.html

Well, in my case it works perfectly with the "old" kernel 2.6.32-431.29.2.el6.x86_64, so it seems that something has been fixed since 2011.

> I finally decided to just disable cgroup enforcement in slurm and use
> an alternate slurm method for killing jobs that go over the memory
> limit.

I use cgroups not only for limiting memory usage; I also like the resource usage isolation (cpusets).

> I did not file a bug with redhat at the time.

It seems that RH accepted Andrea's bug, so there does seem to be something wrong there.

> Andy

Cheers,
Arnau
