Hi,

I've seen a similar problem with Slurm on various kernels:
http://bugs.schedmd.com/show_bug.cgi?id=1242

This is likely a kernel bug that has existed for a long time. I found a mailing list message from November 2011 describing similar problems:
https://lists.linux-foundation.org/pipermail/containers/2011-November/028382.html

I finally decided to just disable cgroup enforcement in Slurm and use an alternate Slurm method for killing jobs that go over the memory limit. I did not file a bug with Red Hat at the time.

Andy

On Mon, Mar 02, 2015 at 02:49:52PM +0100, Andreas Haupt wrote:
> Hi Arnau,
>
> On Monday, 02.03.2015, at 10:59 +0100, Arnau Bria wrote:
> > In our case the only option is downgrade. the bug affects any kind of
> > node and is not predictable, so the only option (if we want to run
> > newer kernel) is removing cgroups support.
> > So in our case we can live with an old kernel version.
>
> As we encounter a race condition here obviously, I wonder if you could
> find out some statistics. It is really just a small fraction of jobs
> that are affected here. In our case it looks like the chance for a crash
> is increased if more than 1 job finishes at some point in time.
>
> Do you observe something similar?
>
> Cheers,
> Andreas
> --
> | Andreas Haupt    | E-Mail: [email protected]
> | DESY Zeuthen     | WWW:    http://www-zeuthen.desy.de/~ahaupt
> | Platanenallee 6  | Phone:  +49/33762/7-7359
> | D-15738 Zeuthen  | Fax:    +49/33762/7-7216

--
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
