Re: 2.6.32-504.1.3.el6.x86_64 and cgroup BUG ?

Andy Wettstein Wed, 11 Mar 2015 07:49:21 -0700

Hi,

I've seen a similar problem with Slurm on various kernels:
http://bugs.schedmd.com/show_bug.cgi?id=1242


This is likely a kernel bug that has existed for a long time. I found a
mailing list message from November of 2011 with similar problems:
https://lists.linux-foundation.org/pipermail/containers/2011-November/028382.html

I finally decided to just disable cgroup enforcement in Slurm and use an
alternate Slurm method for killing jobs that go over the memory limit.
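
To illustrate the difference: with cgroup enforcement the kernel caps a
job's memory, whereas a polling approach samples the job's resident memory
at intervals and kills the job once it exceeds its limit. Below is a minimal
Python sketch of the polling idea only. It is not Slurm's actual code; the
function names (rss_kib, job_rss_kib, over_limit, enforce) are hypothetical,
and it assumes the list of PIDs belonging to the job is already known.

```python
import os
import signal


def rss_kib(status_text):
    """Return VmRSS in KiB from the text of /proc/<pid>/status, or 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:      1024 kB"
            return int(line.split()[1])
    return 0  # kernel threads have no VmRSS line


def job_rss_kib(pids, proc_root="/proc"):
    """Sum the resident memory of all processes that belong to one job."""
    total = 0
    for pid in pids:
        try:
            with open(os.path.join(proc_root, str(pid), "status")) as fh:
                total += rss_kib(fh.read())
        except OSError:
            continue  # the process exited between two polls
    return total


def over_limit(pids, limit_kib, proc_root="/proc"):
    """True if the job's summed RSS exceeds its limit (both in KiB)."""
    return job_rss_kib(pids, proc_root) > limit_kib


def enforce(pids, limit_kib):
    """Kill every process of the job if the job is over its memory limit."""
    if not over_limit(pids, limit_kib):
        return False
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except OSError:
            pass  # already gone, or not ours to kill
    return True
```

A daemon would call enforce() for each running job every few seconds. The
trade-off is that a job can overshoot its limit between two polls, which
cgroup enforcement would not allow.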

I did not file a bug with Red Hat at the time.

Andy


On Mon, Mar 02, 2015 at 02:49:52PM +0100, Andreas Haupt wrote:
> Hi Arnau,
> 
> On Monday, 02.03.2015 at 10:59 +0100, Arnau Bria wrote:
> > In our case the only option is to downgrade. The bug affects any kind
> > of node and is not predictable, so the only alternative (if we want to
> > run a newer kernel) is removing cgroups support.
> > So in our case we can live with an old kernel version.
> 
> Since we are obviously hitting a race condition here, I wonder if you
> could gather some statistics. Only a small fraction of jobs is affected.
> In our case the chance of a crash seems to increase when more than one
> job finishes at about the same time.
> 
> Do you observe something similar?
> 
> Cheers,
> Andreas
> -- 
> | Andreas Haupt            | E-Mail: [email protected]
> |  DESY Zeuthen            | WWW:    http://www-zeuthen.desy.de/~ahaupt
> |  Platanenallee 6         | Phone:  +49/33762/7-7359
> |  D-15738 Zeuthen         | Fax:    +49/33762/7-7216

-- 
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
