Hi Brian,

I don’t *think* you can entirely solve this problem with Moab … as I mentioned, 
it’s not nearly as efficient as SLURM is at killing jobs when they exceed 
requested memory.  We had situations where a user would be able to run a node 
out of memory before Moab would kill it.  Hasn’t happened once with SLURM, 
AFAIK.

But with either Moab or SLURM what we’ve done is taken the amount of physical 
RAM in the box and subtracted from that the amount of memory we want to 
“reserve” for the system (OS, GPFS, etc.) and then told Moab / SLURM that this 
is how much RAM the box has.  That way they at least won’t schedule jobs on the 
node that would exceed available memory.
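In SLURM terms, that means setting RealMemory below the physical total in slurm.conf. A minimal sketch (node names and sizes are hypothetical — here a 256 GB node is advertised as 248 GB, holding back 8 GB for the OS and GPFS):

```
# slurm.conf (fragment) -- hypothetical node with 256 GB of physical RAM.
# Advertise only 253952 MB (248 GB); the scheduler will never pack jobs
# past that, leaving ~8 GB of headroom for the OS and GPFS.
NodeName=node[001-100] CPUs=32 RealMemory=253952 State=UNKNOWN
```

Newer SLURM releases also have a per-node MemSpecLimit parameter that reserves memory for system daemons explicitly; either way, jobs can't be scheduled into the reserved headroom.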

HTH…

Kevin

On Dec 20, 2016, at 11:07 AM, Brian Marshall <[email protected]> wrote:

We use Adaptive's Moab / Torque right now but are thinking about going to SLURM.

Brian

On Dec 20, 2016 11:38 AM, "Buterbaugh, Kevin L" <[email protected]> wrote:
Hi Brian,

It would be helpful to know what scheduling software, if any, you use.

We were a PBS / Moab shop for a number of years but switched to SLURM two years 
ago.  With both you can configure the maximum amount of memory available to all 
jobs on a node.  So we simply “reserve” however much we need for GPFS and
other “system” processes.

I can tell you that SLURM is *much* more efficient at killing processes as soon 
as they exceed the amount of memory they’ve requested than PBS / Moab ever 
dreamed of being.
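If you want that enforcement in SLURM, it's normally done with the cgroup task plugin — a sketch of the two config fragments involved (not necessarily our exact setup):

```
# slurm.conf (fragment): delegate task containment to cgroups
TaskPlugin=task/cgroup

# cgroup.conf (fragment): hard-limit each job to the memory it requested
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
AllowedSwapSpace=0
```

With this in place a job that exceeds its requested memory is killed by the kernel's cgroup OOM handling rather than being left to run the node out of RAM.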

Kevin

On Dec 20, 2016, at 10:27 AM, Skylar Thompson <[email protected]> wrote:

We're a Grid Engine shop, and use cgroups (m_mem_free) to control user process 
memory
usage. In the GE exec host configuration, we reserve 4GB for the OS
(including GPFS) so jobs are not able to consume all the physical memory on
the system.
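For reference, that exec host configuration looks roughly like the following (hostname and sizes are hypothetical, and m_mem_free must already be defined as a consumable complex):

```
# Advertise only 124 GB of the m_mem_free consumable on a hypothetical
# 128 GB exec host, reserving 4 GB for the OS and GPFS.
qconf -mattr exechost complex_values m_mem_free=124G node01

# Jobs then request memory against that consumable, e.g.:
qsub -l m_mem_free=8G job.sh
```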

On Tue, Dec 20, 2016 at 11:25:04AM -0500, Brian Marshall wrote:
All,

What is your favorite method for stopping a user process from eating up all
the system memory and saving 1 GB (or more) for the GPFS / system
processes?  We have always kicked around the idea of cgroups but never
moved on it.
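Without a scheduler in the loop, the cgroup idea can be prototyped with a systemd drop-in that caps everything under user.slice, leaving headroom for GPFS and the OS (path and sizes are hypothetical; assumes a systemd host with the memory cgroup controller available):

```
# /etc/systemd/system/user.slice.d/10-memcap.conf (hypothetical drop-in)
# Cap all user sessions at 62 GB on a 64 GB node, reserving ~2 GB
# for GPFS and other system processes.
[Slice]
MemoryMax=62G
MemorySwapMax=0
```

After a `systemctl daemon-reload`, user processes that push past the cap get OOM-killed inside the slice instead of starving the whole node.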

The problem:  A user launches a job which uses all the memory on a node,
which causes the node to be expelled, which causes brief filesystem
slowness everywhere.

I bet this problem has already been solved and I am just googling the wrong
search terms.


Thanks,
Brian

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


--
-- Skylar Thompson ([email protected])
-- Genome Sciences Department, System Administrator
-- Foege Building S046, (206) 685-7354
-- University of Washington School of Medicine



—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
[email protected] - (615) 875-9633









