Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-11 Thread Chris Samuel
Hi Prentice!

On Tuesday, 12 June 2018 4:11:55 AM AEST Prentice Bisbal wrote:

> To make this work, I will be using job_submit.lua to apply this logic
> and assign each job to a partition. If a user requests a specific partition
> that is not in line with these specifications, job_submit.lua will reassign
> the job to the appropriate QOS.

Yeah, that's very much like what we do for GPU jobs (redirect them to the 
partition with access to all cores, and ensure non-GPU jobs go to the 
partition with fewer cores) via the submit filter at present.

I've already coded up something similar in Lua for our submit filter (at 
present it only affects my own jobs, for testing purposes), but I still need 
to handle memory correctly; in other words, only pack jobs when the per-task 
memory request * tasks per node < node RAM (for now we'll let jobs where 
that's not the case go through to the keeper for Slurm to handle as they are 
now).
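
For what it's worth, the check amounts to something like the following in a 
job_submit.lua filter (an untested sketch, not our real filter: the node RAM 
figure and partition name are made up, and the exact job_desc field names and 
memory semantics vary between Slurm versions, so check the job_submit/lua 
docs for your release):

-- Untested sketch: pack a job onto shared nodes only when its memory
-- request actually fits on a node.  Names and fields are placeholders.
local NODE_RAM_MB = 192 * 1024     -- assumed RAM of a compute node, in MB

function slurm_job_submit(job_desc, part_list, submit_uid)
    local mem = job_desc.pn_min_memory   -- per-node/per-CPU semantics differ by version
    local tpn = job_desc.ntasks_per_node
    if mem ~= nil and tpn ~= nil and mem * tpn < NODE_RAM_MB then
        job_desc.partition = "packed"    -- hypothetical packing partition
    end
    -- otherwise leave the job alone for Slurm to handle as it does now
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end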

However, I do think Scott's approach is potentially very useful: directing 
jobs smaller than a full node to one end of the list of nodes and jobs that 
want full nodes to the other end (especially if you also use the partition 
idea to ensure that not all nodes are accessible to small jobs).
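
For what it's worth, one rough way to bias Slurm in that direction (a sketch 
only, with made-up node and partition names, not a tested config) is to 
combine overlapping partitions with node weights, since Slurm allocates 
lower-weight nodes first:

# slurm.conf sketch - hypothetical names, illustrative only.
# Lower-weight nodes are allocated first, so small jobs tend to pack onto
# one end of the node list, leaving the higher-weight nodes more likely to
# stay free in whole-node chunks for wide jobs.
NodeName=node[001-016]  CPUs=32  Weight=10
NodeName=node[017-128]  CPUs=32  Weight=100
PartitionName=small  Nodes=node[001-016]  MaxNodes=1
PartitionName=large  Nodes=node[001-128]  PriorityTier=10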

cheers!
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-11 Thread Chris Samuel
On Sunday, 10 June 2018 1:48:18 AM AEST Skylar Thompson wrote:

> Unfortunately we don't have a mechanism to limit
> network usage or local scratch usage

Our trick in Slurm is to use the slurmd prolog script to set an XFS project
quota for that job ID on the per-job directory (created by a plugin which
also makes subdirectories there that it maps to /tmp and /var/tmp for the
job) on the XFS partition used for local scratch on the node.

If they don't request an amount via the --tmp= option then they get a default
of 100MB.  Snipping the relevant segments out of our prolog...

JOBSCRATCH=/jobfs/local/slurm/${SLURM_JOB_ID}.${SLURM_RESTART_COUNT}

if [ -d "${JOBSCRATCH}" ]; then
    # Pull the job's local scratch request (set via --tmp=) out of scontrol.
    QUOTA=$(/apps/slurm/latest/bin/scontrol show JobId=${SLURM_JOB_ID} | \
        egrep 'MinTmpDiskNode=[0-9]' | awk -F= '{print $NF}')
    # No --tmp= request, so fall back to the 100MB default.
    if [ "${QUOTA}" == "0" ]; then
        QUOTA=100M
    fi
    # Bind the per-job directory to an XFS project named after the job ID,
    # then apply a hard block quota of the requested size.
    /usr/sbin/xfs_quota -x -c "project -s -p ${JOBSCRATCH} ${SLURM_JOB_ID}" /jobfs/local
    /usr/sbin/xfs_quota -x -c "limit -p bhard=${QUOTA} ${SLURM_JOB_ID}" /jobfs/local
fi
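
(The --tmp= option there is just the standard Slurm way of asking for 
per-node temporary disk space, so a job submitted with something like 
"sbatch --tmp=2G job.sh" should, assuming the snipped parts of the prolog 
behave as described, end up with a 2G hard quota on its per-job scratch 
directory.)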

Hope that is useful!

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Fwd: Project Natick

2018-06-11 Thread Lux, Jim (337K)
Oh, the radiation dose rate on the surface (variously given as 1-100 Rad/second 
(0.01-1 Gy/s) or 5.4 Sv/day, figures which are orders of magnitude apart) means 
the first "Jupiter rise" would be spectacular, and then you'd die.  5 Sv (500 
rem) is pretty much a lethal dose.




On 6/10/18, 5:16 AM, "Beowulf on behalf of Chris Samuel" 
 wrote:

On Saturday, 9 June 2018 6:25:33 AM AEST Lux, Jim (337K) wrote:

> If you want a cluster computer at Europa, you need reliability and remote
> maintainability

I suspect you could probably find some volunteers for on-site work... ;-)

-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC



___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-11 Thread Prentice Bisbal

Chris,

I'm dealing with this problem myself right now. We use Slurm here. What we 
really have is one large, very heterogeneous cluster that's treated as 
multiple smaller clusters by creating multiple partitions, each with its 
own QOS. We also have some users who don't understand the difference 
between -n and -N when specifying a job size. This has led to jobs 
specified with -N staying in the queue for an unusually long time. Yes, 
part of the solution is definitely user education, but there are still 
times when a user should request nodes rather than tasks (when using 
OpenMP within a node, etc.).


Here's how I'm going to tackle this problem: most of our nodes are 
32-core, but some older nodes still in use are 16-core, so we're going 
to make sure that jobs going to our larger partitions request a multiple 
of 16 tasks. That way, a job will either occupy whole nodes, or leave 
half a node available.


We have one partition meant for single-node or smaller jobs. That 
partition has only Ethernet, since it shouldn't be supporting inter-node 
jobs. On that partition, jobs can use 16 cores or fewer.


To make this work, I will be using job_submit.lua to apply this logic 
and assign each job to a partition. If a user requests a specific partition 
that is not in line with these specifications, job_submit.lua will reassign 
the job to the appropriate QOS.
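
To give a flavour of it, a rough, untested sketch of that kind of 
job_submit.lua logic might look like the following (the partition names and 
job_desc field names are placeholders and differ between Slurm versions; the 
real filter will differ):

-- Illustrative job_submit.lua sketch, not the production filter.
SERIAL_PARTITION = "ethernet"   -- hypothetical single-node, Ethernet-only partition
LARGE_PARTITION  = "general"    -- hypothetical multi-node partition

function slurm_job_submit(job_desc, part_list, submit_uid)
    local ntasks = job_desc.num_tasks
    if ntasks ~= nil and ntasks <= 16 then
        -- Small jobs belong on the single-node partition.
        job_desc.partition = SERIAL_PARTITION
    elseif ntasks ~= nil and ntasks % 16 ~= 0 then
        -- Larger jobs should request a multiple of 16 tasks so they either
        -- fill whole nodes or leave half a node free.
        slurm.log_user("Please request a multiple of 16 tasks for multi-node jobs")
        return slurm.ERROR
    else
        job_desc.partition = LARGE_PARTITION
    end
    return slurm.SUCCESS
end

function slurm_job_modify(job_desc, job_rec, part_list, modify_uid)
    return slurm.SUCCESS
end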


I'll be happy to share how this works after it's been in place for a few 
months.



On 06/08/2018 03:21 AM, Chris Samuel wrote:

Hi all,

I'm curious to know what/how/where/if sites do to try and reduce the impact of
fragmentation of resources by small/narrow jobs on systems where you also have
to cope with large/wide parallel jobs?

For my purposes a small/narrow job is anything that will fit on one node
(whether a single core job, multi-threaded or MPI).

One thing we're considering is to use overlapping partitions in Slurm to have
a subset of nodes that are available to these types of jobs and then have
large parallel jobs use a partition that can access any node.

This has the added benefit of letting us set a higher priority on that
partition to let Slurm try and place those jobs first, before smaller ones.

We're already using a similar scheme for GPU jobs where they get put into a
partition that can access all 36 cores on a node whereas non-GPU jobs get put
into a partition that can only access 32 cores on a node, so effectively we
reserve 4 cores a node for GPU jobs.

But really I'm curious to know what people do about this, or do you not worry
about it at all and just let the scheduler do its best?

All the best,
Chris


___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Fwd: Project Natick

2018-06-11 Thread Prentice Bisbal


On 06/10/2018 08:15 AM, Chris Samuel wrote:

On Saturday, 9 June 2018 6:25:33 AM AEST Lux, Jim (337K) wrote:


If you want a cluster computer at Europa, you need reliability and remote
maintainability

I suspect you could probably find some volunteers for on-site work... ;-)


I know some people I'd like to volunteer for that. ;-)

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-11 Thread Skylar Thompson
On Mon, Jun 11, 2018 at 02:36:14PM +0200, John Hearns via Beowulf wrote:
> Skylar Thompson wrote:
> > Unfortunately we don't have a mechanism to limit network usage or local
> > scratch usage, but the former is becoming less of a problem with faster
> > edge networking, and we have an opt-in bookkeeping mechanism for the
> > latter that isn't enforced but works well enough to keep people happy.
> That is interesting to me. At ASML I worked on setting up Quality of
> Service, i.e. bandwidth limits, for GPFS storage and MPI traffic.
> GPFS does have QoS limits built in, but these are intended to limit the
> background housekeeping tasks rather than to limit user processes.
> But it does have the concept.
> With MPI you can configure different QoS levels for different traffic.
> 
> More relevantly, I did have a close discussion with Parav Pandit, who is
> working on the network QoS stuff.
> I am sure there is something more up to date than this:
> https://www.openfabrics.org/images/eventpresos/2016presentations/115rdmacont.pdf
> Sadly this RDMA stuff needs a recent 4-series kernel. I guess the
> discussion on whether or not you should go with a bleeding-edge kernel is
> for another time!
> But yes, cgroups have configurable network limits with the latest kernels.
> 
> Also, being cheeky (and I probably have mentioned them before), here is a
> plug for Ellexus: https://www.ellexus.com/
> Worth mentioning that I have no connection with them!

Thanks for the pointer to Ellexus - their I/O profiling does look like
something that could be useful for us. Since we're a bioinformatics shop
and mostly storage-bound rather than network-bound, we haven't really
needed to worry about node network limitations (though we have occasionally
had to worry about ToR or chassis switch limitations), but we have really
suffered at times when people assume that disk performance is limitless
and that random access is the same as sequential access.

-- 
Skylar
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-11 Thread John Hearns via Beowulf
Skylar Thompson wrote:
> Unfortunately we don't have a mechanism to limit network usage or local
> scratch usage, but the former is becoming less of a problem with faster
> edge networking, and we have an opt-in bookkeeping mechanism for the
> latter that isn't enforced but works well enough to keep people happy.
That is interesting to me. At ASML I worked on setting up Quality of
Service, i.e. bandwidth limits, for GPFS storage and MPI traffic.
GPFS does have QoS limits built in, but these are intended to limit the
background housekeeping tasks rather than to limit user processes.
But it does have the concept.
With MPI you can configure different QoS levels for different traffic.

More relevantly, I did have a close discussion with Parav Pandit, who is
working on the network QoS stuff.
I am sure there is something more up to date than this:
https://www.openfabrics.org/images/eventpresos/2016presentations/115rdmacont.pdf
Sadly this RDMA stuff needs a recent 4-series kernel. I guess the
discussion on whether or not you should go with a bleeding-edge kernel is
for another time!
But yes, cgroups have configurable network limits with the latest kernels.

Also, being cheeky (and I probably have mentioned them before), here is a
plug for Ellexus: https://www.ellexus.com/
Worth mentioning that I have no connection with them!
___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf


Re: [Beowulf] Avoiding/mitigating fragmentation of systems by small jobs?

2018-06-11 Thread Chris Samuel
On Sunday, 10 June 2018 10:33:22 PM AEST Scott Atchley wrote:

[lists]
> Yes. It may be specific to Cray/Moab.
 
No, I think that applies quite nicely to Slurm too.

> Good luck. If you want to discuss, please do not hesitate to ask. We have
> another paper pending along the same lines.

Thanks!  Much appreciated.

All the best,
Chris
-- 
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC

___
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf