On 11/10/2012 12:38 AM, Lans Carstensen wrote:
Greetings!  On Fri, Nov 9, 2012 at 9:16 PM, Brian Bockelman
<[email protected]> wrote:

On Nov 9, 2012, at 11:49 PM, Dan Bradley <[email protected]> wrote:

Hi all

I was thinking about Condor's lack of useful support for setting cpu affinity 
with partitionable slots.  We do this on our non-partitionable slots to avoid 
the inevitable accidents where jobs try to use the whole machine from a 
single-core slot.  We'd like to be able to do the same on the partitionable 
slots.

My first question is whether cpu shares in cgroups make the above use-case of 
cpu affinity obsolete.

Hi Dan,

Yeah, it's pretty fun watching users try to do this on our cluster.  They don't 
get particularly far, but they at least soak up any extra idle CPU cycles on 
the node.


If not, then it would be really nice to have cpu affinity working in the 
partitionable slot world.  The problem is that all dynamic slots under a single 
partitionable slot get assigned to the same set of cpus.  It seems to me that 
the startd needs to manage the partitioning of the cpu set when it creates the 
dynamic slots.

Are there plans for generic support for this sort of non-fungible resource 
partitioning?  Implementing this specific case does not sound very hard, as 
long as we (at least initially) just use a first-fit strategy and do not worry 
about optimizing which cores go best together for multi-core jobs.  I 
think it could even be done without adding any new configuration knobs (gasp).
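
To make the first-fit idea concrete, here is a minimal sketch of the 
bookkeeping I'd expect; NCPUS, cpu_in_use, and claim_cpus are illustrative 
names, not actual startd code:

#include <stdio.h>
#include <stdbool.h>

#define NCPUS 16                /* cores owned by the partitionable slot */
static bool cpu_in_use[NCPUS];  /* one claimed-flag per core */

/* First fit: claim the first n free cores, storing their ids in out[]. */
static bool claim_cpus(int n, int out[])
{
    int found = 0;
    for (int i = 0; i < NCPUS && found < n; i++) {
        if (!cpu_in_use[i]) {
            cpu_in_use[i] = true;
            out[found++] = i;
        }
    }
    if (found == n)
        return true;
    for (int i = 0; i < found; i++)   /* not enough free cores: roll back */
        cpu_in_use[out[i]] = false;
    return false;
}

/* Release the cores when the dynamic slot is destroyed. */
static void release_cpus(int n, const int cpus[])
{
    for (int i = 0; i < n; i++)
        cpu_in_use[cpus[i]] = false;
}

int main(void)
{
    int slot1[4], slot2[2];
    if (claim_cpus(4, slot1))
        printf("slot1_1 -> cores %d..%d\n", slot1[0], slot1[3]);
    if (claim_cpus(2, slot2))
        printf("slot1_2 -> cores %d,%d\n", slot2[0], slot2[1]);
    release_cpus(4, slot1);
    return 0;
}

The set of cores to hand out would come straight from the slot's existing 
cpus count, which is why no new knobs seem necessary.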


How are you going to assign CPU sets?  The traditional syscall route or the 
cgroup route?  It strikes me that, if you go the cgroup route, you actually 
could repack processes on different cores later to optimize topology.
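
For concreteness, a rough sketch of the two routes for pinning a job to cores 
0-1.  The cgroup mount point and group path below are assumptions and vary by 
distro, and with cgroups, cpuset.mems must also be populated before tasks can 
be attached:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

/* Route 1: traditional syscall.  The mask is set per process, so
 * repacking later means another syscall into every process. */
static int pin_with_syscall(pid_t pid)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    CPU_SET(1, &mask);
    return sched_setaffinity(pid, sizeof(mask), &mask);
}

/* Route 2: cpuset cgroup.  Rewriting cpuset.cpus later migrates every
 * task in the group, which is what makes repacking cheap. */
static int pin_with_cgroup(pid_t pid)
{
    /* Path is an assumption, not a fixed HTCondor convention. */
    const char *base = "/sys/fs/cgroup/cpuset/htcondor/slot1_1";
    char path[256];
    snprintf(path, sizeof(path), "%s/cpuset.cpus", base);
    FILE *f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "0-1\n");
    fclose(f);
    snprintf(path, sizeof(path), "%s/tasks", base);
    f = fopen(path, "w");
    if (!f) return -1;
    fprintf(f, "%d\n", (int)pid);
    return fclose(f);
}

int main(void)
{
    pid_t self = getpid();
    if (pin_with_syscall(self) != 0)
        perror("sched_setaffinity");
    if (pin_with_cgroup(self) != 0)
        perror("cgroup cpuset");
    return 0;
}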

In the end, because we don't optimize core topology, I don't particularly think 
this is any better than using cgroups (in fact, it's slightly worse: you may 
hold cores unnecessarily idle, and you can't oversubscribe the node).  I guess 
it's something for people without cgroups?

How does this interact with cgroups?  Right now, we have:

MEMORY_LIMIT=[soft|hard|none]

for the policy of what to do when the job goes over the memory limit.  Maybe we 
also have:

CPU_LIMIT=[soft|hard|none]

?
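
To make that concrete, one plausible mapping onto cgroup v1 knobs.  The 
paths, the shares-per-core choice, and the quota arithmetic are all 
assumptions, not existing HTCondor behavior:

#include <stdio.h>
#include <string.h>

static int write_knob(const char *dir, const char *knob, long value)
{
    char path[256];
    snprintf(path, sizeof(path), "%s/%s", dir, knob);
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    fprintf(f, "%ld\n", value);
    return fclose(f);
}

/* Map the hypothetical CPU_LIMIT policy onto cgroup v1 knobs for a
 * slot that was allocated ncpus cores. */
static int apply_cpu_limit(const char *dir, const char *policy, int ncpus)
{
    if (strcmp(policy, "soft") == 0)
        /* 1024 shares per core: a proportional weight with no ceiling,
         * so an over-eager job still only steals idle cycles. */
        return write_knob(dir, "cpu.shares", 1024L * ncpus);
    if (strcmp(policy, "hard") == 0)
        /* quota = ncpus * the default 100 ms period: an absolute
         * ceiling, enforced even when the rest of the machine is idle. */
        return write_knob(dir, "cpu.cfs_quota_us", 100000L * ncpus);
    return 0;  /* "none": leave the controller at its defaults */
}

int main(void)
{
    /* Example: a 2-core dynamic slot under a hard cap (path assumed). */
    return apply_cpu_limit("/sys/fs/cgroup/cpu/htcondor/slot1_1",
                           "hard", 2);
}

With "soft", jobs can still soak up idle cycles, just in proportion to their 
shares; "hard" throttles them at their allocation even on an idle machine.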

Brian

So, you might also want to look at "numad" from RHEL 6.3+ and recent
Fedora releases for inspiration:

http://git.fedorahosted.org/cgit/numad.git/

It uses cgroups and "fixes" long-running processes (see
bind_process_and_migrate_memory()), but it also provides "numad -w
NCPUS[:MB]", which passes back an argument for use with numactl to
pack reservations as well as possible.
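
For example, the handshake could look something like this.  The advice 
format here is my reading of the numad -w output, so verify it against your 
numad version, and /bin/true stands in for the real job:

#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char nodes[64];

    /* Ask numad where a 4-core / 2048 MB reservation fits best. */
    FILE *p = popen("numad -w 4:2048", "r");
    if (!p || !fgets(nodes, sizeof(nodes), p))
        return 1;
    pclose(p);
    nodes[strcspn(nodes, "\n")] = '\0';  /* e.g. "0" or "0-1" */

    /* Bind CPUs and memory to the advised nodes, then run the job. */
    execlp("numactl", "numactl", "-N", nodes, "-m", nodes,
           "/bin/true", (char *)NULL);
    perror("execlp numactl");
    return 1;
}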

-- Lans Carstensen

And for the optimization use case (the other being protection/isolation via affinity or shares), numad can be used to actively manage a set of processes.

http://fedoraproject.org/wiki/Features/numad

Best,


matt

