Hi Gabe,

On Apr 16, 2012, at 11:44 AM, Resneck, Gabriel M (388J) wrote:

> 
> To use Chris's words, when using the "fresh-out-of-the-box" version of the 
> RM, both of the concepts of Capacity and Load are entirely arbitrary.  

I'd clarify that while the default values set for these concepts are arbitrary, 
the concepts themselves are not. Capacity is used by the AssignmentMonitor and 
is a core property of the ResourceNode class. Load is used by the 
AssignmentMonitor to determine how busy a given ResourceNode currently is.

> They have no relation to any kind of resources available on your node 
> machines.  

Well, again, the default out-of-the-box values for these concepts don't, but 
the concepts themselves do.

> Therefore, if you give each job a load of 1 (regardless of the node resources 
> required to run the job) and if you give a node a capacity of 10, the RM will 
> try to always have 10 jobs running on that node.
>  It does nothing to track resource usage on the node, so use of such a 
> paradigm as the one that I just described could be wildly inefficient.

Let's clarify that again. Saying it *does nothing* doesn't sound right to me. 
It *does* do something: it tracks how much load is currently on a node, 
compared to that node's capacity, and provides that information as-is to the 
Scheduler. The Scheduler in turn uses it in a node "besting" algorithm to 
decide which node to batch a job out to. It's just that this is virtual 
profiling rather than real-time monitoring. And, let's be specific: the 
XMLAssignmentMonitor decides how this information is tracked, used, and 
provided. It is just one potential implementation of the AssignmentMonitor RM 
extension point.
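
To make that bookkeeping concrete, here is a rough Python sketch of the idea. 
It is not the OODT code: the class, the method names, and the "most free 
capacity wins" selection rule are all made up for illustration, and the real 
besting algorithm may rank nodes differently.

```python
class VirtualLoadMonitor:
    """Tracks the load that jobs *declare*; it never measures the machine."""

    def __init__(self, capacities):
        # capacities: node id -> configured capacity (an arbitrary unit)
        self.capacities = dict(capacities)
        self.loads = {node: 0 for node in self.capacities}

    def free(self, node):
        """Capacity not yet claimed by assigned jobs."""
        return self.capacities[node] - self.loads[node]

    def assign(self, node, job_load):
        """Record a job on a node; refuse if it would exceed capacity."""
        if job_load > self.free(node):
            return False
        self.loads[node] += job_load
        return True

    def release(self, node, job_load):
        """Give the load back when a job finishes."""
        self.loads[node] = max(0, self.loads[node] - job_load)


def pick_node(monitor, job_load):
    """Hypothetical selection rule: the fitting node with the most free capacity."""
    candidates = [n for n in monitor.capacities if monitor.free(n) >= job_load]
    if not candidates:
        return None
    return max(candidates, key=monitor.free)
```

With capacities of 10 and 4 and a load of 8 already on the first node, a job 
of load 3 would go to the second node, since only it has enough free capacity 
left. Nothing here looks at CPU, memory, or I/O, which is the "virtual" part.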

We could (and should) develop a Ganglia resource monitor that plugs in and 
leverages Ganglia information. We could also develop a TorqueAssignmentMonitor 
that uses qmon or something like it to parse the information out of Torque's 
queue, or connect to Sun Grid Engine (SGE) or another DRM technology to get 
this information.
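
As a sketch of what "plugging in" could look like, the Python below separates 
where the load number comes from and what is done with it. Everything in it is 
hypothetical: LoadSource, both implementations, and has_room are illustration 
names, and fetch_metrics stands in for a real Ganglia, Torque, or SGE query 
that is deliberately not shown.

```python
from abc import ABC, abstractmethod


class LoadSource(ABC):
    """Hypothetical extension point: anything that can report a node's load."""

    @abstractmethod
    def current_load(self, node):
        ...


class DeclaredLoadSource(LoadSource):
    """Virtual profiling: load is whatever the submitted jobs declared."""

    def __init__(self):
        self._loads = {}

    def add(self, node, job_load):
        self._loads[node] = self._loads.get(node, 0) + job_load

    def current_load(self, node):
        return self._loads.get(node, 0)


class MeasuredLoadSource(LoadSource):
    """Load is read from an external monitor via a caller-supplied function."""

    def __init__(self, fetch_metrics):
        # fetch_metrics: callable returning {node id: measured load}.
        # It is a placeholder for a real monitoring query (assumption).
        self._fetch_metrics = fetch_metrics

    def current_load(self, node):
        return self._fetch_metrics().get(node, 0)


def has_room(source, node, capacity, job_load):
    """The capacity check is identical whichever source is plugged in."""
    return source.current_load(node) + job_load <= capacity
```

The point of the split is that the capacity check stays the same while the 
source of the load figure changes from declared values to measured ones.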


> Because these numbers are arbitrary, I recommend carefully investigating the 
> availability of resources on your nodes and setting load and capacity levels 
> using that information.  For example, if you find that your jobs tend to be 
> I/O bound when you have more than 3 running simultaneously on the same node, 
> then you could set your job load to 1 and the node capacity to 3.  If you 
> wanted more granularity, you could easily set the load to 33 and the capacity 
> to 100.  Since these numbers are entirely arbitrary, you have the freedom to 
> make such changes.  Obviously, not all jobs will be the same, so you may want 
> to assign different loads to different jobs and assign different capacities 
> to nodes based upon the resources that each makes available.

Exactly. And to add to that, you can group different jobs into different 
queues, and then assign queues to nodes, to control the flow of jobs onto 
those nodes based on a "queue type".
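
Here is a small Python sketch of the arithmetic in Gabe's example, plus the 
queue idea. The queue, job, and node names are invented for illustration, and 
the dictionaries are only a stand-in for however the queue-to-node mapping is 
actually configured.

```python
def max_concurrent_jobs(capacity, job_load):
    """How many identical jobs fit on a node before capacity is exhausted."""
    return capacity // job_load


# Hypothetical mapping: each job type goes to a queue, each queue to nodes.
job_to_queue = {"ingest": "io-heavy", "reprocess": "cpu-heavy"}
queue_to_nodes = {"io-heavy": ["node1"], "cpu-heavy": ["node2", "node3"]}


def eligible_nodes(job_name):
    """Nodes a job may run on, determined by the queue it was submitted to."""
    return queue_to_nodes[job_to_queue[job_name]]
```

A load of 1 against a capacity of 3 and a load of 33 against a capacity of 100 
both allow 3 jobs at once. The larger scale only pays off when jobs differ: a 
heavier job could be given a load of 50, so that only 2 of them fit.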

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
