Hi Gabe,

On Apr 16, 2012, at 11:44 AM, Resneck, Gabriel M (388J) wrote:
>
> To use Chris's words, when using the "fresh-out-of-the-box" version of the
> RM, both of the concepts of Capacity and Load are entirely arbitrary.

I'd clarify that while the default values set for these concepts are arbitrary, the concepts themselves are not. Capacity is used by the AssignmentMonitor and is a core property of the ResourceNode class. Load is leveraged by the AssignmentMonitor to determine how busy a given ResourceNode currently is.

> They have no relation to any kind of resources available on your node
> machines.

Well, again, the default out-of-the-box values for these concepts don't, but the concepts themselves do.

> Therefore, if you give each job a load of 1 (regardless of the node resources
> required to run the job) and if you give a node a capacity of 10, the RM will
> try to always have 10 jobs running on that node.
> It does nothing to track resource usage on the node, so use of such a
> paradigm as the one that I just described could be wildly inefficient.

Let's clarify that again. Saying it *does nothing* doesn't sound right to me. It *does* do something. It tracks how much load is currently on a node, compared to its current capacity, and provides that information as-is to the Scheduler, which in turn uses it in a node "besting" algorithm to decide which node to batch a job out to. So, it does *do something*. It's just that it's not real-time; it's more of a virtual profiling.

And, let's be specific. The XMLAssignmentMonitor decides how this information will be used, provided, and tracked. This is just one potential implementation of the AssignmentMonitor RM extension point. We could (and should) develop a Ganglia resource monitor that leverages Ganglia information and plugs in here. And we could develop a TorqueAssignmentMonitor that uses qmon or something like it to parse the information out of Torque's queue.

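As a rough illustration of the load-versus-capacity bookkeeping described above, here is a minimal Python sketch. It is not OODT code: `Node`, `VirtualLoadMonitor`, and `pick_node` are hypothetical names, and the selection rule (pick the node with the most spare capacity) is only an assumption standing in for whatever "besting" algorithm the Scheduler actually uses.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a resource node: a name plus a declared capacity."""
    name: str
    capacity: int


class VirtualLoadMonitor:
    """Tracks declared load per node. It never inspects the real machine."""

    def __init__(self, nodes: Iterable[Node]) -> None:
        self.nodes = {n.name: n for n in nodes}
        self.load = {name: 0 for name in self.nodes}

    def spare(self, name: str) -> int:
        # Spare capacity is purely the declared capacity minus the declared load.
        return self.nodes[name].capacity - self.load[name]

    def assign(self, name: str, job_load: int) -> None:
        if job_load > self.spare(name):
            raise ValueError(f"node {name} has no room for load {job_load}")
        self.load[name] += job_load

    def release(self, name: str, job_load: int) -> None:
        # Called when a job finishes; never let the tracked load go negative.
        self.load[name] = max(0, self.load[name] - job_load)


def pick_node(monitor: VirtualLoadMonitor, job_load: int) -> Optional[str]:
    """Assumed selection rule: the node with the most spare capacity that fits the job."""
    candidates = [n for n in monitor.nodes if monitor.spare(n) >= job_load]
    if not candidates:
        return None  # every node is "full" according to the declared numbers
    return max(candidates, key=monitor.spare)
```

With a node of capacity 10 and jobs of load 1, ten calls to `assign` fill the node and `pick_node` stops offering it, regardless of what the machine is really doing. A monitor backed by live data (Ganglia, a batch system's queue) would replace only the `spare` calculation.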
We could also connect in to Sun Grid Engine (SGE) or another DRM technology to get this information too.

> Because these numbers are arbitrary, I recommend carefully investigating the
> availability of resources on your nodes and setting load and capacity levels
> using that information. For example, if you find that your jobs tend to be
> I/O bound when you have more than 3 running simultaneously on the same node,
> then you could set your job load to 1 and the node capacity to 3. If you
> wanted more granularity, you could easily set the load to 33 and the capacity
> to 100. Since these numbers are entirely arbitrary, you have the freedom to
> make such changes. Obviously, not all jobs will be the same, so you may want
> to assign different loads to different jobs and assign different capacities
> to nodes based upon the resources that each makes available.

Exactly. And to add to that, you can group different jobs into different queues, and then map queues to nodes, to control the flow of jobs onto those nodes based on a "queue type".

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
