Actually, this is true today regardless of this change. If two separate mpirun invocations share a node and attempt to use paffinity, they will conflict with each other. The problem isn't caused by the hostfile sub-allocation. The problem is that the two mpiruns have no knowledge of each other's actions, and hence assign node ranks to each process independently.

Thus, we would have two procs that each think they are node rank 0 and should therefore bind to processor 0, and so on up the line.
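
To make that concrete (hypothetical hostnames, and assuming affinity is turned on via the usual MCA parameter - this is just a sketch, not a recipe): suppose two users independently launch on nodeA with something like

   user1$ mpirun -np 2 --host nodeA --mca mpi_paffinity_alone 1 ./app1
   user2$ mpirun -np 2 --host nodeA --mca mpi_paffinity_alone 1 ./app2

Each mpirun independently assigns its own procs node ranks 0 and 1, so both jobs bind to processors 0 and 1 while the remaining processors sit idle.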

Obviously, if you run within one mpirun and have two app_contexts, the hostfile sub-allocation is fine - mpirun will track node rank across the app_contexts. It is only the use of multiple mpiruns that share nodes that causes the problem.
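
For example (hypothetical names again, just sketching the single-mpirun case), something like

   mpirun --hostfile hf1 -np 2 ./app1 : --hostfile hf2 -np 2 ./app2

runs both app_contexts under one mpirun, which assigns node ranks across them, so the bindings don't collide.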

Several of us have discussed this problem and have a proposed solution for 1.4. Once we get past 1.3 (someday!), we'll bring it to the group.


On Jul 28, 2008, at 10:44 AM, Tim Mattox wrote:

My only concern is how this will interact with PLPA.
Say two Open MPI jobs each use "half" the cores (slots) on a
particular node... how would they be able to bind themselves to
disjoint sets of cores?  I'm not asking you to solve this, Ralph; I'm
just pointing it out so we can perhaps warn users that if both jobs
sharing a node try to use processor affinity, we don't magically make
that work well - in fact, we would expect it to perform quite poorly.

I could see disabling paffinity and/or warning if it was enabled for
one of these "fractional" nodes.

On Mon, Jul 28, 2008 at 11:43 AM, Ralph Castain <r...@lanl.gov> wrote:
Per an earlier telecon, I have modified the hostfile behavior slightly to
allow hostfiles to subdivide allocations.

Briefly: given an allocation, we allow users to specify --hostfile on a per-app_context basis. In this mode, the hostfile info is used to filter the
nodes that will be used for that app_context. However, the prior
implementation only filtered the nodes themselves - i.e., it was a binary
filter that allowed you to include or exclude an entire node.

The change now allows you to specify a particular number of slots for a given
node, as opposed to taking -all- of the slots on that node. You are limited to
the number of slots included in the original allocation. I just realized that I
don't output a warning if you attempt to violate this condition - will do so
shortly. Rather than just abort if this happens, I set the allocation back to
that of the original - please let me know if you would prefer it to abort.
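
As a sketch of what I mean (hypothetical node names and slot counts): if the original allocation contains

   nodeA slots=8
   nodeB slots=8

then an app_context's hostfile can now contain

   nodeA slots=2
   nodeB slots=4

and that app_context will be mapped onto only 2 slots of nodeA and 4 slots of nodeB. Previously the hostfile could only say "use nodeA" or "don't use nodeA". Asking for more slots than the original allocation provides currently falls back to the original value, per the above.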

If you have interest in this behavior, please check it out and let me know
if it meets your needs.

Ralph

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/