Greetings:
Ralph brings up some good points here. I have a few thoughts/experiences.
First, I like the way things are behaving now. In fact, I take full
advantage of the fact that different aliases for a node are treated as
different nodes to do some scalability testing. This is how I fake out
ORTE and have it start multiple daemons on a node. (We had a similar
feature in our old ClusterTools runtime environment to get multiple
daemons running on a single node.)
For example, I do this to get 4 orteds running on "alachua":
mpirun -np 4 -host alachua,alachua-1,alachua-2,alachua-3 hostname
All of the above resolve to the same IP address.
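For reference, such aliases could be set up with a hosts-file entry like
the following (the address and exact setup here are hypothetical; DNS or
NIS aliases pointing at the same address work just as well):

    # hypothetical /etc/hosts entry: every alias maps to the same address
    192.168.1.10    alachua alachua-1 alachua-2 alachua-3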
Secondly, I would not want us to make any change that negatively affects
scalability. If we do decide to make a change, then we need a flag to
revert to the original behaviour.
Lastly, I guess I have two questions.
1. Are you sure that Open MPI behaves in "unexpected ways"? This all
worked fine for me, as I stated above.
2. Do you have any more details on the cost of "resolving every name"?
Which API is it that causes the problems? I only ask because I have
been trying to understand some of the NIS traffic I see when running
on my cluster.
Thanks,
Rolf
Ralph Castain wrote:
Yo all
A recent email thread on the devel list involved (in part) the question of
hostname resolution. [Note: I have a fix for the localhost problem described
in that thread - just need to chase down a memory corruption problem, so it
won't come into the trunk until next week]
This is a problem that has troubled us since the beginning, and we have gone
back-and-forth on solutions. Rather than just throwing another code change
into the system, Jeff and I thought it might be a good idea to seek input
from the community.
The problem is that our system requires a consistent way of identifying
nodes so we can tell if, for example, we already have a daemon on that node.
We currently do that via a string hostname. This appears to work just fine
in managed environments as the allocators are (usually?) consistent in how
they name a node.
However, users are frequently not consistent, which causes a problem. For
example, users can create a hostfile entry for "foo.bar.net", and then put
"-host foo" on their command line. In Open MPI, these will be treated as two
completely separate nodes.
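To make that concrete, here is a hypothetical hostfile and command line
showing the mismatch (the names and slot count are made up):

    # my_hostfile
    foo.bar.net slots=2

    mpirun -np 2 -hostfile my_hostfile -host foo hostname

Because "foo" and "foo.bar.net" do not compare equal as strings, they are
treated as two different nodes even though they name the same machine.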
In the past, we attempted to solve this by actually resolving every name
provided to us. However, resolving names of remote hosts can be a very
expensive function call, especially at scale. One solution we considered was
to only do this for non-managed environments - i.e., when provided names in
a hostfile or via -host. This was rejected on the grounds that it penalized
people who used those mechanisms and, in many cases, wasn't necessary
because users were careful to avoid ambiguity.
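As a rough sketch of what "resolving every name" means (illustrative only,
not the actual ORTE code), each name we are handed would go through
something like getaddrinfo() with AI_CANONNAME, and for a remote host that
generally means a DNS or NIS round trip:

    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>

    /* Resolve "name" to its canonical hostname.  Returns 0 on success.
     * Sketch only - not the actual ORTE implementation. */
    static int canonicalize(const char *name, char *out, size_t len)
    {
        struct addrinfo hints, *res;
        memset(&hints, 0, sizeof(hints));
        hints.ai_flags = AI_CANONNAME;   /* ask for the canonical name */

        /* For a remote host this typically costs a DNS/NIS lookup */
        if (getaddrinfo(name, NULL, &hints, &res) != 0) {
            return -1;
        }
        strncpy(out, res->ai_canonname ? res->ai_canonname : name, len - 1);
        out[len - 1] = '\0';
        freeaddrinfo(res);
        return 0;
    }

Comparing the canonical names returned this way would make "foo" and
"foo.bar.net" match, but at the cost of one lookup per name provided -
which is what hurts at scale.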
But that leaves us with an unsolved problem that can cause Open MPI to
behave in unexpected ways, including possibly hanging. Of course, we could
just check names for matches in that first network name field - this would
solve the "foo" vs "foo.bar.net" problem, but creates a vulnerability (what
if we have both "foo.bar.net" and "foo.no-bar.net" in our hostfile?) that
may or may not be acceptable (I'm sure it is at least uncommon for an MPI
app to cross subnet boundaries, but maybe someone is really doing this in
some rsh-based cluster).
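For reference, that "first field" comparison amounts to something like the
following (a sketch only, not proposed code):

    #include <string.h>

    /* Compare two hostnames on their first label only, i.e. up to the
     * first '.'.  "foo" and "foo.bar.net" match - but so would
     * "foo.bar.net" and "foo.no-bar.net", which is the vulnerability
     * mentioned above.  Sketch only. */
    static int first_label_match(const char *a, const char *b)
    {
        size_t la = strcspn(a, ".");
        size_t lb = strcspn(b, ".");
        return la == lb && 0 == strncmp(a, b, la);
    }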
Or we could go back to fully resolving names provided via non-managed
channels. Or we just tell people that "you *must* be consistent in how you
identify nodes". Or....?
Any input would be appreciated.
Ralph