Working its way around the CMR process now.
Might be easier in the future if we could test/debug this in the
trunk, though. Otherwise, the CMR procedure will fall behind and a fix
might miss a release window.
Anyway, hopefully this one will make the 1.3.0 release cutoff.
Thanks
Ralph
On Dec 8, 2008, at 9:56 AM, Greg Watson wrote:
Hi Ralph,
This is now in 1.3rc2, thanks. However there are a couple of
problems. Here is what I see:
[Jarrah.watson.ibm.com:58957] <noderesolve name="node0"
resolved="Jarrah.watson.ibm.com">
For some reason each line is prefixed with "[...]", any idea why
this is? Also the end tag should be "/>" not ">".
Thanks,
Greg
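For a consumer of this output, a tolerant parser is one way to cope with both problems Greg describes until they are fixed. The following is a minimal sketch, not code from PTP or Open MPI: the function name `parse_noderesolve` is hypothetical, and it assumes the prefix always has the bracketed form shown in the sample above and that a tag may wrap across lines.

```python
import re

# Optional "[host:pid] " prefix seen in front of each line in the 1.3rc2 output.
_PREFIX = re.compile(r"^\[[^\]]*\]\s*")
# Accept both the ">" ending seen above and a proper "/>" ending.
_TAG = re.compile(r"<noderesolve\s+(?P<attrs>[^<>]*?)\s*/?>")
_ATTR = re.compile(r'(\w+)="([^"]*)"')


def parse_noderesolve(text):
    """Return a list of (name, resolved) pairs found in the given output."""
    # Drop the per-line prefix, then rejoin so a wrapped tag matches as one unit.
    cleaned = " ".join(_PREFIX.sub("", line.strip()) for line in text.splitlines())
    pairs = []
    for match in _TAG.finditer(cleaned):
        attrs = dict(_ATTR.findall(match.group("attrs")))
        if "name" in attrs and "resolved" in attrs:
            pairs.append((attrs["name"], attrs["resolved"]))
    return pairs


# The sample line from the message above, including the line wrap.
sample = (
    '[Jarrah.watson.ibm.com:58957] <noderesolve name="node0"\n'
    'resolved="Jarrah.watson.ibm.com">'
)
print(parse_noderesolve(sample))
```

Using a regular expression rather than an XML parser is deliberate here, since the unclosed tag would make a strict parser reject the input.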
On Nov 24, 2008, at 3:06 PM, Greg Watson wrote:
Great, thanks. I'll take a look once it comes over to 1.3.
Cheers,
Greg
On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote:
Yo Greg
This is in the trunk as of r20032. I'll bring it over to 1.3 in a
few days.
I implemented it as another MCA param
"orte_show_resolved_nodenames" so you can actually get the info as
you execute the job, if you want. The xml tag is "noderesolve" -
let me know if you need any changes.
Ralph
On Oct 22, 2008, at 11:55 AM, Greg Watson wrote:
Ralph,
I guess the issue for us is that we will have to run two commands
to get the information we need. One to get the configuration
information, such as version and MCA parameters, and one to get
the host information, whereas it would seem more logical that
this should all be available via some kind of "configuration
discovery" command. I understand the issue with supplying the
hostfile though, so maybe this just points at the need for us to
separate configuration information from the host information. In
any case, we'll work with what you think is best.
Greg
On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:
Hmmm...just to be sure we are all clear on this. The reason we
proposed to use mpirun is that "hostfile" has no meaning outside
of mpirun. That's why ompi_info can't do anything in this regard.
We have no idea what hostfile the user may specify until we
actually get the mpirun cmd line. They may have specified a
default-hostfile, but they could also specify hostfiles for the
individual app_contexts. These may or may not include the node
upon which mpirun is executing.
So the only way to provide you with a separate command to get a
hostfile<->nodename mapping would require you to provide us with
the default-hostfile and/or hostfile cmd line options just as if
you were issuing the mpirun cmd. We just wouldn't launch - but
it would be the exact equivalent of doing "mpirun --do-not-launch".
Am I missing something? If so, please do correct me - I would be
happy to provide a tool if that would make it easier. Just not
sure what that tool would do.
Thanks
Ralph
On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:
Ralph,
It seems a little strange to be using mpirun for this, but
barring providing a separate command, or using ompi_info, I
think this would solve our problem.
Thanks,
Greg
On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:
Sorry for delay - had to ponder this one for a while.
Jeff and I agree that adding something to ompi_info would not
be a good idea. Ompi_info has no knowledge or understanding of
hostfiles, and adding that capability to it would be a major
distortion of its intended use.
However, we think we can offer an alternative that might
better solve the problem. Remember, we now treat hostfiles in
a very different manner than before - see the wiki page for a
complete description, or "man orte_hosts".
So the problem is that, to provide you with what you want, we
need to "dump" the information from whatever default-hostfile
was provided, and, if no default-hostfile was provided, then
the information from each hostfile that was provided with an
app_context.
The best way we could think of to do this is to add another
mpirun cmd line option --dump-hostfiles that would output the
line-by-line name from the hostfile plus the name we resolved
it to. Of course, --xml would cause it to be in xml format.
Would that meet your needs?
Ralph
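To make the proposal concrete, here is a rough sketch of the kind of mapping such an option would report: each hostfile entry paired with the name it resolves to. This is not the mpirun implementation. The function names are hypothetical, the hostfile syntax handled (one host per line, "#" comments, trailing fields such as slots=N) is an assumption, and the result is only meaningful when run on the node where mpirun would execute.

```python
import socket


def resolve_name(entry):
    """Resolve a hostfile entry to a canonical name; fall back to the entry itself."""
    try:
        return socket.gethostbyname_ex(entry)[0]
    except OSError:
        return entry


def dump_hostfile(lines, resolver=resolve_name):
    """Yield (hostfile_entry, resolved_name) for each host line."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # strip comments and blank lines
        if not line:
            continue
        entry = line.split()[0]  # ignore trailing fields such as slots=N
        yield entry, resolver(entry)


# Example with a stand-in resolver so the output does not depend on local DNS.
fake_names = {"node0": "Jarrah.watson.ibm.com"}
hostfile = ["# head node", "node0 slots=4", "", "10.0.0.2  # compute node"]
for entry, resolved in dump_hostfile(hostfile, lambda e: fake_names.get(e, e)):
    print(entry, "->", resolved)
```

Passing the resolver in as a parameter keeps the parsing separate from the lookup, which is where the head node and a remote client can disagree.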
On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:
Hi Ralph,
We've been discussing this back and forth a bit internally
and don't really see an easy solution. Our problem is that
Eclipse is not running on the head node, so gethostbyname
will not necessarily resolve to the same address. For
example, the hostfile might refer to the head node by an
internal network address that is not visible to the outside
world. Since gethostbyname also looks in /etc/hosts, it may
resolve locally but not on a remote system. The only thing I
can think of would be, rather than us reading the hostfile
directly as we do now, to provide an option to ompi_info that
would dump the hostfile using the same rules that you apply
when you're using the hostfile. Would that be feasible?
Greg
On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:
Sorry for delay - was on vacation and am now trying to work
my way back to the surface.
I'm not sure I can fix this one for two reasons:
1. In general, OMPI doesn't really care what name is used
for the node. However, the problem is that it needs to be
consistent. In this case, ORTE has already used the name
returned by gethostname to create its session directory
structure long before mpirun reads a hostfile. This is why
we retain the value from gethostname instead of allowing it
to be overwritten by the name in whatever allocation we are
given. Using the name in hostfile would require that I
either find some way to remember any prior name, or that I
tear down and rebuild the session directory tree - neither
seems attractive nor simple (e.g., what happens when the
user provides multiple entries in the hostfile for the node,
each with a different IP address based on another interface
in that node? Sounds crazy, but we have already seen it done
- which one do I use?).
2. We don't actually store the hostfile info anywhere - we
just use it and forget it. For us to add an XML attribute
containing any hostfile-related info would therefore require
us to re-read the hostfile. I could have it do that -only-
in the case of "XML output required", but it seems rather
ugly.
An alternative might be for you to simply do a
"gethostbyname" lookup of the IP address or hostname to see
if it matches instead of just doing a strcmp. This is what
we have to do internally as we frequently have problems with
FQDN vs. non-FQDN vs. IP addresses etc. If the local OS
hasn't cached the IP address for the node in question it can
take a little time to DNS resolve it, but otherwise works
fine.
I can point you to the code in OPAL that we use - I would
think something similar would be easy to implement in your
code and would readily solve the problem.
Ralph
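The suggestion above, comparing names by what they resolve to rather than by strcmp, might look something like the sketch below. It is not the OPAL code Ralph refers to; the function names are hypothetical, and it assumes two names denote the same node when they share at least one resolved address.

```python
import socket


def addresses_of(name):
    """Return the set of IP addresses a hostname or IP string resolves to."""
    try:
        infos = socket.getaddrinfo(name, None)
    except OSError:
        return set()  # unresolvable names match nothing
    return {info[4][0] for info in infos}


def same_host(a, b, lookup=addresses_of):
    """True if the strings are equal or share at least one resolved address."""
    if a == b:
        return True
    return bool(lookup(a) & lookup(b))


# Example with a stand-in lookup table so the result does not depend on DNS.
table = {
    "Jarrah.watson.ibm.com": {"10.0.0.1"},
    "10.0.0.1": {"10.0.0.1"},
    "node1": {"10.0.0.2"},
}
print(same_host("Jarrah.watson.ibm.com", "10.0.0.1", lambda n: table.get(n, set())))
print(same_host("Jarrah.watson.ibm.com", "node1", lambda n: table.get(n, set())))
```

As noted in the message, the first lookup of an uncached name can take a moment, so a caller comparing many names may want to cache the results of `addresses_of`.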
On Sep 19, 2008, at 7:18 AM, Greg Watson wrote:
Ralph,
The problem we're seeing is just with the head node. If I
specify a particular IP address for the head node in the
hostfile, it gets changed to the FQDN when displayed in the
map. This is a problem for us as we need to be able to
match the two, and since we're not necessarily running on
the head node, we can't always do the same resolution
you're doing.
Would it be possible to use the same address that is
specified in the hostfile, or alternatively provide an XML
attribute that contains this information?
Thanks,
Greg
On Sep 11, 2008, at 9:06 AM, Ralph Castain wrote:
Not in that regard, depending upon what you mean by
"recently". The only changes I am aware of wrt nodes
consisted of some changes to the order in which we use the
nodes when specified by hostfile or -host, and a little
#if protectionism needed by Brian for the Cray port.
Are you seeing this for every node? Reason I ask: I can't
offhand think of anything in the code base that would
replace a host name with the FQDN because we don't get
that info for remote nodes. The only exception is the head
node (where mpirun sits) - in that lone case, we default
to the name returned to us by gethostname(). We do that
because the head node is frequently accessible on a more
global basis than the compute nodes - thus, the FQDN is
required to ensure that there is no address confusion on
the network.
If the user refers to compute nodes in a hostfile or -host
(or in an allocation from a resource manager) by non-FQDN,
we just assume they know what they are doing and the name
will correctly resolve to a unique address.
On Sep 10, 2008, at 9:45 AM, Greg Watson wrote:
Hi,
Has the behavior of the -display-map option changed recently in
the 1.3 branch? We're
now seeing the host name as a fully resolved DN rather
than the entry that was specified in the hostfile. Is
there any particular reason for this? If so, would it be
possible to add the hostfile entry to the output since we
need to be able to match the two?
Thanks,
Greg
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel