Greetings Patrick. Many thanks for the detailed run-down; sorry I
didn't reply earlier.
This is definitely a known problem, and I'm pretty sure we have an
open ticket on it (I'm on a plane right now and can't check the
web-based bug tracker). We have a solution in mind for the issue,
but it hasn't been done yet, mainly because it hasn't bubbled up high
enough in priority and no one has had the time to code it up.
How high of a priority is the ability to re-home an OMPI installation
for you?
On Dec 8, 2006, at 8:53 AM, Patrick Jessee wrote:
Hello. For OpenMPI 1.1.2, I've come across a situation where the
--prefix syntax does not seem to be working. I've investigated the
issue by stepping through the mpirun startup in a debugger. Below
is a summary of the problem and details about the investigation
(along with a prospective fix).
Summary of problem
===============
When starting an Open MPI run with the --prefix option, the MPI
application does not start up correctly in certain situations. An
important point is that this problem is masked (and not seen) if the
Open MPI libraries are available at the compile/install-time location
defined by OPAL_PKGLIBDIR (defined in
opal/include/opal/install_dirs.h). So in debugging the problem, it is
important to move the Open MPI installation away from its installed
location and then set the --prefix value to the new location. In
addition, LD_LIBRARY_PATH needs to be set to the new location so mpirun
can find liborte.so and libopal.so at program load time. (--prefix can't
help mpirun with liborte.so and libopal.so because (a) these libs are
dynamically linked into mpirun and are needed at program load time, and
(b) the --prefix arg isn't processed until after load time. Thus
LD_LIBRARY_PATH is needed for mpirun, but this is tangential.)
The behavior that is seen is the following output:
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):
orte_sds_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS
:
:
--------------------------------------------------------------------------
Open RTE was unable to initialize properly. The error occurred while
attempting to orte_init(). Returned value -13 instead of
ORTE_SUCCESS.
--------------------------------------------------------------------------
Investigation of the problem
===================
As mentioned before, I've looked at mpirun in the debugger. The
instances of mpirun (and the MPI app) find the dynamically linked
libraries (liborte.so, libopal.so) just fine, but they do not locate
the dynamically loaded ones (the ones in lib/openmpi, such as
mca_paffinity_linux.so). The --prefix directory does not seem to be
used when opening the libraries in lib/openmpi.
It appears that the location to search is getting set in
mca_base_open.c around line 68 (1.1.2):
    asprintf(&value, "%s:~/.openmpi/components", OPAL_PKGLIBDIR);
    mca_base_param_component_path =
        mca_base_param_reg_string_name("mca", "component_path",
            "Path where to look for Open MPI and ORTE components",
            false, false, value, NULL);
Here, OPAL_PKGLIBDIR is a fixed, compile-time location. It appears
that the --prefix directory (actually <prefix_dir>/lib/openmpi)
needs to be appended, if not prepended, to the component_path.
Alternatively, the static OPAL_PKGLIBDIR directory could just be
replaced by the runtime value of <prefix_dir>/lib/openmpi.
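To make the prepend option concrete, here is a minimal sketch of how the search path could be assembled once a runtime prefix is known. It is only an illustration, not the Open MPI implementation: build_component_path() and EXAMPLE_PKGLIBDIR are hypothetical names (the macro stands in for the compile-time OPAL_PKGLIBDIR), and how the --prefix value would actually reach this code is left open.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the compile-time OPAL_PKGLIBDIR value. */
#define EXAMPLE_PKGLIBDIR "/opt/openmpi/lib/openmpi"

/*
 * Build a colon-separated component search path.  If a runtime prefix is
 * given, <prefix>/lib/openmpi is placed first; the compile-time directory
 * and the per-user directory follow.  With no prefix, the result matches
 * the default built in the mca_base_open.c snippet above.
 * Returns a malloc'd string the caller must free, or NULL on failure.
 */
char *build_component_path(const char *prefix)
{
    int have_prefix = (prefix != NULL && prefix[0] != '\0');
    int len;
    char *path;

    /* First pass: ask snprintf how many bytes are needed. */
    if (have_prefix) {
        len = snprintf(NULL, 0, "%s/lib/openmpi:%s:~/.openmpi/components",
                       prefix, EXAMPLE_PKGLIBDIR);
    } else {
        len = snprintf(NULL, 0, "%s:~/.openmpi/components",
                       EXAMPLE_PKGLIBDIR);
    }
    if (len < 0) {
        return NULL;
    }

    path = malloc((size_t)len + 1);
    if (path == NULL) {
        return NULL;
    }

    /* Second pass: format into the buffer. */
    if (have_prefix) {
        snprintf(path, (size_t)len + 1,
                 "%s/lib/openmpi:%s:~/.openmpi/components",
                 prefix, EXAMPLE_PKGLIBDIR);
    } else {
        snprintf(path, (size_t)len + 1, "%s:~/.openmpi/components",
                 EXAMPLE_PKGLIBDIR);
    }
    return path;
}
```

Putting the prefix directory first means a relocated installation is searched before the (possibly stale) compile-time location, while the old default still works when no prefix is given.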
I've compiled a quick fix into libopal.so to see if the approach
addresses the issue. I didn't see how to get access to the --prefix
directory at this point, so I just prepended getenv("LD_LIBRARY_PATH")
to "value" and added <prefix_dir>/lib/openmpi to LD_LIBRARY_PATH before
starting the app. (Note: this is just a way of verifying that using the
--prefix directory here would address the issue; it is not a proposed
solution. The <prefix_dir>/lib/openmpi directory should be used
directly.) Anyway, this fixed the issue and the application was able to
start.
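For reference, the verification hack described above amounts to roughly the following. This is a reconstruction from the description, not the exact patch: debug_component_path() is a hypothetical name, and pkglibdir stands in for OPAL_PKGLIBDIR.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Verification hack only: prepend $LD_LIBRARY_PATH (if set and non-empty)
 * to the default component path.  pkglibdir stands in for OPAL_PKGLIBDIR.
 * Returns a malloc'd string the caller must free, or NULL on failure.
 */
char *debug_component_path(const char *pkglibdir)
{
    const char *ld = getenv("LD_LIBRARY_PATH");
    int have_ld = (ld != NULL && ld[0] != '\0');
    int len;
    char *value;

    /* Measure first, then allocate and format. */
    if (have_ld) {
        len = snprintf(NULL, 0, "%s:%s:~/.openmpi/components", ld, pkglibdir);
    } else {
        len = snprintf(NULL, 0, "%s:~/.openmpi/components", pkglibdir);
    }
    if (len < 0) {
        return NULL;
    }

    value = malloc((size_t)len + 1);
    if (value == NULL) {
        return NULL;
    }

    if (have_ld) {
        snprintf(value, (size_t)len + 1, "%s:%s:~/.openmpi/components",
                 ld, pkglibdir);
    } else {
        snprintf(value, (size_t)len + 1, "%s:~/.openmpi/components",
                 pkglibdir);
    }
    return value;
}
```

Because LD_LIBRARY_PATH can list several directories, every one of them ends up on the component search path, which is one more reason this is only suitable as a test and not as a fix.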
In applying this fix, I also found that it was important for
mca_base_param_component_path to include the
<prefix_dir>/lib/openmpi directory not only in the instances of mpirun
and the MPI app, but also in all instances of orted before they
dynamically load libraries.
----
In summary, it seems that this issue can be resolved by applying
the --prefix directory (<prefix_dir>/lib/openmpi) to
mca_base_param_component_path in instances of mpirun, orted, and
the MPI app.
Any help in getting this fix implemented in the code base would be
very much appreciated, and I'll be happy to provide any more
information or help.
Regards,
Patrick
P.S. Even with the fix, a (non-fatal) message is printed. It's
probably a tangential issue, but thought it was worth mentioning.
Again, the --prefix directory probably needs to be used somewhere
in place of a static directory. The message is:
--------------------------------------------------------------------------
Sorry! You were supposed to get help about:
rds:no-hostfile
from the file:
help-rds-hostfile.txt
But I couldn't find any file matching that name. Sorry!
--------------------------------------------------------------------------
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems