OK. This sounds sensible.
Thanks, David
On Mar 22, 2007, at 10:38 AM, Ralph Castain wrote:
We had a nice chat about this on the OpenRTE telecon this morning. The
question of what to do with multiple prefixes has been a long-running
issue, most recently captured in Trac bug report #497. The problem is
that the prefix is intended to tell us where to find the ORTE/OMPI
executables, and is therefore associated with a node, not an
app_context. What we haven't been able to define is an appropriate
notation that a user can use to tell us that association.
This issue has arisen on several occasions where either (a) users have
heterogeneous clusters with a common file system, so the prefix must be
adjusted on each *type* of node to point to the correct type of binary;
or (b) for whatever reason, typically on rsh/ssh clusters, users have
installed the binaries in different locations on some of the nodes. In
this latter case, the reports have been from homogeneous clusters, so
the *type* of binary was never the issue; it just wasn't located where
we expected.
Sun's solution is (I believe) what most of us would expect: they locate
their executables in the same relative location on all their nodes, and
the binary in that location is correct for that node's local
architecture. This requires, though, that the "prefix" location not be
on a common file system.
Unfortunately, that isn't the case with LANL's Roadrunner, nor can we
expect that everyone will follow that sensible approach :-). So we need
a notation to support the "exception" case where someone truly needs to
specify the prefix on a per-node basis.
We discussed a number of options, including auto-detecting the local
architecture and appending it to the specified "prefix", among several
others. In the end, those of us on the call decided that the best
solution would be to add a field to the hostfile that specifies the
prefix to use on that host. This could be done on a cluster-level basis,
so, although creating the data file is annoying, it would only have to
be done once.
Again, this is the exception case, so requiring a little inconvenience
seems reasonable.
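To make the hostfile proposal concrete, here is a minimal Python sketch of how a per-host prefix field might be parsed and applied. The `prefix=` key, the host names, and the helpers `parse_hostfile` and `prefix_for` are all hypothetical; this is not an existing Open MPI hostfile option. Only `slots=` mirrors the existing hostfile syntax. The sketch assumes a per-host prefix overrides the global --prefix, and hosts without one fall back to it.

```python
# Sketch of a HYPOTHETICAL per-host "prefix=" hostfile field.
# "slots=" mirrors existing hostfile syntax; "prefix=" is an assumption.
from typing import Dict, Optional


def parse_hostfile(text: str) -> Dict[str, Optional[str]]:
    """Map each host name to its per-host prefix (None if not given)."""
    prefixes: Dict[str, Optional[str]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue  # skip blank and comment-only lines
        host, *fields = line.split()
        prefix = None
        for field in fields:
            key, sep, value = field.partition("=")
            if sep and key == "prefix":  # hypothetical key
                prefix = value
        prefixes[host] = prefix
    return prefixes


def prefix_for(host: str, prefixes: Dict[str, Optional[str]], default: str) -> str:
    """A per-host prefix wins; otherwise fall back to the global --prefix."""
    p = prefixes.get(host)
    return p if p is not None else default


HOSTFILE = """
# heterogeneous nodes sharing one file system
host-x86_64 slots=2 prefix=/opt/ompi/x86_64
host-ppc64  slots=2 prefix=/opt/ompi/ppc64
host-plain  slots=4
"""
```

Keeping the field optional means existing hostfiles keep working unchanged, which fits treating per-node prefixes as the exception case.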
Anyone have heartburn and/or other suggestions? If not, we might start
to play with this next week. We would have to make some small
modifications to the RAS, RMAPS, and PLS components to ensure that any
multi-prefix info is correctly propagated and used across all platforms,
so that behavior is consistent.
Ralph
On 3/22/07 9:11 AM, "David Daniel" <d...@lanl.gov> wrote:
This is a development system for Roadrunner using ssh.
David
On Mar 22, 2007, at 5:19 AM, Jeff Squyres wrote:
FWIW, I believe that we had intended --prefix to handle simple cases,
which is why this probably doesn't work for you. But as long as the
different prefixes are specified for different nodes, it could probably
be made to work.
Which launcher are you using this with?
On Mar 21, 2007, at 11:36 PM, Ralph Castain wrote:
Yo David
What system are you running this on? RoadRunner? If so, I can take a
look at "fixing" it for you tomorrow (Thurs).
Ralph
On 3/21/07 10:17 AM, "David Daniel" <d...@lanl.gov> wrote:
I'm experimenting with heterogeneous applications (x86_64 <--> ppc64),
where the systems share the file system on which Open MPI is installed.
What I would like to be able to do is something like this:
    mpirun --np 1 --host host-x86_64 --prefix /opt/ompi/x86_64 a.out.x86_64 : \
           --np 1 --host host-ppc64 --prefix /opt/ompi/ppc64 a.out.ppc64
Unfortunately, it looks as if the second --prefix is always ignored. My
guess is that orte_app_context_t::prefix_dir is getting set, but only
the 0th app context is ever consulted (except in the dynamic process
stuff, where I do see a loop over the app context array).
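The reported symptom can be modeled in a few lines. This is a simplified Python sketch, not ORTE's actual code: `AppContext` stands in for orte_app_context_t, and `prefixes_observed` and `prefixes_expected` are hypothetical names that contrast the behavior described above (only app context 0's prefix_dir is consulted) with the desired per-context lookup.

```python
# Simplified model of the reported --prefix behavior; NOT ORTE's code.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AppContext:
    app: str
    host: str
    prefix_dir: Optional[str]  # stands in for orte_app_context_t::prefix_dir


def prefixes_observed(contexts: List[AppContext]) -> List[Optional[str]]:
    """Reported behavior: every launch uses app context 0's prefix."""
    return [contexts[0].prefix_dir for _ in contexts]


def prefixes_expected(contexts: List[AppContext]) -> List[Optional[str]]:
    """Desired behavior: each app context launches with its own prefix."""
    return [ctx.prefix_dir for ctx in contexts]


# The two app contexts from the mpirun command line above.
contexts = [
    AppContext("a.out.x86_64", "host-x86_64", "/opt/ompi/x86_64"),
    AppContext("a.out.ppc64", "host-ppc64", "/opt/ompi/ppc64"),
]
```

In the real launcher, a fix would presumably loop over the app context array when choosing each node's prefix, much as the dynamic process code already does.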
I can of course work around it with startup scripts, but a command-line
solution would be attractive.
This is with openmpi-1.2.
Thanks, David
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems
--
David Daniel <d...@lanl.gov>
Computer Science for High-Performance Computing (CCS-1)