Pak Lui wrote:
Orion Poplawski wrote:
In our setup (which I don't believe is very unique) the nodes are
connected by two networks: an "admin" network which allows for
connections from outside the cluster and an "MPI" network that is a
private GigE network connecting the nodes for MPI traffic:
+---------admin net (192.168.0.X)--------+
|                   |                    |
+-----------+   +--------+          +--------+
| SGE Master|   | coop00 |          | coop01 |
|           |   | coop00x|          | coop01x|
+-----------+   +--------+          +--------+
                    |                    |
                    +--------------------+
                     MPI net (192.168.1.X)
So the names with the "x" suffix resolve to addresses on the MPI network.
Currently (with loose integration), we create machines files like:
coop00x.cora.nwra.com cpu=2
coop01x.cora.nwra.com cpu=2
which makes the MPI traffic travel over the MPI network. I'm trying
to duplicate this under "tight" integration.
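For illustration only, here is a minimal Python sketch of how such a machines file could be generated under loose integration. It assumes the node list comes from SGE's $PE_HOSTFILE (one line per host: hostname, slot count, queue, processor range) and that each MPI-network name is the admin hostname with "x" appended to the short name. The helper names mpi_hostname and machines_lines are hypothetical, not part of SGE or LAM.

```python
# Hypothetical helper: convert an SGE $PE_HOSTFILE into a LAM-style machines
# file that lists each node under its MPI-network name (the "x" suffix).
from typing import List


def mpi_hostname(admin_host: str, suffix: str = "x") -> str:
    """coop00.cora.nwra.com -> coop00x.cora.nwra.com (assumed naming scheme)."""
    short, dot, domain = admin_host.partition(".")
    return f"{short}{suffix}{dot}{domain}"


def machines_lines(pe_hostfile_text: str) -> List[str]:
    """Each $PE_HOSTFILE line reads: hostname slots queue processor-range."""
    lines = []
    for raw in pe_hostfile_text.splitlines():
        fields = raw.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        host, slots = fields[0], int(fields[1])
        lines.append(f"{mpi_hostname(host)} cpu={slots}")
    return lines
```

A job script could then write "\n".join(machines_lines(text)) to its machines file, where text is the contents of $PE_HOSTFILE. Note that under Open MPI the hostnames alone do not decide which network carries MPI traffic.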
Well, this is what we did with LAM, and I naively assumed that since
Open MPI used the same machines file format, it worked the same way there.
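But once I finally read the FAQ (specifically
<http://www.open-mpi.org/faq/?category=tcp#tcp-selection>), I saw that it
works quite differently: Open MPI chooses which network interfaces carry
MPI traffic through MCA parameters such as btl_tcp_if_include, not from
the hostnames in the machines file.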
So the default setup with gridengine integration works, and all I have is:
btl_tcp_if_include = eth1
in my /etc/openmpi-mca-params.conf file (eth1 being the nodes' interface
on the MPI network).
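The same MCA parameter can also be set per job instead of system-wide. As a sketch, assuming a Python wrapper script (the helper names, program path and process count are hypothetical), Open MPI accepts it either on the mpirun command line via --mca or as an environment variable named OMPI_MCA_ followed by the parameter name:

```python
# Sketch: per-job alternatives to the system-wide btl_tcp_if_include setting.
# Helper names, the program path and the process count are hypothetical.
from typing import Dict, List, Mapping


def mpirun_with_mca(iface: str, nprocs: int, program: str) -> List[str]:
    """Build an mpirun argv that restricts the TCP BTL to one interface."""
    return ["mpirun", "--mca", "btl_tcp_if_include", iface,
            "-np", str(nprocs), program]


def env_with_mca(iface: str, base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of base with the MCA parameter set as an environment variable."""
    env = dict(base)  # copy, so the caller's mapping is left untouched
    env["OMPI_MCA_btl_tcp_if_include"] = iface
    return env
```

For example, subprocess.run(mpirun_with_mca("eth1", 4, "./my_mpi_prog"), check=True) sets it on the command line, while passing env=env_with_mca("eth1", os.environ) to subprocess.run sets it through the environment for a plain mpirun call.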
Sorry for all the confusion.
--
Orion Poplawski
System Administrator 303-415-9701 x222
NWRA/CoRA Division FAX: 303-415-9702
3380 Mitchell Lane or...@cora.nwra.com
Boulder, CO 80301 http://www.cora.nwra.com