On Jun 26, 2007, at 5:06 PM, Georg Wassen wrote:
Hello all,
I temporarily worked around my former problem by using synchronous
communication and shifting the initialization
into the first call of a collective operation.
Nevertheless, I found a performance bug in btl_openib.
When I execute the attached sendrecv.c on 4 (or more) nodes of a
Pentium D cluster with InfiniBand, each receiving process gets only
8 messages within a few seconds and then does nothing for at least
20 seconds. (I executed the following command and hit Ctrl-C 20
seconds after the last output.)
This sounds like it could be a progression issue. When the openib
BTL is used by itself, we crank the frequency of the file descriptor
progression engine down very low because most progression will come
from verbs (not select/poll). I wonder if this is somehow related.
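To illustrate what I mean by "cranked down": conceptually the progress
loop does something like the sketch below. This is not actual Open MPI
code -- all of the names and the tick-rate constant are made up purely
for illustration of the idea.

/* Conceptual sketch only -- NOT actual Open MPI code.  All names here
   (poll_verbs_completion_queues, poll_file_descriptors, FD_TICK_RATE)
   are hypothetical; the point is just that when openib is the only BTL,
   the fd/event engine is polled far less often than the verbs CQs. */

#define FD_TICK_RATE 10000           /* hypothetical "very low" fd poll rate */

static int poll_verbs_completion_queues(void) { return 0; }   /* stub */
static int poll_file_descriptors(void)        { return 0; }   /* stub */

int progress_sketch(void)
{
    static unsigned long tick = 0;
    int completed = 0;

    completed += poll_verbs_completion_queues();   /* every call */

    if ((++tick % FD_TICK_RATE) == 0) {            /* only occasionally */
        completed += poll_file_descriptors();
    }
    return completed;
}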
FWIW: the fact that you have to use PML_CALL() is by design. The MPI
API has all the error-checking stuff for ensuring that MPI_INIT
completed, error checking of parameters, etc. We never invoke the
top-level MPI API from elsewhere in the OMPI code base (except from
within ROMIO; we didn't want to make wholesale changes to that
package because it would make for extreme difficulty every time we
imported a new version). There are also fault-tolerance reasons why
it's not good to call back up to the top-level MPI API.
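For reference, the internal pattern looks roughly like the sketch
below (in the style of the coll components; treat the exact headers
and argument details as approximate and check the code in your tree):

/* Sketch of going straight to the PML instead of the top-level MPI API,
   roughly in the style of OMPI's internal components (e.g. coll basic).
   Exact headers/macros may differ between versions -- check your tree. */
#include "ompi_config.h"
#include "mpi.h"
#include "ompi/communicator/communicator.h"
#include "ompi/mca/pml/pml.h"

static int exchange_with_peer(void *sbuf, void *rbuf, int count,
                              struct ompi_datatype_t *dtype, int peer,
                              int tag, struct ompi_communicator_t *comm)
{
    int rc;

    /* send without the MPI-level parameter/MPI_INIT checking layer */
    rc = MCA_PML_CALL(send(sbuf, count, dtype, peer, tag,
                           MCA_PML_BASE_SEND_STANDARD, comm));
    if (OMPI_SUCCESS != rc) {
        return rc;
    }

    /* matching receive, again straight through the PML */
    return MCA_PML_CALL(recv(rbuf, count, dtype, peer, tag, comm,
                             MPI_STATUS_IGNORE));
}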
But I agree with Andrew; if this is init-level stuff that does not
need to be exchanged on a per-communicator basis, then the modex
is probably your best bet. Avoid using the RML directly if possible.
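If it helps, the modex usage pattern looks roughly like the sketch
below. The function names and signatures here are from memory of the
v1.2 tree, so treat them as assumptions and check how the openib BTL
component publishes its port information for the real calls.

/* Sketch of the modex pattern (names/signatures from memory -- treat
   them as assumptions and verify against your tree).  "my_component"
   and "my_modex_info_t" are hypothetical. */
#include <stdint.h>
#include "ompi_config.h"
#include "ompi/mca/pml/base/pml_base_module_exchange.h"

typedef struct {
    uint32_t some_local_id;       /* hypothetical per-process datum */
} my_modex_info_t;

/* At component init time: publish this process's info exactly once. */
static int publish_my_info(mca_base_component_t *my_component, uint32_t id)
{
    my_modex_info_t info;
    info.some_local_id = id;
    return mca_pml_base_modex_send(my_component, &info, sizeof(info));
}

/* Later (e.g. in add_procs): look up what a given peer published. */
static int lookup_peer_info(mca_base_component_t *my_component,
                            struct ompi_proc_t *peer, uint32_t *id_out)
{
    my_modex_info_t *info;
    size_t size;
    int rc = mca_pml_base_modex_recv(my_component, peer,
                                     (void **)&info, &size);
    if (OMPI_SUCCESS == rc && size >= sizeof(*info)) {
        *id_out = info->some_local_id;
    }
    return rc;
}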
wassen@elrohir:~/src/mpi_test$ mpirun -np 4 -host
pd-01,pd-02,pd-03,pd-04 -mca btl openib,self sendrecv
[3] received data[0]=1
[1] received data[0]=1
[1] received data[1]=2
[1] received data[2]=3
[1] received data[3]=4
[1] received data[4]=5
[1] received data[5]=6
[1] received data[6]=7
[1] received data[7]=8
[2] received data[0]=1
[2] received data[1]=2
[2] received data[2]=3
[2] received data[3]=4
[2] received data[4]=5
[2] received data[5]=6
[2] received data[6]=7
[2] received data[7]=8
[3] received data[1]=2
[3] received data[2]=3
[3] received data[3]=4
[3] received data[4]=5
[3] received data[5]=6
[3] received data[6]=7
[3] received data[7]=8
{20 sec. later...}
mpirun: killing job...
When I execute the same program with "-mca btl udapl,self" or "-mca
btl tcp,self", it runs fine and terminates in less than a second.
I tried with Open MPI 1.2.1 and 1.2.3. The test program runs fine
with several other MPIs (Intel MPI and MVAPICH with InfiniBand,
MP-MPICH with SCI).
I hope my information suffices to reproduce the problem.
Best regards,
Georg Wassen.
P.S.: I know that I could transmit the whole array in one MPI_Send,
but this is extracted from my real problem.
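(For reference, the single-message variant I mean would just replace
the send/receive block of sendrecv.c below with something like this;
the variables are the ones declared in that program.)

/* Single-message variant (what the P.S. refers to): transmit the whole
   array in one MPI_Send/MPI_Recv instead of NUM separate messages. */
if (myrank == 0) {
    for (i = 1; i < count; i++) {
        MPI_Send(data, NUM, MPI_INT, i, 99, MPI_COMM_WORLD);
    }
} else {
    MPI_Recv(data, NUM, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
    for (j = 0; j < NUM; j++) {
        printf("[%d] received data[%d]=%d\n", myrank, j, data[j]);
    }
}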
--------------------1st node-----------------------
wassen@pd-01:~$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
fw_ver: 1.2.0
node_guid: 0002:c902:0020:b680
sys_image_guid: 0002:c902:0020:b683
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0xA0
board_id: MT_0230000001
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
---------------------------------------------------------
wassen@pd-01:~$ /sbin/ifconfig
...
ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:b681/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:260 errors:0 dropped:0 overruns:0 frame:0
          TX packets:331 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:14356 (14.0 KiB)  TX bytes:24960 (24.3 KiB)
-------------------------------------------------------
#include "mpi.h"
#include <stdio.h>
#define NUM 16
int main(int argc, char **argv) {
int myrank, count;
MPI_Status status;
int data[NUM] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
int i, j;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &count);
if (myrank == 0) {
for (i=1; i<count; i++) {
for (j=0; j<NUM; j++) {
MPI_Send(&data[j], 1, MPI_INT, i, 99, MPI_COMM_WORLD);
}
}
} else {
for (j=0; j<NUM; j++) {
MPI_Recv(&data[j], 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
printf("[%d] received data[%d]=%d\n", myrank, j, data[j]);
}
}
MPI_Finalize();
}
<config.log.gz>
<ompi_info_all.txt.gz>
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
--
Jeff Squyres
Cisco Systems