On Jun 26, 2007, at 5:06 PM, Georg Wassen wrote:

Hello all,

I temporarily worked around my former problem by using synchronous communication and shifting the initialization
into the first call of a collective operation.

Nevertheless, I found a performance bug in btl_openib.

When I execute the attached sendrecv.c on 4 (or more) nodes of a Pentium D cluster with InfiniBand, each receiving process receives only 8 messages within a few seconds and then does nothing for at least 20 seconds. (I executed the following command and hit Ctrl-C 20 seconds after the last output.)

This sounds like it could be a progression issue. When the openib BTL is used by itself, we crank the frequency of the file descriptor progression engine down very low because most progression will come from verbs (not select/poll). I wonder if this is somehow related.
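
To make that concrete, here is a purely illustrative C sketch (not the actual Open MPI progress code; all names and the interval value are invented) of what "cranking the fd progression frequency down" means: verbs completions are polled on every progress call, while file descriptors are only checked every Nth call.

/* Conceptual sketch only -- NOT the Open MPI progress engine. */
#include <stdio.h>

#define FD_POLL_INTERVAL 10000   /* hypothetical: fds checked every 10000th call */

static unsigned long progress_calls = 0;
static unsigned long fd_polls = 0;

static void poll_verbs_completions(void) { /* placeholder: drain the IB CQs */ }
static void poll_file_descriptors(void)  { fd_polls++; /* placeholder: select()/poll() */ }

static void progress(void)
{
    poll_verbs_completions();                        /* done on every call */
    if (++progress_calls % FD_POLL_INTERVAL == 0) {
        poll_file_descriptors();                     /* done only rarely */
    }
}

int main(void)
{
    int i;
    for (i = 0; i < 1000000; i++) {
        progress();
    }
    printf("%lu fd polls in %lu progress calls\n", fd_polls, progress_calls);
    return 0;
}

If the hang really is a progression issue, anything that depends on the rarely-serviced path to make progress would appear to stall for long stretches, which would match the "nothing for at least 20 seconds" symptom.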

FWIW: the reason you have to use PML_CALL() is by design. The MPI API has all the error checking stuff for ensuring that MPI_INIT completed, error checking of parameters, etc. We never invoke the top-level MPI API from elsewhere in the OMPI code base (except from within ROMIO; we didn't want to make wholesale changes to that package because it would make for extreme difficulty every time we imported a new version). There are also fault tolerance reasons why it's not good to call back up to the top-level MPI API.
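
A toy sketch of that layering (invented names, not actual OMPI code) may help: the top-level API function carries the initialization and parameter checks, while internal callers, analogous to MCA_PML_CALL(send(...)) inside Open MPI, go straight to the lower layer.

/* Toy sketch -- NOT Open MPI code; all names here are invented. */
#include <stdio.h>

static int initialized = 0;

/* lower layer ("PML"): does the work, assumes arguments are already valid */
static int pml_send(const void *buf, int count, int dest)
{
    (void)buf; (void)dest;
    return (count >= 0) ? 0 : -1;
}

/* top-level API: the MPI_INIT check, parameter checking, etc. live here */
int API_Send(const void *buf, int count, int dest)
{
    if (!initialized)                         return -1;  /* "MPI not initialized" */
    if (buf == NULL || count < 0 || dest < 0) return -1;  /* parameter checks */
    return pml_send(buf, count, dest);
}

int main(void)
{
    int data = 42;
    initialized = 1;
    API_Send(&data, 1, 0);   /* user code goes through the checked API ... */
    pml_send(&data, 1, 0);   /* ... internal components call the lower layer directly */
    return 0;
}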

But I agree with Andrew; if this is init-level stuff that is not necessary to be exchanged on a per-communicator basis, then the modex is probably your best bet. Avoid using the RML directly if possible.
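
If it helps to picture the data flow: semantically, the modex is an all-to-all exchange of small per-process blobs during startup, roughly what MPI_Allgather does at the user level. The sketch below only illustrates that data flow with plain MPI; it is not how a component calls the modex internally (check the modex entry points in the tree for the real API).

/* Illustration of the modex-style data flow only -- not the internal API. */
#include <mpi.h>
#include <stdio.h>

#define MAX_PROCS 64          /* assumes a small job, for brevity */

int main(int argc, char **argv)
{
    int rank, size, i;
    int my_blob;              /* stand-in for per-process init data */
    int all_blobs[MAX_PROCS]; /* one entry per peer after the exchange */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    my_blob = rank * 100;
    MPI_Allgather(&my_blob, 1, MPI_INT, all_blobs, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size; i++) {
            printf("blob published by rank %d: %d\n", i, all_blobs[i]);
        }
    }

    MPI_Finalize();
    return 0;
}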

wassen@elrohir:~/src/mpi_test$ mpirun -np 4 -host pd-01,pd-02,pd-03,pd-04 -mca btl openib,self sendrecv
[3] received data[0]=1
[1] received data[0]=1
[1] received data[1]=2
[1] received data[2]=3
[1] received data[3]=4
[1] received data[4]=5
[1] received data[5]=6
[1] received data[6]=7
[1] received data[7]=8
[2] received data[0]=1
[2] received data[1]=2
[2] received data[2]=3
[2] received data[3]=4
[2] received data[4]=5
[2] received data[5]=6
[2] received data[6]=7
[2] received data[7]=8
[3] received data[1]=2
[3] received data[2]=3
[3] received data[3]=4
[3] received data[4]=5
[3] received data[5]=6
[3] received data[6]=7
[3] received data[7]=8
{20 sec. later...}
mpirun: killing job...

When I execute the same program with "-mca btl udapl,self" or "-mca btl tcp,self", it runs fine and terminates in less than a second. I tried this with Open MPI 1.2.1 and 1.2.3. The test program also runs fine with several other MPI implementations (Intel MPI and MVAPICH over InfiniBand, MP-MPICH over SCI).

I hope this information suffices to reproduce the problem.

Best regards,
Georg Wassen.

P.S. I know that I could transmit the whole array in a single MPI_Send, but this is extracted from my real problem.
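
For reference, the single-message variant would replace the inner loops of the attached sendrecv.c (reusing its myrank, count, data, and status variables) with something like:

/* single-send variant of the loops in sendrecv.c (variables as declared there) */
if (myrank == 0) {
    for (i = 1; i < count; i++) {
        MPI_Send(data, NUM, MPI_INT, i, 99, MPI_COMM_WORLD);   /* whole array at once */
    }
} else {
    MPI_Recv(data, NUM, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
}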



--------------------1st node-----------------------
wassen@pd-01:~$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
       fw_ver:                         1.2.0
       node_guid:                      0002:c902:0020:b680
       sys_image_guid:                 0002:c902:0020:b683
       vendor_id:                      0x02c9
       vendor_part_id:                 25204
       hw_ver:                         0xA0
       board_id:                       MT_0230000001
       phys_port_cnt:                  1
               port:   1
                       state:                  PORT_ACTIVE (4)
                       max_mtu:                2048 (4)
                       active_mtu:             2048 (4)
                       sm_lid:                 1
                       port_lid:               1
                       port_lmc:               0x00

---------------------------------------------------------
wassen@pd-01:~$ /sbin/ifconfig
...
ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:b681/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:260 errors:0 dropped:0 overruns:0 frame:0
          TX packets:331 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:14356 (14.0 KiB)  TX bytes:24960 (24.3 KiB)
-------------------------------------------------------
#include "mpi.h"
#include <stdio.h>

#define NUM 16

int main(int argc, char **argv) {
    int myrank, count;
    MPI_Status status;


    int data[NUM] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    int i, j;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &count);

    if (myrank == 0) {
      for (i=1; i<count; i++) {
        for (j=0; j<NUM; j++) {
          MPI_Send(&data[j], 1, MPI_INT, i, 99, MPI_COMM_WORLD);
        }
      }
    } else {
      for (j=0; j<NUM; j++) {
        MPI_Recv(&data[j], 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        printf("[%d] received data[%d]=%d\n", myrank, j, data[j]);
      }
    }

    MPI_Finalize();
}
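
The program can be built and launched in the usual way, e.g. (same mpirun line as above):

mpicc sendrecv.c -o sendrecv
mpirun -np 4 -host pd-01,pd-02,pd-03,pd-04 -mca btl openib,self sendrecv
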
<config.log.gz>
<ompi_info_all.txt.gz>


--
Jeff Squyres
Cisco Systems

