On Jun 26, 2007, at 5:06 PM, Georg Wassen wrote:

Hello all,

I temporarily worked around my former problem by using synchronous communication and shifting the initialization
into the first call of a collective operation.

Nevertheless, I found a performance bug in btl_openib.

When I execute the attached sendrecv.c on 4 (or more) nodes of a Pentium D cluster with InfiniBand, each receiving process receives only 8 messages within a few seconds and then does nothing for at least 20 seconds. (I executed the following command and hit Ctrl-C 20 seconds after the last output.)

This sounds like it could be a progression issue. When the openib BTL is used by itself, we crank the frequency of the file descriptor progression engine down very low because most progression will come from verbs (not select/poll). I wonder if this is somehow related.
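
To make that concrete, here is a purely illustrative C sketch (not the actual Open MPI progress code; all names and the interval value are invented) of what "cranking the fd progression frequency down" means: verbs completions are polled on every progress call, while file descriptors are only checked every Nth call.

/* Conceptual sketch only -- NOT the Open MPI progress engine. */
#include <stdio.h>

#define FD_POLL_INTERVAL 10000   /* hypothetical: fds checked every 10000th call */

static unsigned long progress_calls = 0;
static unsigned long fd_polls = 0;

static void poll_verbs_completions(void) { /* placeholder: drain the IB CQs */ }
static void poll_file_descriptors(void)  { fd_polls++; /* placeholder: select()/poll() */ }

static void progress(void)
{
    poll_verbs_completions();                        /* done on every call */
    if (++progress_calls % FD_POLL_INTERVAL == 0) {
        poll_file_descriptors();                     /* done only rarely */
    }
}

int main(void)
{
    int i;
    for (i = 0; i < 1000000; i++) {
        progress();
    }
    printf("%lu fd polls in %lu progress calls\n", fd_polls, progress_calls);
    return 0;
}

If the hang really is a progression issue, anything that depends on the rarely-serviced path to make progress would appear to stall for long stretches, which would match the "nothing for at least 20 seconds" symptom.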

FWIW: the reason you have to use PML_CALL() is by design. The MPI API has all the error checking stuff for ensuring that MPI_INIT completed, error checking of parameters, etc. We never invoke the top-level MPI API from elsewhere in the OMPI code base (except from within ROMIO; we didn't want to make wholesale changes to that package because it would make for extreme difficulty every time we imported a new version). There are also fault tolerance reasons why it's not good to call back up to the top-level MPI API.
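
A toy sketch of that layering (invented names, not actual OMPI code) may help: the top-level API function carries the initialization and parameter checks, while internal callers, analogous to MCA_PML_CALL(send(...)) inside Open MPI, go straight to the lower layer.

/* Toy sketch -- NOT Open MPI code; all names here are invented. */
#include <stdio.h>

static int initialized = 0;

/* lower layer ("PML"): does the work, assumes arguments are already valid */
static int pml_send(const void *buf, int count, int dest)
{
    (void)buf; (void)dest;
    return (count >= 0) ? 0 : -1;
}

/* top-level API: the MPI_INIT check, parameter checking, etc. live here */
int API_Send(const void *buf, int count, int dest)
{
    if (!initialized)                         return -1;  /* "MPI not initialized" */
    if (buf == NULL || count < 0 || dest < 0) return -1;  /* parameter checks */
    return pml_send(buf, count, dest);
}

int main(void)
{
    int data = 42;
    initialized = 1;
    API_Send(&data, 1, 0);   /* user code goes through the checked API ... */
    pml_send(&data, 1, 0);   /* ... internal components call the lower layer directly */
    return 0;
}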

But I agree with Andrew; if this is init-level stuff that is not necessary to be exchanged on a per-communicator basis, then the modex is probably your best bet. Avoid using the RML directly if possible.
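
If it helps to picture the data flow: semantically, the modex is an all-to-all exchange of small per-process blobs during startup, roughly what MPI_Allgather does at the user level. The sketch below only illustrates that data flow with plain MPI; it is not how a component calls the modex internally (check the modex entry points in the tree for the real API).

/* Illustration of the modex-style data flow only -- not the internal API. */
#include <mpi.h>
#include <stdio.h>

#define MAX_PROCS 64          /* assumes a small job, for brevity */

int main(int argc, char **argv)
{
    int rank, size, i;
    int my_blob;              /* stand-in for per-process init data */
    int all_blobs[MAX_PROCS]; /* one entry per peer after the exchange */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    my_blob = rank * 100;
    MPI_Allgather(&my_blob, 1, MPI_INT, all_blobs, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < size; i++) {
            printf("blob published by rank %d: %d\n", i, all_blobs[i]);
        }
    }

    MPI_Finalize();
    return 0;
}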

wassen@elrohir:~/src/mpi_test$ mpirun -np 4 -host pd-01,pd-02,pd-03,pd-04 -mca btl openib,self sendrecv
[3] received data[0]=1
[1] received data[0]=1
[1] received data[1]=2
[1] received data[2]=3
[1] received data[3]=4
[1] received data[4]=5
[1] received data[5]=6
[1] received data[6]=7
[1] received data[7]=8
[2] received data[0]=1
[2] received data[1]=2
[2] received data[2]=3
[2] received data[3]=4
[2] received data[4]=5
[2] received data[5]=6
[2] received data[6]=7
[2] received data[7]=8
[3] received data[1]=2
[3] received data[2]=3
[3] received data[3]=4
[3] received data[4]=5
[3] received data[5]=6
[3] received data[6]=7
[3] received data[7]=8
{20 sec. later...}
mpirun: killing job...

When I execute the same program with "-mca btl udapl,self" or "-mca btl tcp,self", it runs fine and terminates in less than a second. I tried this with Open MPI 1.2.1 and 1.2.3. The test program also runs fine with several other MPI implementations (Intel MPI and MVAPICH over InfiniBand, MP-MPICH over SCI).

I hope this information suffices to reproduce the problem.

Best regards,
Georg Wassen.

P.S. I know that I could transmit the whole array in a single MPI_Send, but this is extracted from my real problem.
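
For reference, the single-message variant would replace the inner loops of the attached sendrecv.c (reusing its myrank, count, data, and status variables) with something like:

/* single-send variant of the loops in sendrecv.c (variables as declared there) */
if (myrank == 0) {
    for (i = 1; i < count; i++) {
        MPI_Send(data, NUM, MPI_INT, i, 99, MPI_COMM_WORLD);   /* whole array at once */
    }
} else {
    MPI_Recv(data, NUM, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
}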



--------------------1st node-----------------------
wassen@pd-01:~$ /opt/infiniband/bin/ibv_devinfo
hca_id: mthca0
       fw_ver:                         1.2.0
       node_guid:                      0002:c902:0020:b680
       sys_image_guid:                 0002:c902:0020:b683
       vendor_id:                      0x02c9
       vendor_part_id:                 25204
       hw_ver:                         0xA0
       board_id:                       MT_0230000001
       phys_port_cnt:                  1
               port:   1
                       state:                  PORT_ACTIVE (4)
                       max_mtu:                2048 (4)
                       active_mtu:             2048 (4)
                       sm_lid:                 1
                       port_lid:               1
                       port_lmc:               0x00

---------------------------------------------------------
wassen@pd-01:~$ /sbin/ifconfig
...
ib0       Link encap:UNSPEC  HWaddr 00-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
          inet addr:192.168.0.11  Bcast:192.168.0.255  Mask:255.255.255.0
          inet6 addr: fe80::202:c902:20:b681/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:2044  Metric:1
          RX packets:260 errors:0 dropped:0 overruns:0 frame:0
          TX packets:331 errors:0 dropped:2 overruns:0 carrier:0
          collisions:0 txqueuelen:128
          RX bytes:14356 (14.0 KiB)  TX bytes:24960 (24.3 KiB)
-------------------------------------------------------
#include "mpi.h"
#include <stdio.h>

#define NUM 16

int main(int argc, char **argv) {
    int myrank, count;
    MPI_Status status;


    int data[NUM] = {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16};
    int i, j;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &count);

    if (myrank == 0) {
      for (i=1; i<count; i++) {
        for (j=0; j<NUM; j++) {
          MPI_Send(&data[j], 1, MPI_INT, i, 99, MPI_COMM_WORLD);
        }
      }
    } else {
      for (j=0; j<NUM; j++) {
        MPI_Recv(&data[j], 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);
        printf("[%d] received data[%d]=%d\n", myrank, j, data[j]);
      }
    }

    MPI_Finalize();
}
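
The program can be built and launched in the usual way, e.g. (same mpirun line as above):

mpicc sendrecv.c -o sendrecv
mpirun -np 4 -host pd-01,pd-02,pd-03,pd-04 -mca btl openib,self sendrecv
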
<config.log.gz>
<ompi_info_all.txt.gz>


--
Jeff Squyres
Cisco Systems

