We are running into IBV_WC_RETRY_EXC_ERR errors on large rdma_reads when running Intel MPI (iMPI) with the IMB Alltoallv benchmark. The failure always occurs between processes on the same node. Could this be an HCA loopback issue?

Has anyone else run into rdma_read issues like this?

Here are the details:

Two Clovertown X5355 server nodes (8 cores each), RHEL4u4, Intel MPI 3.0.

The QP retry_cnt is set to 7 (the maximum the 3-bit field allows).
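
For reference, a minimal sketch of where that knob lives at the verbs level. Intel MPI/uDAPL perform this RTR-to-RTS transition internally, so the qp handle, PSN, and max_rd_atomic values below are placeholders rather than our actual settings:

#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Move a connected RC QP to RTS; retry_cnt is the transport retry
 * budget that, once exhausted, surfaces as IBV_WC_RETRY_EXC_ERR. */
static int move_to_rts(struct ibv_qp *qp, uint32_t sq_psn)
{
        struct ibv_qp_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.qp_state      = IBV_QPS_RTS;
        attr.timeout       = 14;     /* local ACK timeout: 4.096 us * 2^14 */
        attr.retry_cnt     = 7;      /* the value we are running with */
        attr.rnr_retry     = 7;      /* 7 = retry RNR NAKs indefinitely */
        attr.sq_psn        = sq_psn;
        attr.max_rd_atomic = 4;      /* outstanding RDMA reads as initiator */

        return ibv_modify_qp(qp, &attr,
                             IBV_QP_STATE     | IBV_QP_TIMEOUT   |
                             IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY |
                             IBV_QP_SQ_PSN    | IBV_QP_MAX_QP_RD_ATOMIC);
}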

[EMAIL PROTECTED] src]$ ibv_devinfo
hca_id: mthca0
        fw_ver:                         4.8.200
        node_guid:                      0002:c902:0000:4fa8
        sys_image_guid:                 0002:c902:0000:4fa8
        vendor_id:                      0x02c9
        vendor_part_id:                 25208
        hw_ver:                         0xA0
        board_id:                       MT_00A0000001
        phys_port_cnt:                  2



[EMAIL PROTECTED] src]$ mpiexec -perhost 8 -n 8 -env DAPL_DBG_TYPE 0x83 -env I_MPI_DEBUG 0 -env I_MPI_DEVICE rdma ./IMB-MPI1 alltoallv -npmin 16
#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.0, MPI-1 part
#---------------------------------------------------
# Date                  : Fri Sep 28 12:26:05 2007
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.9-42.ELsmp
# Version               : #1 SMP Wed Jul 12 23:32:02 EDT 2006
# MPI Version           : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE

#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   4194304
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#

# List of Benchmarks to run:

# Alltoallv

#----------------------------------------------------------------
# Benchmarking Alltoallv
# #processes = 8
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.66         0.70         0.68
            1         1000       483.97       484.49       484.32
            2         1000       483.35       483.41       483.37
            4         1000       484.29       484.41       484.39
            8         1000       483.86       484.01       483.97
           16         1000       479.72       479.87       479.82
           32         1000       483.95       484.07       484.00
           64         1000       482.13       482.27       482.22
          128         1000       485.00       485.13       485.09
          256         1000       485.93       486.06       486.00
          512         1000       487.68       487.78       487.72
         1024         1000       487.82       487.98       487.94
         2048         1000       497.09       497.27       497.21
         4096         1000       510.79       510.95       510.86
         8192         1000       506.51       506.64       506.59
        16384         1000       642.15       642.26       642.21
        32768         1000      1816.55      1816.80      1816.67
        65536          640      2926.42      2926.65      2926.51
       131072          320      5214.20      5215.18      5214.64
       262144          160     10018.31     10021.30     10020.22
       524288           80     19554.79     19581.09     19573.01
      1048576           40     43291.05     43342.45     43323.24
      2097152           20    109898.01    110455.85    110361.47
 DTO completion ERROR: 12: op 0x2
 DTO completion ERROR: 12: op 0x2 (ep disconnected)
[0][rdma_iba.c:193] Intel MPI fatal error: DTO operation completed with error. status=0x1. cookie=0x0
 DTO completion ERROR: 5: op 0x2
[7][rdma_iba.c:193] Intel MPI fatal error: DTO operation completed with error. status=0x8. cookie=0x4
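
In case it helps with mapping the numbers: assuming the DAPL provider is printing raw ibv_wc_status values here, 12 is IBV_WC_RETRY_EXC_ERR, 5 is IBV_WC_WR_FLUSH_ERR, and "op 0x2" lines up with IBV_WC_RDMA_READ, which matches the large-read symptom. A minimal sketch of decoding failed completions at the verbs level (the cq argument is a placeholder):

#include <stdio.h>
#include <infiniband/verbs.h>

/* Drain a CQ and report any work completions that did not succeed. */
static void report_cq_errors(struct ibv_cq *cq)
{
        struct ibv_wc wc;

        while (ibv_poll_cq(cq, 1, &wc) > 0) {
                if (wc.status == IBV_WC_SUCCESS)
                        continue;
                /* On error, only wr_id, status, and qp_num are reliable. */
                fprintf(stderr, "wr_id %llu on QP 0x%x: %s (%d)\n",
                        (unsigned long long)wc.wr_id, wc.qp_num,
                        ibv_wc_status_str(wc.status), wc.status);
        }
}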

Thanks,

-arlin