I've tried the new rc. Here is what I got:

1) I've successfully built it with intel-13.1 and gcc-4.7.2, but the builds
with open64-4.5.2 and ekopath-5.0.0 (PathScale) failed. The problems are in
the Fortran part. In each case I used the following configure line:
CC=$CC CXX=$CXX F77=$F77 FC=$FC ./configure --prefix=$prefix
--with-knem=$knem_path
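For the Open64 build, for example, that expands to roughly the following
(a sketch only: the opencc/openCC/openf95 driver names are what I assume the
Open64 toolchain provides; prefix and knem path are the same as above):
CC=opencc CXX=openCC F77=openf95 FC=openf95 ./configure --prefix=$prefix \
    --with-knem=$knem_path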
Open64 failed during configuration with the following:
*** Fortran compiler
checking whether we are using the GNU Fortran compiler... yes
checking whether openf95 accepts -g... yes
configure: WARNING: Open MPI now ignores the F77 and FFLAGS environment
variables; only the FC and FCFLAGS environment variables are used.
checking whether ln -s works... yes
checking if Fortran compiler works... yes
checking for extra arguments to build a shared library... none needed
checking for Fortran flag to compile .f files... none
checking for Fortran flag to compile .f90 files... none
checking to see if Fortran compilers need additional linker flags... none
checking  external symbol convention... double underscore
checking if C and Fortran are link compatible... yes
checking to see if Fortran compiler likes the C++ exception flags...
skipped (no C++ exceptions flags)
checking to see if mpifort compiler needs additional linker flags... none
checking if Fortran compiler supports CHARACTER... yes
checking size of Fortran CHARACTER... 1
checking for C type corresponding to CHARACTER... char
checking alignment of Fortran CHARACTER... 1
checking for corresponding KIND value of CHARACTER... C_SIGNED_CHAR
checking KIND value of Fortran C_SIGNED_CHAR... no ISO_C_BINDING -- fallback
checking Fortran value of selected_int_kind(4)... no
configure: WARNING: Could not determine KIND value of C_SIGNED_CHAR
configure: WARNING: See config.log for more details
configure: error: Cannot continue
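
In case it helps to isolate this outside of configure, here is a minimal
sketch of roughly what that check looks at (my guess at the test, not the
exact conftest that configure generates; it assumes openf95 is on the PATH):

cat > kind_test.f90 <<'EOF'
program kind_test
  use, intrinsic :: iso_c_binding
  ! the KIND of a C signed char, and the fallback kind configure tries next
  print *, c_signed_char
  print *, selected_int_kind(4)
end program kind_test
EOF
openf95 kind_test.f90 -o kind_test && ./kind_test

With intel and gcc this should just print two small integers; if it fails to
compile or run with open64, that would match what configure reports.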

Ekopath failed during make with the following error:
 PPFC     mpi-f08-sizeof.lo
  PPFC     mpi-f08.lo
In file included from mpi-f08.F90:37:
mpi-f-interfaces-bind.h:1908: warning: extra tokens at end of #endif
directive
mpi-f-interfaces-bind.h:2957: warning: extra tokens at end of #endif
directive
In file included from mpi-f08.F90:38:
pmpi-f-interfaces-bind.h:1911: warning: extra tokens at end of #endif
directive
pmpi-f-interfaces-bind.h:2963: warning: extra tokens at end of #endif
directive
pathf95-1044 pathf95: INTERNAL OMPI_OP_CREATE_F, File =
mpi-f-interfaces-bind.h, Line = 955, Column = 29
  Internal : Unexpected ATP_PGM_UNIT in check_interoperable_pgm_unit()
make[2]: *** [mpi-f08.lo] Error 1
make[2]: Leaving directory
`/tmp/mpi_install_tmp1400/openmpi-1.7rc8/ompi/mpi/fortran/use-mpi-f08'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/tmp/mpi_install_tmp1400/openmpi-1.7rc8/ompi'
make: *** [all-recursive] Error 1

This seems to be different from the error I got last time with rc7, and since
I'm not a Fortran person I can't make much sense of it. I used the following
version of the compiler:
http://c591116.r16.cf2.rackcdn.com/ekopath/nightly/Linux/ekopath-2013-02-26-installer.run

2) I ran a couple of IMB tests with the new version on a system of 10 nodes
with Intel Sandy Bridge processors and FDR ConnectX-3 InfiniBand adapters.
First I tried the following parameters:
mpirun -np $NP -hostfile hosts --mca btl
openib,sm,self --bind-to-core -npernode 16 --mca mpi_leave_pinned
1 ./IMB-MPI1 -npmin $NP -mem 4G $COLL
This combination complained about mpi_leave_pinned. The same line works with
1.6.3. Has something changed in the new release that I've missed?
--------------------------------------------------------------------------
A process attempted to use the "leave pinned" MPI feature, but no
memory registration hooks were found on the system at run time.  This
may be the result of running on a system that does not support memory
hooks or having some other software subvert Open MPI's use of the
memory hooks.  You can disable Open MPI's use of memory hooks by
setting both the mpi_leave_pinned and mpi_leave_pinned_pipeline MCA
parameters to 0.

Open MPI will disable any transports that are attempting to use the
leave pinned functionality; your job may still run, but may fall back
to a slower network transport (such as TCP).

  Mpool name: grdma
  Process:    [[13305,1],1]
  Local host: b23
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There is at least one OpenFabrics device found but there are
no active ports detected (or Open MPI was unable to use them).  This
is most certainly not what you wanted.  Check your cables, subnet
manager configuration, etc.  The openib BTL will be ignored for this
job.

  Local host: b23
--------------------------------------------------------------------------
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[13305,1],0]) is on host: b22
  Process 2 ([[13305,1],1]) is on host: b23
  BTLs attempted: self sm

Your MPI job is now going to abort; sorry.
...
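
As a workaround I could presumably do what the warning itself suggests and
disable the feature explicitly, something like (just restating the message,
I haven't verified that this is the right fix):
mpirun -np $NP -hostfile hosts --mca btl openib,sm,self --bind-to-core -npernode 16 \
    --mca mpi_leave_pinned 0 --mca mpi_leave_pinned_pipeline 0 \
    ./IMB-MPI1 -npmin $NP -mem 4G $COLL
But that would just paper over whatever changed between 1.6.3 and this rc.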

Then I ran a couple of point-to-point and collective tests. In general the
performance improved compared to 1.6.3, but there are several cases where it
got worse. Perhaps I need to do some tuning; could you please tell me which
parameters would suit my system better than the defaults?
Here is what I got for PingPong and PingPing with 1.7rc8 (the same
parameters as above, but with "-npernode 1"):
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.39         0.00
            1         1000         1.50         0.64
            2         1000         1.10         1.73
            4         1000         1.10         3.46
            8         1000         1.12         6.80
           16         1000         1.12        13.62
           32         1000         1.14        26.75
           64         1000         1.18        51.92
          128         1000         1.73        70.42
          256         1000         1.85       132.04
          512         1000         1.98       247.16
         1024         1000         2.26       431.52
         2048         1000         2.85       684.58
         4096         1000         3.49      1118.63
         8192         1000         4.48      1741.96
        16384         1000         9.58      1630.92
        32768         1000        14.27      2189.46
        65536          640        23.03      2713.71
       131072          320        35.55      3515.73
       262144          160        57.65      4336.77
       524288           80       101.42      4930.05
      1048576           40       188.00      5319.18
      2097152           20       521.70      3833.61
      4194304           10      1118.20      3577.19

#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.26         0.00
            1         1000         1.32         0.72
            2         1000         1.32         1.44
            4         1000         1.35         2.84
            8         1000         1.38         5.53
           16         1000         1.13        13.51
           32         1000         1.13        26.96
           64         1000         1.17        51.95
          128         1000         1.72        70.96
          256         1000         1.80       135.63
          512         1000         1.94       251.17
         1024         1000         2.23       437.51
         2048         1000         2.88       677.47
         4096         1000         3.49      1119.28
         8192         1000         4.75      1643.41
        16384         1000         9.90      1578.12
        32768         1000        14.54      2149.25
        65536          640        24.04      2599.79
       131072          320        37.00      3378.35
       262144          160        60.25      4149.39
       524288           80       105.74      4728.77
      1048576           40       196.73      5083.23
      2097152           20       785.79      2545.20
      4194304           10      1790.19      2234.40

And 1.6.3 gave the following:
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         1.06         0.00
            1         1000         0.94         1.01
            2         1000         0.95         2.02
            4         1000         0.95         4.01
            8         1000         0.97         7.90
           16         1000         0.98        15.63
           32         1000         0.99        30.86
           64         1000         1.02        59.60
          128         1000         1.58        77.23
          256         1000         1.71       142.73
          512         1000         1.86       263.15
         1024         1000         2.13       459.35
         2048         1000         2.72       718.31
         4096         1000         3.27      1194.74
         8192         1000         4.33      1802.57
        16384         1000         6.20      2521.78
        32768         1000         8.84      3535.46
        65536          640        14.28      4376.82
       131072          320        24.97      5005.06
       262144          160        44.94      5562.46
       524288           80        86.76      5763.29
      1048576           40       168.73      5926.77
      2097152           20       333.65      5994.32
      4194304           10       666.09      6005.16

#---------------------------------------------------
# Benchmarking PingPing
# #processes = 2
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
            0         1000         0.93         0.00
            1         1000         0.97         0.98
            2         1000         0.97         1.97
            4         1000         0.97         3.94
            8         1000         0.99         7.70
           16         1000         0.99        15.34
           32         1000         1.01        30.21
           64         1000         1.05        58.13
          128         1000         1.61        75.82
          256         1000         1.73       141.20
          512         1000         1.88       259.87
         1024         1000         2.17       450.21
         2048         1000         2.83       691.13
         4096         1000         3.45      1131.26
         8192         1000         4.76      1639.88
        16384         1000         7.76      2014.01
        32768         1000        10.34      3021.35
        65536          640        16.29      3836.55
       131072          320        26.72      4678.40
       262144          160        48.83      5120.31
       524288           80        91.85      5443.61
      1048576           40       178.65      5597.63
      2097152           20       351.31      5692.98
      4194304           10       701.69      5700.53

Sendrecv and Exchange also got worse. I can send additional data if
needed.

The performance of the collectives has generally improved slightly compared
to 1.6.3 or stayed the same. But in certain cases I got much better results
by forcing specific algorithms in the tuned collective component. In
particular, these suited my system better:
--mca coll_tuned_barrier_algorithm 6 (default and tuned):
#---------------------------------------------------
# Benchmarking Barrier
# #processes = 160
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        49.75        49.77        49.76
#---------------------------------------------------
# Benchmarking Barrier
# #processes = 160
#---------------------------------------------------
 #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
         1000        12.74        12.74        12.74

Bcast for small messages
--mca coll_tuned_bcast_algorithm 3 (default and tuned):
#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 160
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.01         0.02         0.02
            1         1000         9.87         9.96         9.92
            2         1000        10.44        10.51        10.47
            4         1000        10.30        10.37        10.34
            8         1000        10.34        10.43        10.38
           16         1000        10.39        10.48        10.43
           32         1000        10.36        10.43        10.40
           64         1000        10.38        10.44        10.41
          128         1000        10.11        10.22        10.17
          256         1000        11.37        11.54        11.48
          512         1000        14.09        14.25        14.19
         1024         1000        18.77        19.03        18.94
         2048         1000        13.47        13.63        13.58
         4096         1000        25.39        25.60        25.55
         8192         1000        50.80        51.11        51.04
        16384         1000       102.64       103.53       103.38
        32768         1000       280.86       281.80       281.62
        65536          640       387.10       391.90       391.26
       131072          320       779.58       796.04       794.30
       262144          160      1526.52      1597.39      1590.31
       524288           80       355.67       379.06       375.27
      1048576           40       702.95       753.65       736.29
      2097152           20      1518.11      1580.85      1551.57
      4194304           10      3183.22      3931.81      3676.94

#----------------------------------------------------------------
# Benchmarking Bcast
# #processes = 160
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.01         0.02         0.02
            1         1000         4.54         5.13         4.85
            2         1000         4.50         5.11         4.81
            4         1000         4.50         5.09         4.80
            8         1000         4.48         5.09         4.79
           16         1000         4.49         5.09         4.79
           32         1000         4.55         5.15         4.86
           64         1000         4.52         5.14         4.83
          128         1000         4.66         5.28         4.98
          256         1000         4.78         5.40         5.09
          512         1000         4.89         5.52         5.21
         1024         1000         5.15         5.81         5.48
         2048         1000         5.60         6.30         5.94
         4096         1000         8.25         8.67         8.46
         8192         1000        10.49        11.01        10.76
        16384         1000        20.05        20.87        20.50
        32768         1000        30.11        31.41        30.80
        65536          640        46.08        48.94        47.54
       131072          320        75.53        84.98        80.26
       262144          160       134.26       169.44       151.92
       524288           80       240.34       372.76       307.80
      1048576           40       427.00       951.02       699.41
      2097152           20       933.41      3170.45      2076.21
      4194304           10      2682.40     16020.39      9718.86

and AllGatherv:
--mca coll_tuned_allgatherv_algorithm 5 (default and tuned):
#----------------------------------------------------------------
# Benchmarking Allgatherv
# #processes = 160
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.06         0.07         0.06
            1         1000        54.11        54.15        54.13
            2         1000        52.74        52.78        52.76
            4         1000        55.09        55.13        55.11
            8         1000        58.48        58.52        58.50
           16         1000        61.99        62.03        62.01
           32         1000        69.31        69.35        69.32
           64         1000        88.13        88.18        88.16
          128         1000       126.62       126.71       126.68
          256         1000       215.26       215.34       215.31
          512         1000       832.54       833.01       832.57
         1024         1000       928.81       929.31       928.86
         2048         1000      1072.77      1073.35      1072.85
         4096         1000      1222.82      1223.42      1222.90
         8192         1000      1713.46      1714.13      1713.87
        16384         1000      2596.87      2598.31      2597.40
        32768         1000      4153.70      4154.09      4153.92
        65536          640      6795.04      6796.32      6795.83
       131072          320     12076.74     12083.04     12080.28
       262144          160     23120.98     23153.76     23138.10
       524288           80     49077.99     49204.79     49142.48
      1048576           40    132120.25    132675.60    132400.38
      2097152           20    240537.20    241821.05    241138.53
      4194304           10    457125.71    459065.10    458035.03

#----------------------------------------------------------------
# Benchmarking Allgatherv
# #processes = 160
#----------------------------------------------------------------
       #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
            0         1000         0.06         0.07         0.06
            1         1000         0.47         0.56         0.52
            2         1000         0.47         0.57         0.51
            4         1000         0.48         0.56         0.52
            8         1000         0.46         0.56         0.51
           16         1000         0.47         0.57         0.52
           32         1000         0.47         0.56         0.52
           64         1000         0.47         0.57         0.52
          128         1000         0.50         0.62         0.57
          256         1000         0.58         0.68         0.63
          512         1000         0.62         0.81         0.70
         1024         1000         0.71         0.97         0.80
         2048         1000         0.89         1.24         1.05
         4096         1000         2.21         2.58         2.40
         8192         1000         3.08         3.55         3.30
        16384         1000         4.77         5.56         5.11
        32768         1000         7.99         9.75         8.90
        65536          640        15.81        19.35        17.69
       131072          320        34.18        39.74        36.95
       262144          160        71.72        80.37        76.06
       524288           80       143.64       161.81       152.36
      1048576           40       781.10       868.80       825.57
      2097152           20      2594.30      2795.45      2672.58
      4194304           10      5185.79      5451.20      5298.98

This time I only ran the tests on 160 processes, but previously I did more
testing with 1.6 on different process counts (from 16 to 320), and those
tuned parameters helped almost every time. I don't know what the default
parameters are tuned for, but perhaps it would be a good idea to change the
defaults for the kind of system I use.
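
For reference, this is roughly the command line I use to force those
algorithms (a sketch; as far as I understand the coll_tuned_*_algorithm
parameters only take effect together with coll_tuned_use_dynamic_rules, so I
set that too):
mpirun -np $NP -hostfile hosts --mca btl openib,sm,self -npernode 16 \
    --mca coll_tuned_use_dynamic_rules 1 \
    --mca coll_tuned_barrier_algorithm 6 \
    --mca coll_tuned_bcast_algorithm 3 \
    --mca coll_tuned_allgatherv_algorithm 5 \
    ./IMB-MPI1 -npmin $NP -mem 4G $COLL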



I can perform some additional tests if necessary or give more information
on the problems I've come across.

Regards, Pavel Mezentsev.


2013/2/27 Jeff Squyres (jsquyres) <jsquy...@cisco.com>

> The goal is to release 1.7 (final) by the end of this week.  New rc posted
> with fairly small changes:
>
>     http://www.open-mpi.org/software/ompi/v1.7/
>
> - Fix wrong header file / compilation error in bcol
> - Support MXM STREAM for isend and irecv
> - Make sure "mpirun <dirname>" fails with $status!=0
> - Bunches of cygwin minor fixes
> - Make sure the fortran compiler supports BIND(C) with LOGICAL for the F08
> bindings
> - Fix --disable-mpi-io with the F08 bindings
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
