I performed some tests on our Omnipath cluster, and I have a mixed bag of 
results with 4.0.0rc1


1.       Good news: the problems with the psm2 mtl that I reported in June/July 
seem to be fixed. However, I still get a warning every time I run a job with 
4.0.0, e.g.



compute-1-1.local.4351PSM2 has not been initialized

compute-1-0.local.3826PSM2 has not been initialized

although, based on the performance, it is very clear that psm2 is being used. I 
double checked with the 3.0 series: I do not get the same warnings on the same 
set of nodes. The unfortunate part about this message is that applications then 
appear to return an error exit code, even though the tests and applications 
otherwise seem to finish correctly:

--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

              Process name: [[38418,1],1]
              Exit code:    255

--------------------------------------------------------------------------
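
For what it is worth, the way I check this is roughly as follows (the binary 
name and process count are just placeholders from my test setup, and I believe 
mtl_base_verbose is the right parameter for seeing which mtl gets selected):

mpirun --mca mtl psm2 --mca mtl_base_verbose 100 -np 2 ./osu_latency
echo $?

which is how I noticed that mpirun itself exits non-zero even when the 
benchmark output looks fine.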


2.       The ofi mtl does not work at all on our Omnipath cluster. If I try to 
force it using 'mpirun --mca mtl ofi ...' I get the following error message.



[compute-1-0:03988] *** An error occurred in MPI_Barrier

[compute-1-0:03988] *** reported by process [2712141825,0]

[compute-1-0:03988] *** on communicator MPI_COMM_WORLD

[compute-1-0:03988] *** MPI_ERR_OTHER: known error not in list

[compute-1-0:03988] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
will now abort,

[compute-1-0:03988] ***    and potentially your MPI job)

[sabine.cacds.uh.edu:21046] 1 more process has sent help message 
help-mpi-errors.txt / mpi_errors_are_fatal

[sabine.cacds.uh.edu:21046] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages



I once again double checked that this works correctly in the 3.0 series (and 
3.1, although I did not rerun that test this time).
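
In case it helps narrow down where the MPI_Barrier failure comes from, the full 
invocation I use to force the ofi mtl is essentially the following (placeholder 
binary again; the provider_include parameter is only my attempt at pinning the 
psm2 provider explicitly, assuming I have its name right):

mpirun --mca mtl ofi --mca mtl_ofi_provider_include psm2 --mca mtl_base_verbose 100 -np 2 ./osu_latency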



3.       The openib btl component keeps getting in the way with annoying 
warnings. It is not actually being used, but it constantly complains:



[sabine.cacds.uh.edu:25996] 1 more process has sent help message 
help-mpi-btl-openib.txt / ib port not selected
[sabine.cacds.uh.edu:25996] Set MCA parameter "orte_base_help_aggregate" to 0 
to see all help / error messages
[sabine.cacds.uh.edu:25996] 1 more process has sent help message 
help-mpi-btl-openib.txt / error in device init

So, bottom line: if I run

mpirun --mca btl ^openib --mca mtl ^ofi ...

my tests finish correctly, although mpirun still returns an error.
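
For longer runs it is easier to set the same exclusions through the environment 
instead of on every mpirun line; if I remember the variable naming correctly, 
the equivalent is (binary and process count again just placeholders):

export OMPI_MCA_btl=^openib
export OMPI_MCA_mtl=^ofi
mpirun -np 2 ./osu_latency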

Thanks
Edgar


From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Geoffrey 
Paulsen
Sent: Sunday, September 16, 2018 2:31 PM
To: devel@lists.open-mpi.org
Subject: [OMPI devel] Announcing Open MPI v4.0.0rc1


The first release candidate for the Open MPI v4.0.0 release is posted at

https://www.open-mpi.org/software/ompi/v4.0/

Major changes include:



4.0.0 -- September, 2018

------------------------



- OSHMEM updated to the OpenSHMEM 1.4 API.

- Do not build Open SHMEM layer when there are no SPMLs available.

  Currently, this means the Open SHMEM layer will only build if

  an MXM or UCX library is found.

- A UCX BTL was added for enhanced MPI RMA support using UCX.

- With this release, the OpenIB BTL now only supports iWARP and RoCE by default.

- Updated internal HWLOC to 2.0.1

- Updated internal PMIx to 3.0.1

- Change the priority for selecting external versus internal HWLOC

  and PMIx packages to build.  Starting with this release, configure

  by default selects available external HWLOC and PMIx packages over

  the internal ones.

- Updated internal ROMIO to 3.2.1.

- Removed support for the MXM MTL.

- Improved CUDA support when using UCX.

- Improved support for two phase MPI I/O operations when using OMPIO.

- Added support for Software-based Performance Counters, see

  https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI-

- Various improvements to MPI RMA performance when using RDMA

  capable interconnects.

- Update memkind component to use the memkind 1.6 public API.

- Fix problems with use of newer map-by mpirun options.  Thanks to

  Tony Reina for reporting.

- Fix rank-by algorithms to properly rank by object and span

- Allow for running as root if two environment variables are set.

  Requested by Axel Huebl.

- Fix a problem with building the Java bindings when using Java 10.

  Thanks to Bryce Glover for reporting.

Our goal is to release 4.0.0 by mid-October, so any testing is appreciated.


_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
