Mattias,

IIRC, the OFI BTL only creates one EP. If you move that to add_procs, you might need to 
add some checks so that the EP is not re-created over and over. Do you think moving EP 
creation from component_init to component_open would solve the problem?
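
For illustration, here is a rough sketch of the kind of guard I mean. The names 
(ep_created, mca_btl_ofi_open_endpoint) are hypothetical placeholders, not the 
actual OFI BTL symbols:

/* Sketch only: defer endpoint creation to add_procs and guard it so that
 * repeated add_procs calls do not re-create the EP. The ep_created flag and
 * mca_btl_ofi_open_endpoint() helper below are hypothetical. */
static int mca_btl_ofi_add_procs(struct mca_btl_base_module_t *btl,
                                 size_t nprocs,
                                 struct opal_proc_t **procs,
                                 struct mca_btl_base_endpoint_t **peers,
                                 opal_bitmap_t *reachable)
{
    mca_btl_ofi_module_t *ofi_btl = (mca_btl_ofi_module_t *) btl;

    /* Open the fabric endpoint only on the first call. */
    if (!ofi_btl->ep_created) {
        int rc = mca_btl_ofi_open_endpoint(ofi_btl);  /* hypothetical helper */
        if (OPAL_SUCCESS != rc) {
            return rc;
        }
        ofi_btl->ep_created = true;
    }

    /* ... existing per-proc endpoint setup would continue here ... */
    return OPAL_SUCCESS;
}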

Arm

> On Sep 19, 2018, at 1:08 PM, Cabral, Matias A <matias.a.cab...@intel.com> 
> wrote:
> 
> Hi Edgar,
>  
> I also saw some similar issues, not exactly the same, but they look very similar 
> (maybe because of a different libpsm2 version). Items 1 and 2 are related to the 
> introduction of the OFI BTL and the fact that it opens an OFI EP in its init 
> function. I see that all BTLs have their init function called at transport 
> selection time. Moreover, this happens even when you explicitly ask for a 
> different one (-mca pml cm -mca mtl psm2).  Workaround:  -mca btl ^ofi.  My 
> current idea is to update the OFI BTL and move the EP opening to add_procs. 
> Feedback?
>  
> Number 3 goes beyond me.
>  
> Thanks,
>  
> _MAC
>  
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gabriel, Edgar
> Sent: Wednesday, September 19, 2018 9:25 AM
> To: Open MPI Developers <devel@lists.open-mpi.org>
> Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1
>  
> I performed some tests on our Omnipath cluster, and I have a mixed bag of 
> results with 4.0.0rc1
>  
> 1. Good news: the problems with the psm2 MTL that I reported in June/July seem 
> to be fixed. However, I still get a warning every time I run a job with 4.0.0, 
> e.g.
>  
> compute-1-1.local.4351PSM2 has not been initialized
> compute-1-0.local.3826PSM2 has not been initialized
>  
> although, based on the performance, it is very clear that psm2 is being used. 
> I double-checked with the 3.0 series; I do not get the same warnings on the 
> same set of nodes. The unfortunate part about this error message is that 
> applications appear to return a non-zero exit code (although tests and 
> applications otherwise seem to finish correctly).
>  
> --------------------------------------------------------------------------
> mpirun detected that one or more processes exited with non-zero status, thus 
> causing
> the job to be terminated. The first process to do so was:
>  
>               Process name: [[38418,1],1]
>               Exit code:    255
>               
> ----------------------------------------------------------------------------
>  
> 2. The ofi MTL does not work at all on our Omnipath cluster. If I try to force 
> it using 'mpirun --mca mtl ofi ...' I get the following error message.
>  
> [compute-1-0:03988] *** An error occurred in MPI_Barrier
> [compute-1-0:03988] *** reported by process [2712141825,0]
> [compute-1-0:03988] *** on communicator MPI_COMM_WORLD
> [compute-1-0:03988] *** MPI_ERR_OTHER: known error not in list
> [compute-1-0:03988] *** MPI_ERRORS_ARE_FATAL (processes in this communicator 
> will now abort,
> [compute-1-0:03988] ***    and potentially your MPI job)
> [sabine.cacds.uh.edu:21046] 1 more process has sent help message 
> help-mpi-errors.txt / mpi_errors_are_fatal
> [sabine.cacds.uh.edu:21046] Set MCA parameter "orte_base_help_aggregate" to 0 
> to see all help / error messages
>  
> I double-checked once again that this works correctly in the 3.0 series (and 
> 3.1, although I did not re-run that test this time).
>  
> 3. The openib BTL component always gets in the way with annoying warnings. It 
> is not actually used, but it constantly complains:
>  
>  
> [sabine.cacds.uh.edu:25996] 1 more process has sent help message 
> help-mpi-btl-openib.txt / ib port not selected
> [sabine.cacds.uh.edu:25996] Set MCA parameter "orte_base_help_aggregate" to 0 
> to see all help / error messages
> [sabine.cacds.uh.edu:25996] 1 more process has sent help message 
> help-mpi-btl-openib.txt / error in device init
>  
> So bottom line, if I do
>  
> mpirun --mca btl ^openib --mca mtl ^ofi ...
>  
> my tests finish correctly, although mpirun will still return an error.
>  
> Thanks
> Edgar
>  
>  
> From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Geoffrey Paulsen
> Sent: Sunday, September 16, 2018 2:31 PM
> To: devel@lists.open-mpi.org
> Subject: [OMPI devel] Announcing Open MPI v4.0.0rc1
>  
> The first release candidate for the Open MPI v4.0.0 release is posted at 
> https://www.open-mpi.org/software/ompi/v4.0/
> Major changes include:
>  
> 4.0.0 -- September, 2018
> ------------------------
>  
> - OSHMEM updated to the OpenSHMEM 1.4 API.
> - Do not build the Open SHMEM layer when there are no SPMLs available.
>   Currently, this means the Open SHMEM layer will only build if
>   an MXM or UCX library is found.
> - A UCX BTL was added for enhanced MPI RMA support using UCX.
> - With this release, the OpenIB BTL only supports iWARP and RoCE by default.
> - Updated internal HWLOC to 2.0.1
> - Updated internal PMIx to 3.0.1
> - Change the priority for selecting external versus internal HWLOC
>   and PMIx packages to build.  Starting with this release, configure
>   by default selects available external HWLOC and PMIx packages over
>   the internal ones.
> - Updated internal ROMIO to 3.2.1.
> - Removed support for the MXM MTL.
> - Improved CUDA support when using UCX.
> - Improved support for two phase MPI I/O operations when using OMPIO.
> - Added support for Software-based Performance Counters, see
>   https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI
> - Various improvements to MPI RMA performance when using RDMA
>   capable interconnects.
> - Update memkind component to use the memkind 1.6 public API.
> - Fix problems with use of newer map-by mpirun options.  Thanks to
>   Tony Reina for reporting.
> - Fix rank-by algorithms to properly rank by object and span.
> - Allow for running as root if two environment variables are set.
>   Requested by Axel Huebl.
> - Fix a problem with building the Java bindings when using Java 10.
>   Thanks to Bryce Glover for reporting.
> Our goal is to release 4.0.0 by mid-October, so any testing is appreciated.
>  
>  

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
