Hi Arm,

> IIRC, the OFI BTL only creates one EP

Correct. But only one is needed to trigger the issues below. There are different manifestations depending on the combination of MTL (OFI or PSM2), the version of libpsm2, and whether OFI scalable endpoints are supported.
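For reference, opening even that single endpoint already walks the full libfabric setup chain, which is enough to touch the provider (and, on Omni-Path, libpsm2). A simplified sketch using standard libfabric calls; the function name and the error handling are illustrative, not the actual mca_btl_ofi code:

    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_endpoint.h>

    /* Illustrative only: open one OFI endpoint the way a BTL init might. */
    static int open_one_ofi_ep(struct fid_ep **ep_out)
    {
        struct fi_info *hints = fi_allocinfo(), *info = NULL;
        struct fid_fabric *fabric = NULL;
        struct fid_domain *domain = NULL;
        int ret;

        if (NULL == hints) return -1;
        hints->ep_attr->type = FI_EP_RDM;  /* reliable datagram endpoint */
        hints->caps = FI_RMA;              /* the BTL is after RMA support */

        /* Provider discovery alone can already load provider libraries
         * such as psm2. */
        ret = fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info);
        if (0 == ret) ret = fi_fabric(info->fabric_attr, &fabric, NULL);
        if (0 == ret) ret = fi_domain(fabric, info, &domain, NULL);
        /* Creating the endpoint is what actually claims a device context. */
        if (0 == ret) ret = fi_endpoint(domain, info, ep_out, NULL);

        fi_freeinfo(hints);
        if (NULL != info) fi_freeinfo(info);
        return ret;
    }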
> Do you think moving EP creation from component_init to component_open
> will solve the problem?

If component_open is only called when the component is (or will be) effectively used, it may work. Let me check.

Thanks,
_MAC

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Thananon Patinyasakdikul
Sent: Wednesday, September 19, 2018 10:15 AM
To: Open MPI Developers <devel@lists.open-mpi.org>
Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

Matias,

IIRC, the OFI BTL only creates one EP. If you move it to add_procs, you might need to add a check to avoid re-creating the EP over and over.

Do you think moving EP creation from component_init to component_open will solve the problem?

Arm

On Sep 19, 2018, at 1:08 PM, Cabral, Matias A <matias.a.cab...@intel.com> wrote:

Hi Edgar,

I also saw some similar issues; not exactly the same, but they look very similar (maybe because of different versions of libpsm2).

Issues 1 and 2 are related to the introduction of the OFI BTL and the fact that it opens an OFI EP in its init function. All BTLs call their init function during transport selection, and this happens even when you explicitly ask for a different transport (-mca pml cm -mca mtl psm2). Workaround: -mca btl ^ofi.

My current idea is to update the OFI BTL and move the EP opening to add_procs. Feedback?
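Roughly along these lines; this is only a hypothetical sketch (the module layout and names are illustrative, not the real mca_btl_ofi structures), reusing the open_one_ofi_ep() sketch above. Since add_procs can be called more than once, creation is guarded so the EP is only opened on the first call:

    #include <stdbool.h>
    #include <pthread.h>
    #include <rdma/fi_endpoint.h>

    /* Illustrative module state; not the actual mca_btl_ofi_module_t. */
    struct ofi_btl_module {
        pthread_mutex_t lock;
        bool ep_ready;              /* set once the endpoint exists */
        struct fid_ep *ep;
    };

    static int ofi_btl_add_procs(struct ofi_btl_module *module /*, procs... */)
    {
        int ret = 0;

        pthread_mutex_lock(&module->lock);
        if (!module->ep_ready) {            /* first call opens the EP */
            ret = open_one_ofi_ep(&module->ep);
            if (0 == ret) {
                module->ep_ready = true;    /* later calls skip creation */
            }
        }
        pthread_mutex_unlock(&module->lock);
        return ret;                         /* then proceed to set up peers */
    }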
Number 3 goes beyond me.

Thanks,
_MAC

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Gabriel, Edgar
Sent: Wednesday, September 19, 2018 9:25 AM
To: Open MPI Developers <devel@lists.open-mpi.org>
Subject: Re: [OMPI devel] Announcing Open MPI v4.0.0rc1

I performed some tests on our Omni-Path cluster, and I have a mixed bag of results with 4.0.0rc1.

1. Good news: the problems with the psm2 MTL that I reported in June/July seem to be fixed. However, I still get a warning every time I run a job with 4.0.0, e.g.

compute-1-1.local.4351PSM2 has not been initialized
compute-1-0.local.3826PSM2 has not been initialized

although based on the performance it is very clear that psm2 is being used. I double-checked with the 3.0 series: I do not get these warnings on the same set of nodes. The unfortunate part about this message is that applications then return an error, although the tests and applications otherwise seem to finish correctly:

--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[38418,1],1]
  Exit code:    255
--------------------------------------------------------------------------

2. The ofi MTL does not work at all on our Omni-Path cluster. If I try to force it using 'mpirun --mca mtl ofi ...' I get the following error message:

[compute-1-0:03988] *** An error occurred in MPI_Barrier
[compute-1-0:03988] *** reported by process [2712141825,0]
[compute-1-0:03988] *** on communicator MPI_COMM_WORLD
[compute-1-0:03988] *** MPI_ERR_OTHER: known error not in list
[compute-1-0:03988] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[compute-1-0:03988] ***    and potentially your MPI job)
[sabine.cacds.uh.edu:21046] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[sabine.cacds.uh.edu:21046] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

I once again double-checked that this works correctly in the 3.0 series (and 3.1, although I did not rerun that test this time).

3. The openib BTL component always gets in the way with annoying warnings. It is not actually used, but it constantly complains:

[sabine.cacds.uh.edu:25996] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
[sabine.cacds.uh.edu:25996] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[sabine.cacds.uh.edu:25996] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init

So, bottom line: if I run 'mpirun --mca btl ^openib --mca mtl ^ofi ...', my tests finish correctly, although mpirun will still return an error.

Thanks
Edgar

From: devel [mailto:devel-boun...@lists.open-mpi.org] On Behalf Of Geoffrey Paulsen
Sent: Sunday, September 16, 2018 2:31 PM
To: devel@lists.open-mpi.org
Subject: [OMPI devel] Announcing Open MPI v4.0.0rc1

The first release candidate for the Open MPI v4.0.0 release is posted at
https://www.open-mpi.org/software/ompi/v4.0/

Major changes include:

4.0.0 -- September, 2018
------------------------

- OSHMEM updated to the OpenSHMEM 1.4 API.
- Do not build the Open SHMEM layer when there are no SPMLs available. Currently, this means the Open SHMEM layer will only build if an MXM or UCX library is found.
- A UCX BTL was added for enhanced MPI RMA support using UCX.
- With this release, the OpenIB BTL only supports iWARP and RoCE by default.
- Updated internal HWLOC to 2.0.1.
- Updated internal PMIx to 3.0.1.
- Changed the priority for selecting external versus internal HWLOC and PMIx packages. Starting with this release, configure by default selects available external HWLOC and PMIx packages over the internal ones.
- Updated internal ROMIO to 3.2.1.
- Removed support for the MXM MTL.
- Improved CUDA support when using UCX.
- Improved support for two-phase MPI I/O operations when using OMPIO.
- Added support for software-based performance counters (SPCs); see https://github.com/davideberius/ompi/wiki/How-to-Use-Software-Based-Performance-Counters-(SPCs)-in-Open-MPI
- Various improvements to MPI RMA performance when using RDMA-capable interconnects.
- Updated the memkind component to use the memkind 1.6 public API.
- Fixed problems with the use of newer map-by mpirun options. Thanks to Tony Reina for reporting.
- Fixed rank-by algorithms to properly rank by object and span.
- Allow running as root if two environment variables are set. Requested by Axel Huebl.
- Fixed a problem with building the Java bindings when using Java 10. Thanks to Bryce Glover for reporting.

Our goal is to release 4.0.0 by mid-October, so any testing is appreciated.