Re: [OMPI devel] v2.1.5rc1 is out
I would assume so as well, and the 2.x series is not really critical for these systems, especially since 3.x does not have the problem. I have no problem ignoring it.

> On Aug 17, 2018, at 3:48 PM, Jeff Squyres (jsquyres) via devel wrote:
>
> Thanks for the testing.
>
> I'm assuming the MXM failure has been around for a while, and the correct way to fix it is to upgrade to a newer Open MPI and/or use UCX.
>
>> On Aug 17, 2018, at 11:01 AM, Vallee, Geoffroy R. wrote:
>>
>> FYI, that segfault problem did not occur when I tested 3.1.2rc1.
>>
>> Thanks,
>>
>>> On Aug 17, 2018, at 10:28 AM, Pavel Shamis wrote:
>>>
>>> It looks to me like an mxm-related failure?
>>>
>>> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R. wrote:
>>> Hi,
>>>
>>> I ran some tests on Summitdev here at ORNL:
>>> - The UCX problem is solved and I get the expected results for the tests that I am running (NetPIPE and IMB).
>>> - Without UCX:
>>>   * The performance numbers are below what would be expected, but I believe at this point that the slight performance deficiency is due to other users using other parts of the system.
>>>   * I also encountered the following problem while running IMB_EXT, and I now realize that I had the same problem with 2.1.4rc1 but did not catch it at the time:
>>>
>>> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
>>> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>>> backtrace
>>>  2 0x00073864 mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>>  3 0x00073fa4 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>>  4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>>  5 0x000d4634 ompi_osc_base_select()  ??:0
>>>  6 0x00065e84 ompi_win_create()  ??:0
>>>  7 0x000a2488 PMPI_Win_create()  ??:0
>>>  8 0x1000b28c IMB_window()  ??:0
>>>  9 0x10005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x10001ef8 main()  ??:0
>>> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x00024b74 __libc_start_main()  ??:0
>>> ===
>>> backtrace
>>>  2 0x00073864 mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>>  3 0x00073fa4 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>>  4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>>  5 0x000d4634 ompi_osc_base_select()  ??:0
>>>  6 0x00065e84 ompi_win_create()  ??:0
>>>  7 0x000a2488 PMPI_Win_create()  ??:0
>>>  8 0x1000b28c IMB_window()  ??:0
>>>  9 0x10005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x10001ef8 main()  ??:0
>>> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x00024b74 __libc_start_main()  ??:0
>>> ===
>>>
>>> FYI, the 2.x series is not important to me, so it can stay as is. I will move on to testing 3.1.2rc1.
>>>
>>> Thanks,
>>>
>>>> On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel wrote:
>>>>
>>>> Per our discussion over the weekend and on the weekly webex yesterday, we're releasing v2.1.5. There are only two changes:
>>>>
>>>> 1. A trivial link issue for UCX.
>>>> 2. A fix for the vader BTL issue. This is how I described it in NEWS:
>>>>
>>>> - A subtle race condition bug was discovered in the "vader" BTL (shared memory communications) that, in rare instances, can cause MPI processes to crash or incorrectly classify (or effectively drop) an MPI message sent via shared memory. If you are using the "ob1" PML with "vader" for shared memory communication (note that vader is the default for shared memory communication with ob1), you need to upgrade to v2.1.5 to fix this issue. You may also upgrade to the following versions to fix this issue:
>>>>   - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x series
>>>>   - Open MPI v3.1.2 (expected end of August, 2018) or later
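The NEWS text above conditions the fix on running the "ob1" PML with the "vader" BTL. For anyone unsure whether their jobs actually take that path, here is a small sketch using standard Open MPI tooling (`ompi_info` and the MCA verbosity parameters); the application name is a placeholder, and the exact verbose output differs between releases.

```shell
# Sketch: check whether a build/run selects ob1 + vader (the affected path).
# Output format varies by Open MPI version; ./my_mpi_app is a placeholder.

# List the PML and BTL components this installation provides:
ompi_info | grep -E "MCA (pml|btl)"

# Make the selection logic report which PML/BTL were actually chosen:
mpirun -np 2 --mca pml_base_verbose 10 --mca btl_base_verbose 10 \
    ./my_mpi_app 2>&1 | grep -Ei "select|vader"
```

If vader shows up in the selection output on a v2.1.x build, the race described above applies and upgrading is the safe course.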
Re: [OMPI devel] v3.1.2rc1 is posted
Hi,

I tested the RC on Summitdev at ORNL and everything is looking fine.

Thanks,

> On Aug 15, 2018, at 6:16 PM, Barrett, Brian via devel wrote:
>
> The first release candidate for the 3.1.2 release is posted at https://www.open-mpi.org/software/ompi/v3.1/
>
> Major changes include fixing the race condition in vader (the same one that caused v2.1.5rc1 to be posted today) as well as:
>
> - Assorted Portals 4.0 bug fixes.
> - Fix for possible data corruption in MPI_BSEND.
> - Move shared memory file for vader btl into /dev/shm on Linux.
> - Fix for MPI_ISCATTER/MPI_ISCATTERV Fortran interfaces with MPI_IN_PLACE.
> - Upgrade PMIx to v2.1.3.
> - Numerous one-sided bug fixes.
> - Fix for race condition in uGNI BTL.
> - Improve handling of large numbers of interfaces with the TCP BTL.
> - Numerous UCX bug fixes.
>
> Our goal is to release 3.1.2 around the same time as 2.1.5 (hopefully the end of this week), so any testing is appreciated.

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
Re: [OMPI devel] v2.1.5rc1 is out
FYI, that segfault problem did not occur when I tested 3.1.2rc1.

Thanks,

> On Aug 17, 2018, at 10:28 AM, Pavel Shamis wrote:
>
> It looks to me like an mxm-related failure?
>
> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R. wrote:
> Hi,
>
> I ran some tests on Summitdev here at ORNL:
> - The UCX problem is solved and I get the expected results for the tests that I am running (NetPIPE and IMB).
> - Without UCX:
>   * The performance numbers are below what would be expected, but I believe at this point that the slight performance deficiency is due to other users using other parts of the system.
>   * I also encountered the following problem while running IMB_EXT, and I now realize that I had the same problem with 2.1.4rc1 but did not catch it at the time:
>
> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
> backtrace
>  2 0x00073864 mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
> backtrace
>  2 0x00073864 mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
>
> FYI, the 2.x series is not important to me, so it can stay as is. I will move on to testing 3.1.2rc1.
>
> Thanks,
>
> > On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel wrote:
> >
> > Per our discussion over the weekend and on the weekly webex yesterday, we're releasing v2.1.5. There are only two changes:
> >
> > 1. A trivial link issue for UCX.
> > 2. A fix for the vader BTL issue. This is how I described it in NEWS:
> >
> > - A subtle race condition bug was discovered in the "vader" BTL (shared memory communications) that, in rare instances, can cause MPI processes to crash or incorrectly classify (or effectively drop) an MPI message sent via shared memory. If you are using the "ob1" PML with "vader" for shared memory communication (note that vader is the default for shared memory communication with ob1), you need to upgrade to v2.1.5 to fix this issue. You may also upgrade to the following versions to fix this issue:
> >   - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x series
> >   - Open MPI v3.1.2 (expected end of August, 2018) or later
> >
> > This vader fix was deemed serious enough to generate a 2.1.5 release. This really will be the end of the 2.1.x series. Trust me; my name is Joe Isuzu.
> >
> > 2.1.5rc1 will be available from the usual location in a few minutes (the website will update in about 7 minutes):
> >
> >    https://www.open-mpi.org/software/ompi/v2.1/
> >
> > --
> > Jeff Squyres
> > jsquy...@cisco.com
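Since the crash above is raised from ompi_osc_rdma_component_query() while MXM's error handler reports it, one way to triage this class of failure is to exclude the suspect components at run time and re-run IMB_EXT. This is only a sketch: the "^" exclusion syntax is standard MCA usage, but whether either line avoids the bug on a given build is not guaranteed.

```shell
# Triage sketch: exclude the one-sided "rdma" component whose query segfaults
# ("^" means "everything except" in MCA component lists):
mpirun -np 2 --mca osc ^rdma ./IMB-EXT

# Or take MXM out of the picture entirely and force the ob1 PML:
mpirun -np 2 --mca pml ob1 --mca mtl ^mxm ./IMB-EXT
```

If IMB_EXT passes with one of these exclusions, that narrows the fault to the excluded component rather than the benchmark or the fabric.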
Re: [OMPI devel] v2.1.5rc1 is out
Hi, I ran some tests on Summitdev here at ORNL: - the UCX problem is solved and I get the expected results for the tests that I am running (netpipe and IMB). - without UCX: * the performance numbers are below what would be expected but I believe at this point that the slight performance deficiency is due to other users using other parts of the system. * I also encountered the following problem while running IMB_EXT and I now realize that I had the same problem with 2.4.1rc1 but did not catch it at the time: [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault) [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault) backtrace 2 0x00073864 mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641 3 0x00073fa4 mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616 4 0x00017b24 ompi_osc_rdma_component_query() osc_rdma_component.c:0 5 0x000d4634 ompi_osc_base_select() ??:0 6 0x00065e84 ompi_win_create() ??:0 7 0x000a2488 PMPI_Win_create() ??:0 8 0x1000b28c IMB_window() ??:0 9 0x10005764 IMB_init_buffers_iter() ??:0 10 0x10001ef8 main() ??:0 11 0x00024980 generic_start_main.isra.0() libc-start.c:0 12 0x00024b74 __libc_start_main() ??:0 === backtrace 2 0x00073864 mxm_handle_error() /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641 3 0x00073fa4 mxm_error_signal_handler() /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616 4 0x00017b24 ompi_osc_rdma_component_query() osc_rdma_component.c:0 5 0x000d4634 ompi_osc_base_select() ??:0 6 0x00065e84 ompi_win_create() ??:0 7 0x000a2488 PMPI_Win_create() ??:0 8 0x1000b28c IMB_window() ??:0 9 0x10005764 IMB_init_buffers_iter() ??:0 10 0x10001ef8 main() ??:0 11 0x00024980 generic_start_main.isra.0() libc-start.c:0 12 0x00024b74 __libc_start_main() ??:0 === FYI, the 2.x series is not important to me so it can stay as is. I will move on testing 3.1.2rc1. 
Thanks, > On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel > wrote: > > Per our discussion over the weekend and on the weekly webex yesterday, we're > releasing v2.1.5. There are only two changes: > > 1. A trivial link issue for UCX. > 2. A fix for the vader BTL issue. This is how I described it in NEWS: > > - A subtle race condition bug was discovered in the "vader" BTL > (shared memory communications) that, in rare instances, can cause > MPI processes to crash or incorrectly classify (or effectively drop) > an MPI message sent via shared memory. If you are using the "ob1" > PML with "vader" for shared memory communication (note that vader is > the default for shared memory communication with ob1), you need to > upgrade to v2.1.5 to fix this issue. You may also upgrade to the > following versions to fix this issue: > - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x >series > - Open MPI v3.1.2 (expected end of August, 2018) or later > > This vader fix was warranted serious enough to generate a 2.1.5 release. > This really will be the end of the 2.1.x series. Trust me; my name is Joe > Isuzu. > > 2.1.5rc1 will be available from the usual location in a few minutes (the > website will update in about 7 minutes): > >https://www.open-mpi.org/software/ompi/v2.1/ > > -- > Jeff Squyres > jsquy...@cisco.com > > ___ > devel mailing list > devel@lists.open-mpi.org > https://lists.open-mpi.org/mailman/listinfo/devel ___ devel mailing list devel@lists.open-mpi.org https://lists.open-mpi.org/mailman/listinfo/devel
Re: [OMPI devel] Open MPI v2.1.4rc1
Hi,

I tested on Summitdev here at ORNL, and here are my comments (but I only have a limited set of data for Summitdev, so my feedback is somewhat limited):

- netpipe/mpi is showing a slightly lower bandwidth than the 3.x series (I do not believe it is a problem).
- I am facing a problem with UCX; it is unclear to me whether it is relevant, since I am using UCX master and I do not know whether it is expected to work with OMPI v2.1.x. Note that I am using the same tool for testing all other releases of Open MPI and I never had that problem before, bearing in mind that I only tested the 3.x series so far.

make[2]: Entering directory `/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi/mca/pml/ucx'
/bin/sh ../../../../libtool --tag=CC --mode=link gcc -std=gnu99 -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -module -avoid-version -o mca_pml_ucx.la -rpath /ccs/home/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_install/lib/openmpi pml_ucx.lo pml_ucx_request.lo pml_ucx_datatype.lo pml_ucx_component.lo -lucp -lrt -lm -lutil
libtool: link: gcc -std=gnu99 -shared -fPIC -DPIC .libs/pml_ucx.o .libs/pml_ucx_request.o .libs/pml_ucx_datatype.o .libs/pml_ucx_component.o -lucp -lrt -lm -lutil -O3 -pthread -pthread -Wl,-soname -Wl,mca_pml_ucx.so -o .libs/mca_pml_ucx.so
/usr/bin/ld: cannot find -lucp
collect2: error: ld returned 1 exit status
make[2]: *** [mca_pml_ucx.la] Error 1
make[2]: Leaving directory `/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi/mca/pml/ucx'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi'
make: *** [all-recursive] Error 1

My 2 cents,

> On Aug 6, 2018, at 5:04 PM, Jeff Squyres (jsquyres) via devel wrote:
>
> Open MPI v2.1.4rc1 has been pushed. It is likely going to be the last in the v2.1.x series (since v4.0.0 is now visible on the horizon). It is just a bunch of bug fixes that have accumulated since v2.1.3; nothing huge. We'll encourage users who are still using the v2.1.x series to upgrade to this release; it should be a non-event for anyone who has already upgraded to the v3.0.x or v3.1.x series.
>
>    https://www.open-mpi.org/software/ompi/v2.1/
>
> If no serious-enough issues are found, we plan to release 2.1.4 this Friday, August 10, 2018.
>
> Please test!
>
> Bug fixes/minor improvements:
> - Disable the POWER 7/BE block in configure. Note that POWER 7/BE is still not a supported platform, but it is no longer automatically disabled. See https://github.com/open-mpi/ompi/issues/4349#issuecomment-374970982 for more information.
> - Fix bug with request-based one-sided MPI operations when using the "rdma" component.
> - Fix issue with large data structure in the TCP BTL causing problems in some environments. Thanks to @lgarithm for reporting the issue.
> - Minor Cygwin build fixes.
> - Minor fixes for the openib BTL:
>   - Support for the QLogic RoCE HCA
>   - Support for the Broadcom Cumulus RoCE HCA
>   - Enable support for HDR link speeds
> - Fix MPI_FINALIZED hang if invoked from an attribute destructor during the MPI_COMM_SELF destruction in MPI_FINALIZE. Thanks to @AndrewGaspar for reporting the issue.
> - Java fixes:
>   - Modernize Java framework detection, especially on OS X/MacOS. Thanks to Bryce Glover for reporting and submitting the fixes.
>   - Prefer "javac -h" to "javah" to support newer Java frameworks.
> - Fortran fixes:
>   - Use conformant dummy parameter names for Fortran bindings. Thanks to Themos Tsikas for reporting and submitting the fixes.
>   - Build the MPI_SIZEOF() interfaces in the "TKR"-style "mpi" module whenever possible. Thanks to Themos Tsikas for reporting the issue.
>   - Fix array of argv handling for the Fortran bindings of MPI_COMM_SPAWN_MULTIPLE (and its associated man page).
>   - Make NAG Fortran compiler support more robust in configure.
> - Disable the "pt2pt" one-sided MPI component when MPI_THREAD_MULTIPLE is used. This component is simply not safe in MPI_THREAD_MULTIPLE scenarios, and will not be fixed in the v2.1.x series.
> - Make the "external" hwloc component fail gracefully if it tries to use an hwloc v2.x.y installation. hwloc v2.x.y will not be supported in the Open MPI v2.1.x series.
> - Fix "vader" shared memory support for messages larger than 2GB. Thanks to Heiko Bauke for the bug report.
> - Configure fixes for external PMI directory detection. Thanks to Davide Vanzo for the report.
>
> --
> Jeff Squyres
> jsquy...@cisco.com
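For the `-lucp` link failure in the build log above, the linker is not being told where the UCX libraries live. A hedged sketch of how such a build is usually pointed at a non-system UCX install follows; the paths are placeholders, and `--with-ucx` is the standard Open MPI configure option for this.

```shell
# Sketch: point Open MPI's build at a UCX installed under $UCX_DIR (placeholder).
UCX_DIR=$HOME/opt/ucx            # wherever libucp.so was actually installed
./configure --with-ucx=$UCX_DIR  # lets configure emit the right -I/-L flags
make -j 8 && make install

# If configure finds UCX but the library is still missing at run time:
export LD_LIBRARY_PATH=$UCX_DIR/lib:$LD_LIBRARY_PATH
```

The key symptom to check first is whether `$UCX_DIR/lib/libucp.so` exists at all; with UCX built from master, a failed or partial `make install` on the UCX side produces exactly this linker error on the Open MPI side.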
Re: [OMPI devel] v3.1.1rc2 posted
Hi,

I do not see a 3.1.1rc2, but instead a final 3.1.1; is that normal? Anyway, I tested the 3.1.1 tarball on 8 Summit nodes with NetPIPE and IMB. I did not see any problems, and the performance numbers look good.

Thanks,

From: Barrett, Brian via devel
Date: July 1, 2018 at 6:31:26 PM EDT
To: Open MPI Developers
Cc: Barrett, Brian
Subject: [OMPI devel] v3.1.1rc2 posted

v3.1.1rc2 is posted at the usual place:

    https://www.open-mpi.org/software/ompi/v3.1/

Primary changes are some important UCX bug fixes and a forward compatibility fix in PMIx. We’re targeting a release on Friday, please test and send results before then.

Thanks,

Brian
Re: [OMPI devel] Open MPI 3.1.1rc1 posted
Hi,

Sorry for the slow feedback, but hopefully I now have what I need to give feedback in a more timely manner. I tested the RC on Summitdev at ORNL (https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/) by running a simple test (I will be running more tests for RCs in the near future), and everything seems to be fine.

Thanks,

> On Jun 14, 2018, at 8:05 PM, Barrett, Brian via devel wrote:
>
> The first release candidate for Open MPI 3.1.1 is posted at https://www.open-mpi.org/software/ompi/v3.1/. We’re a bit behind on getting it out the door, so appreciate any testing feedback you have.
>
> Brian
Re: [OMPI devel] About supporting HWLOC 2.0.x
I totally missed that PR before I sent my email, sorry. It pretty much covers all the modifications I made. :) Let me know if I can help in any way.

Thanks,

> On May 22, 2018, at 11:49 AM, Jeff Squyres (jsquyres) wrote:
>
> Geoffroy -- check out https://github.com/open-mpi/ompi/pull/4677.
>
> If all those issues are now moot, great. I really haven't followed up much since I made the initial PR; I'm happy to have someone else take it over...
>
>> On May 22, 2018, at 11:46 AM, Vallee, Geoffroy R. wrote:
>>
>> Hi,
>>
>> HWLOC 2.0.x support was brought up during the call. FYI, I am currently using (and still testing) hwloc 2.0.1 as an external library with master, and I did not face any major problems; I only had to fix minor things, mainly for putting the hwloc topology in a shared memory segment. Let me know if you want me to help with the effort of supporting HWLOC 2.0.x.
>>
>> Thanks,
>
> --
> Jeff Squyres
> jsquy...@cisco.com
[OMPI devel] About supporting HWLOC 2.0.x
Hi,

HWLOC 2.0.x support was brought up during the call. FYI, I am currently using (and still testing) hwloc 2.0.1 as an external library with master, and I did not face any major problems; I only had to fix minor things, mainly for putting the hwloc topology in a shared memory segment. Let me know if you want me to help with the effort of supporting HWLOC 2.0.x.

Thanks,
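For anyone wanting to reproduce the external-hwloc setup described above, the usual configure incantation looks roughly like this. The install prefix is a placeholder, and whether master accepts hwloc 2.0.x depends on the state of the support work discussed in this thread.

```shell
# Sketch: build Open MPI master against an external hwloc 2.0.1 install
# (HWLOC_DIR is a placeholder path).
HWLOC_DIR=$HOME/opt/hwloc-2.0.1
./configure --with-hwloc=$HWLOC_DIR
make -j 8 && make install

# Verify which hwloc the build actually picked up:
ompi_info | grep -i hwloc
```

The `--with-hwloc=<dir>` option is the standard way to select an external hwloc over the bundled copy; the `ompi_info` check confirms the external library was actually used.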
[OMPI devel] v3 branch - Problem with LSF
Hi,

I am running some tests on a PPC platform that uses LSF, and I see the following problem every time I launch a job that runs on 2 or more nodes:

[crest1:49998] *** Process received signal ***
[crest1:49998] Signal: Segmentation fault (11)
[crest1:49998] Signal code: Address not mapped (1)
[crest1:49998] Failing at address: 0x10061636d2d
[crest1:49998] [ 0] [0x10050478]
[crest1:49998] [ 1] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(+0x0)[0x109c]
[crest1:49998] [ 2] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so(straddr_isIPv4+0x44)[0x10e31b64]
[crest1:49998] [ 3] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_pjob_array2LIST+0x114)[0x10be79b4]
[crest1:49998] [ 4] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_pjob_constructList+0xfc)[0x10becdbc]
[crest1:49998] [ 5] /opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_launch+0x184)[0x10bed9c4]
[crest1:49998] [ 6] /ccs/home/gvh/install/crest/ompi3_llvm/lib/openmpi/mca_plm_lsf.so(+0x2660)[0x10992660]
[crest1:49998] [ 7] /ccs/home/gvh/install/crest/ompi3_llvm/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x940)[0x101f7730]
[crest1:49998] [ 8] /ccs/home/gvh/install/crest/ompi3_llvm/bin/mpiexec[0x100013e4]
[crest1:49998] [ 9] /ccs/home/gvh/install/crest/ompi3_llvm/bin/mpiexec[0x1f10]
[crest1:49998] [10] /lib64/power8/libc.so.6(+0x24580)[0x104f4580]
[crest1:49998] [11] /lib64/power8/libc.so.6(__libc_start_main+0xc4)[0x104f4774]
[crest1:49998] *** End of error message ***

I do not experience this problem with master, and the only difference in LSF support between master and the v3 branch is:

https://github.com/open-mpi/ompi/commit/92c996487c589ef8558a087ce2a9923dacdf0b99

If I can confirm that this change fixes the problem with the v3 branch, would you accept bringing it into the v3 branch?

Thanks,

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
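Testing whether that master commit really fixes the v3 branch could look like the sketch below. The commit hash is the one cited in the message; the branch name and a conflict-free cherry-pick are assumptions, and `--with-lsf` is the standard configure option for LSF support.

```shell
# Sketch: try the master-branch LSF fix on a local v3 checkout before
# requesting a backport. Branch name and clean apply are assumptions.
git clone https://github.com/open-mpi/ompi.git && cd ompi
git checkout v3.0.x     # or whichever v3 release branch is being tested
git cherry-pick 92c996487c589ef8558a087ce2a9923dacdf0b99
./autogen.pl && ./configure --with-lsf && make -j 8
```

If a 2-node `bsub`/`mpirun` job then launches cleanly, that is the confirmation the message asks for before a backport PR.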
Re: [OMPI devel] openmpi-2.0.0 - problems with ppc64, PGI and atomics
I just tried the fix and I can confirm that it fixes the problem. :) Thanks!!!

> On Sep 2, 2016, at 6:18 AM, Jeff Squyres (jsquyres) wrote:
>
> Issue filed at https://github.com/open-mpi/ompi/issues/2044.
>
> I asked Nathan and Sylvain to have a look.
>
>> On Sep 1, 2016, at 9:20 PM, Paul Hargrove wrote:
>>
>> I failed to get PGI 16.x working at all (licence issue, I think). So, I can neither confirm nor refute Geoffroy's reported problems.
>>
>> -Paul
>>
>> On Thu, Sep 1, 2016 at 6:15 PM, Vallee, Geoffroy R. wrote:
>> Interesting, I am having the problem with both 16.5 and 16.7.
>>
>> My 2 cents,
>>
>>> On Sep 1, 2016, at 8:25 PM, Paul Hargrove wrote:
>>>
>>> FWIW, I have not seen problems when testing the 2.0.1rc2 w/ PGI versions 12.10, 13.9, 14.3 or 15.9.
>>>
>>> I am going to test 2.0.2.rc3 ASAP and try to get PGI 16.4 coverage added in
>>>
>>> -Paul
>>>
>>> On Thu, Sep 1, 2016 at 12:48 PM, Jeff Squyres (jsquyres) wrote:
>>> Please send all the information on the build support page and open an issue at github. Thanks.
>>>
>>>> On Sep 1, 2016, at 3:41 PM, Vallee, Geoffroy R. wrote:
>>>>
>>>> This is indeed a little better, but it still creates a problem:
>>>>
>>>> CCLD opal_wrapper
>>>> ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function `_opal_progress_unregister':
>>>> /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:459: undefined reference to `opal_atomic_swap_64'
>>>> ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function `_opal_progress_register':
>>>> /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:398: undefined reference to `opal_atomic_swap_64'
>>>> make[2]: *** [opal_wrapper] Error 2
>>>> make[2]: Leaving directory `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/tools/wrappers'
>>>> make[1]: *** [all-recursive] Error 1
>>>> make[1]: Leaving directory `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal'
>>>> make: *** [all-recursive] Error 1
>>>>
>>>> $ nm libopen-pal.a | grep atomic
>>>>      U opal_atomic_cmpset_64
>>>> 0ab0 t opal_atomic_cmpset_ptr
>>>>      U opal_atomic_wmb
>>>> 0950 t opal_lifo_push_atomic
>>>>      U opal_atomic_cmpset_acq_32
>>>> 03d0 t opal_atomic_lock
>>>> 0450 t opal_atomic_unlock
>>>>      U opal_atomic_wmb
>>>>      U opal_atomic_ll_64
>>>>      U opal_atomic_sc_64
>>>>      U opal_atomic_wmb
>>>> 1010 t opal_lifo_pop_atomic
>>>>      U opal_atomic_cmpset_acq_32
>>>> 04b0 t opal_atomic_init
>>>> 04e0 t opal_atomic_lock
>>>>      U opal_atomic_mb
>>>> 0560 t opal_atomic_unlock
>>>>      U opal_atomic_wmb
>>>>      U opal_atomic_add_32
>>>>      U opal_atomic_cmpset_acq_32
>>>> 0820 t opal_atomic_init
>>>> 0850 t opal_atomic_lock
>>>>      U opal_atomic_sub_32
>>>>      U opal_atomic_swap_64
>>>> 08d0 t opal_atomic_unlock
>>>>      U opal_atomic_wmb
>>>> 0130 t opal_atomic_init
>>>> atomic-asm.o:
>>>> 0138 T opal_atomic_add_32
>>>> 0018 T opal_atomic_cmpset_32
>>>> 00c4 T opal_atomic_cmpset_64
>>>> 003c T opal_atomic_cmpset_acq_32
>>>> 00e8 T opal_atomic_cmpset_acq_64
>>>> 0070 T opal_atomic_cmpset_rel_32
>>>> 0110 T opal_atomic_cmpset_rel_64
>>>>      T opal_atomic_mb
>>>> 0008 T opal_atomic_rmb
>>>> 0150 T opal_atomic_sub_32
>>>> 0010 T opal_atomic_wmb
>>>> 2280 t mca_base_pvar_is_atomic
>>>>      U opal_atomic_ll_64
>>>>      U opal_atomic_sc_64
>>>>      U opal_atomic_wmb
>>>> 0
Re: [OMPI devel] openmpi-2.0.0 - problems with ppc64, PGI and atomics
Interesting, I am having the problem with both 16.5 and 16.7.

My 2 cents,

> On Sep 1, 2016, at 8:25 PM, Paul Hargrove wrote:
>
> FWIW, I have not seen problems when testing the 2.0.1rc2 w/ PGI versions 12.10, 13.9, 14.3 or 15.9.
>
> I am going to test 2.0.2.rc3 ASAP and try to get PGI 16.4 coverage added in
>
> -Paul
>
> On Thu, Sep 1, 2016 at 12:48 PM, Jeff Squyres (jsquyres) wrote:
> Please send all the information on the build support page and open an issue at github. Thanks.
>
> > On Sep 1, 2016, at 3:41 PM, Vallee, Geoffroy R. wrote:
> >
> > This is indeed a little better, but it still creates a problem:
> >
> > CCLD opal_wrapper
> > ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function `_opal_progress_unregister':
> > /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:459: undefined reference to `opal_atomic_swap_64'
> > ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function `_opal_progress_register':
> > /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:398: undefined reference to `opal_atomic_swap_64'
> > make[2]: *** [opal_wrapper] Error 2
> > make[2]: Leaving directory `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/tools/wrappers'
> > make[1]: *** [all-recursive] Error 1
> > make[1]: Leaving directory `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal'
> > make: *** [all-recursive] Error 1
> >
> > $ nm libopen-pal.a | grep atomic
> >      U opal_atomic_cmpset_64
> > 0ab0 t opal_atomic_cmpset_ptr
> >      U opal_atomic_wmb
> > 0950 t opal_lifo_push_atomic
> >      U opal_atomic_cmpset_acq_32
> > 03d0 t opal_atomic_lock
> > 0450 t opal_atomic_unlock
> >      U opal_atomic_wmb
> >      U opal_atomic_ll_64
> >      U opal_atomic_sc_64
> >      U opal_atomic_wmb
> > 1010 t opal_lifo_pop_atomic
> >      U opal_atomic_cmpset_acq_32
> > 04b0 t opal_atomic_init
> > 04e0 t opal_atomic_lock
> >      U opal_atomic_mb
> > 0560 t opal_atomic_unlock
> >      U opal_atomic_wmb
> >      U opal_atomic_add_32
> >      U opal_atomic_cmpset_acq_32
> > 0820 t opal_atomic_init
> > 0850 t opal_atomic_lock
> >      U opal_atomic_sub_32
> >      U opal_atomic_swap_64
> > 08d0 t opal_atomic_unlock
> >      U opal_atomic_wmb
> > 0130 t opal_atomic_init
> > atomic-asm.o:
> > 0138 T opal_atomic_add_32
> > 0018 T opal_atomic_cmpset_32
> > 00c4 T opal_atomic_cmpset_64
> > 003c T opal_atomic_cmpset_acq_32
> > 00e8 T opal_atomic_cmpset_acq_64
> > 0070 T opal_atomic_cmpset_rel_32
> > 0110 T opal_atomic_cmpset_rel_64
> >      T opal_atomic_mb
> > 0008 T opal_atomic_rmb
> > 0150 T opal_atomic_sub_32
> > 0010 T opal_atomic_wmb
> > 2280 t mca_base_pvar_is_atomic
> >      U opal_atomic_ll_64
> >      U opal_atomic_sc_64
> >      U opal_atomic_wmb
> > 0900 t opal_lifo_pop_atomic
> >
> >> On Sep 1, 2016, at 3:16 PM, Jeff Squyres (jsquyres) wrote:
> >>
> >> Can you try the latest v2.0.1 nightly snapshot tarball?
> >>
> >>> On Sep 1, 2016, at 2:56 PM, Vallee, Geoffroy R. wrote:
> >>>
> >>> Hello,
> >>>
> >>> I get the following problem when we compile OpenMPI-2.0.0 (it seems to be specific to 2.x; the problem did not appear with 1.10.x) with PGI:
> >>>
> >>> CCLD opal_wrapper
> >>> ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_sc_64'
> >>> ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_ll_64'
> >>> ../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_swap_64'
> >>> make[1]: *** [opal_wrapper] Error 2
> >>>
> >>> It is a little difficult for me to pinpoint the exact problem, but I can see the following:
> >>>
> >>> $ nm ./.libs/
Re: [OMPI devel] openmpi-2.0.0 - problems with ppc64, PGI and atomics
This is indeed a little better but still creating a problem: CCLD opal_wrapper ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function `_opal_progress_unregister': /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:459: undefined reference to `opal_atomic_swap_64' ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function `_opal_progress_register': /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:398: undefined reference to `opal_atomic_swap_64' make[2]: *** [opal_wrapper] Error 2 make[2]: Leaving directory `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/tools/wrappers' make[1]: *** [all-recursive] Error 1 make[1]: Leaving directory `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal' make: *** [all-recursive] Error 1 $ nm libopen-pal.a | grep atomic U opal_atomic_cmpset_64 0ab0 t opal_atomic_cmpset_ptr U opal_atomic_wmb 0950 t opal_lifo_push_atomic U opal_atomic_cmpset_acq_32 03d0 t opal_atomic_lock 0450 t opal_atomic_unlock U opal_atomic_wmb U opal_atomic_ll_64 U opal_atomic_sc_64 U opal_atomic_wmb 1010 t opal_lifo_pop_atomic U opal_atomic_cmpset_acq_32 04b0 t opal_atomic_init 04e0 t opal_atomic_lock U opal_atomic_mb 0560 t opal_atomic_unlock U opal_atomic_wmb U opal_atomic_add_32 U opal_atomic_cmpset_acq_32 0820 t opal_atomic_init 0850 t opal_atomic_lock U opal_atomic_sub_32 U opal_atomic_swap_64 08d0 t opal_atomic_unlock U opal_atomic_wmb 0130 t opal_atomic_init atomic-asm.o: 0138 T opal_atomic_add_32 0018 T opal_atomic_cmpset_32 00c4 T opal_atomic_cmpset_64 003c T opal_atomic_cmpset_acq_32 00e8 T opal_atomic_cmpset_acq_64 0070 T opal_atomic_cmpset_rel_32 0110 T opal_atomic_cmpset_rel_64 T opal_atomic_mb 0008 T opal_atomic_rmb 0150 T opal_atomic_sub_32 0010 T opal_atomic_wmb 2280 t mca_base_pvar_is_atomic U opal_atomic_ll_64 U opal_atomic_sc_64 U opal_atomic_wmb 0900 t opal_lifo_pop_atomic > On Sep 1, 2016, at 3:16 PM, Jeff Squyres (jsquyres) > wrote: > > Can you try the latest v2.0.1 nightly 
snapshot tarball? > > >> On Sep 1, 2016, at 2:56 PM, Vallee, Geoffroy R. wrote: >> >> Hello, >> >> I get the following problem when we compile OpenMPI-2.0.0 (it seems to be >> specific to 2.x; the problem did not appear with 1.10.x) with PGI: >> >> CCLD opal_wrapper >> ../../../opal/.libs/libopen-pal.so: undefined reference to >> `opal_atomic_sc_64' >> ../../../opal/.libs/libopen-pal.so: undefined reference to >> `opal_atomic_ll_64' >> ../../../opal/.libs/libopen-pal.so: undefined reference to >> `opal_atomic_swap_64' >> make[1]: *** [opal_wrapper] Error 2 >> >> It is a little for me to pin point the exact problem but i can see the >> following: >> >> $ nm ./.libs/libopen-pal.so | grep atomic >> 00026320 t 0017.plt_call.opal_atomic_add_32 >> 00026250 t 0017.plt_call.opal_atomic_cmpset_32 >> 00026780 t 0017.plt_call.opal_atomic_cmpset_64 >> 000280c0 t 0017.plt_call.opal_atomic_cmpset_acq_32 >> 00028ae0 t 0017.plt_call.opal_atomic_ll_64 >> 00027fe0 t 0017.plt_call.opal_atomic_mb >> 00027d50 t 0017.plt_call.opal_atomic_rmb >> 00028500 t 0017.plt_call.opal_atomic_sc_64 >> 00027670 t 0017.plt_call.opal_atomic_sub_32 >> 00026da0 t 0017.plt_call.opal_atomic_swap_64 >> 00027050 t 0017.plt_call.opal_atomic_wmb >> 0005e6a0 t mca_base_pvar_is_atomic >> 0004715c T opal_atomic_add_32 >> 0004703c T opal_atomic_cmpset_32 >> 000470e8 T opal_atomic_cmpset_64 >> 00047060 T opal_atomic_cmpset_acq_32 >> 0004710c T opal_atomic_cmpset_acq_64 >> 0002a610 t opal_atomic_cmpset_ptr >> 00047094 T opal_atomic_cmpset_rel_32 >> 00047134 T opal_atomic_cmpset_rel_64 >> 00032cc0 t opal_atomic_init >> 00033980 t opal_atomic_init >> 000396a0 t opal_atomic_init >>U opal_atomic_ll_64 >> 0002e460 t opal_atomic_lock >> 00032cf0 t opal_atomic_lock >> 000
[OMPI devel] openmpi-2.0.0 - problems with ppc64, PGI and atomics
Hello, I get the following problem when we compile OpenMPI-2.0.0 (it seems to be specific to 2.x; the problem did not appear with 1.10.x) with PGI:
CCLD opal_wrapper
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_sc_64'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_ll_64'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_swap_64'
make[1]: *** [opal_wrapper] Error 2
It is a little difficult for me to pinpoint the exact problem, but I can see the following:
$ nm ./.libs/libopen-pal.so | grep atomic
00026320 t 0017.plt_call.opal_atomic_add_32
00026250 t 0017.plt_call.opal_atomic_cmpset_32
00026780 t 0017.plt_call.opal_atomic_cmpset_64
000280c0 t 0017.plt_call.opal_atomic_cmpset_acq_32
00028ae0 t 0017.plt_call.opal_atomic_ll_64
00027fe0 t 0017.plt_call.opal_atomic_mb
00027d50 t 0017.plt_call.opal_atomic_rmb
00028500 t 0017.plt_call.opal_atomic_sc_64
00027670 t 0017.plt_call.opal_atomic_sub_32
00026da0 t 0017.plt_call.opal_atomic_swap_64
00027050 t 0017.plt_call.opal_atomic_wmb
0005e6a0 t mca_base_pvar_is_atomic
0004715c T opal_atomic_add_32
0004703c T opal_atomic_cmpset_32
000470e8 T opal_atomic_cmpset_64
00047060 T opal_atomic_cmpset_acq_32
0004710c T opal_atomic_cmpset_acq_64
0002a610 t opal_atomic_cmpset_ptr
00047094 T opal_atomic_cmpset_rel_32
00047134 T opal_atomic_cmpset_rel_64
00032cc0 t opal_atomic_init
00033980 t opal_atomic_init
000396a0 t opal_atomic_init
U opal_atomic_ll_64
0002e460 t opal_atomic_lock
00032cf0 t opal_atomic_lock
000339b0 t opal_atomic_lock
00047024 T opal_atomic_mb
0004702c T opal_atomic_rmb
U opal_atomic_sc_64
00047174 T opal_atomic_sub_32
U opal_atomic_swap_64
0002e4e0 t opal_atomic_unlock
00032d70 t opal_atomic_unlock
00033a30 t opal_atomic_unlock
00047034 T opal_atomic_wmb
000324d0 t opal_lifo_pop_atomic
000cc260 t opal_lifo_pop_atomic
0002a490 t opal_lifo_push_atomic
Any idea of how to fix the problem?
Thanks, ___ devel mailing list devel@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
Re: [OMPI devel] Open MPI face-to-face devel meeting: Jan/Feb 2016
I don't know if it would make sense to send someone (or even if someone is already supposed to go), but they are planning the next Open MPI developer meeting, and since we have so much going on with Open MPI, I thought it would make sense to forward this email. Thanks, From: "Jeff Squyres (jsquyres)" Sent: Thursday, October 8, 2015 3:47 PM To: Open MPI Developers List Subject: [OMPI devel] Open MPI face-to-face devel meeting: Jan/Feb 2016 Developers -- It's time to schedule our next face-to-face meeting. IBM has graciously offered the use of their facilities in Dallas, TX. Apparently hotels and the IBM facilities are within a taxi ride of the Dallas airport (i.e., much closer than the Cisco facilities). Right now, the facilities are fairly open through Jan and Feb, but they book up fast. So please answer this Doodle by the weekly webex next Tuesday (13 Oct 2015) so that we can pick a week: http://doodle.com/poll/fzr9vebqpsh37ii6 I (pseudo-)arbitrarily picked Tue-Thu meeting days, assuming that people would fly in on Monday, and we could start first thing on Tuesday morning. And then finish up by early afternoon Thursday so people could possibly fly out Thursday afternoon (or Friday, if that's not possible). -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/ ___ devel mailing list de...@open-mpi.org Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel Link to this post: http://www.open-mpi.org/community/lists/devel/2015/10/18150.php
Re: [OMPI devel] [OMPI svn] svn:open-mpi r31577 - trunk/ompi/mca/rte/base
Too bad all this happened so fast; otherwise ORNL would have at least participated in the call to understand what is going to happen (since we have an RTE module that we maintain). Any chance we could have a summary? Thanks, On May 1, 2014, at 2:40 PM, Ralph Castain wrote: > Just to report back to the list: the three of us discussed this at some > length, and decided we like George's proposed solution. Looks like a good > clean approach that provides flexibility for the future. So we will introduce > it when the BTLs move down to OPAL as (a) George already has it implemented > there, and (b) we don't really need it before then. > > Thanks George! > Ralph > > > On May 1, 2014, at 9:40 AM, Jeff Squyres (jsquyres) > wrote: > >> Done! >> >> On May 1, 2014, at 11:22 AM, George Bosilca wrote: >> >>> Apparently we are good today at 2PM EST. Fire-up the webex ;) >>> >>> George. >>> >>> On May 1, 2014, at 10:35 , Jeff Squyres (jsquyres) >>> wrote: >>> http://doodle.com/hhm4yyr76ipcxgk2 On May 1, 2014, at 10:25 AM, Ralph Castain wrote: > sure - might be faster that way :-) > > On May 1, 2014, at 6:59 AM, Jeff Squyres (jsquyres) > wrote: > >> Want to have a phone call/webex to discuss? >> >> >> On May 1, 2014, at 9:43 AM, Ralph Castain wrote: >> >>> The problem we'll have with BTLs in opal is going to revolve around >>> that ompi_process_name_t and will occur in a number of places. I've >>> been trying to grok George's statement about accessors and can't figure >>> out a clean way to make that work IF every RTE gets to define the >>> process name a different way. >>> >>> For example, suppose I define ompi_process_name_t to be a string. I can >>> hash the string down to an opal_identifier_t, but that is a >>> structureless 64-bit value - there is no concept of a jobid or vpid in >>> it. So if you now want to extract a jobid for that identifier, the only >>> way you can do it is to "up-call" back to the RTE to parse it.
>>> >>> This means that every RTE would have to initialize OPAL with a >>> registration of its opal_identifier parser function(s), which seems >>> like a really ugly solution. >>> >>> Maybe it is time to shift the process identifier down to the opal >>> layer? If we define opal_identifier_t to include the required >>> jobid/vpid, perhaps adding a void* so someone can put whatever they >>> want in it? >>> >>> Note that I'm not wild about extending the identifier size beyond >>> 64-bits as the memory footprint issue is growing in concern, and I >>> still haven't seen any real use-case proposed for extending it. >>> >>> >>> On May 1, 2014, at 3:41 AM, Jeff Squyres (jsquyres) >>> wrote: >>> On Apr 30, 2014, at 10:01 PM, George Bosilca wrote: > Why do you need the ompi_process_name_t? Isn’t the opal_identifier_t > enough to dig for the info of the peer into the opal_db? At the moment, I use the ompi_process_name_t for RML sends/receives in the usnic BTL. I know this will have to change when the BTLs move down to OPAL (when is that going to happen, BTW?). So my future use case may be somewhat moot. More detail === "Why does the usnic BTL use RML sends/receives?", you ask. The reason is rooted in the fact that the usnic BTL uses an unreliable, connectionless transport under the covers. We had some customers with network misconfigurations that resulted in usnic traffic not flowing properly (e.g., MTU mismatches in the network). But since we don't have a connection-oriented underlying API that will eventually timeout/fail to connect/etc. when there's a problem with the network configuration, we added a "connection validation" service in the usnic BTL that fires up in a thread in the local rank 0 on each server. This thread provides service to all the MPI processes on its server. In short: the service thread sends UDP pings and ACKs to peer service threads on other servers (upon demand/upon first send between servers) to verify network connectivity.
If the pings eventually fail/timeout (i.e., don't get ACKs back), the service thread does a show_help and kills the job. There's more details, but that's the gist of it. This basically gives us the ability to highlight problems in the network and kill the MPI job rather than spin infinitely while trying to deliver MPI/BTL messages to a peer that will never get there. Since this is really a server-to-server network connectiv
[OMPI devel] Direct references to ORTE from OMPI
Hi, There are a few direct references to ORTE symbols in the current OMPI layer instead of references through the RTE layer. The attached patches fix the problem. Thanks, proc_c.patch Description: proc_c.patch comm_c.patch Description: comm_c.patch
[OMPI devel] Problem with multiple identical entries in ~/.openmpi/mca-params.conf
Hi, I found a very unexpected behavior with r29217: % cat ~/.openmpi/mca-params.conf #pml_base_verbose=0 pml_base_verbose=0 % mpicc -o helloworld helloworld.c Then if I update the mca-params.conf to have two identical entries, I get segfaults: % cat ~/.openmpi/mca-params.conf pml_base_verbose=0 pml_base_verbose=0 % mpicc -o helloworld helloworld.c [node0:23157] *** Process received signal *** [node0:23157] Signal: Segmentation fault (11) [node0:23157] Signal code: Address not mapped (1) [node0:23157] Failing at address: 0x7f4812770100 ^C Note that the compilation hangs. Also note that I have the exact same problem when running an MPI application that was successfully compiled: % cat ~/.openmpi/mca-params.conf pml_base_verbose=0 #pml_base_verbose=0 % mpirun -np 2 ./helloworld Hello, World (node0) Hello, World (node0) % mpirun -np 2 ./helloworld Hello, World (node0) Hello, World (node0) [node0:23201] *** Process received signal *** [node0:23201] Signal: Segmentation fault (11) [node0:23201] Signal code: Address not mapped (1) [node0:23201] Failing at address: 0x7f5a8f632c80 [node0:23202] *** Process received signal *** [node0:23202] Signal: Segmentation fault (11) [node0:23202] Signal code: Address not mapped (1) [node0:23202] Failing at address: 0x7f1436605650 ^C[node0:23199] *** Process received signal *** [node0:23199] Signal: Segmentation fault (11) [node0:23199] Signal code: Address not mapped (1) [node0:23199] Failing at address: 0x7f9917dd55f0 The problem occurs during opal_finalize() when MCA tries to clean up some variables. Sorry I did not have the time to get a full trace. Best regards,
[OMPI devel] Patch for unnecessary use of a ORTE constant
Hi, A small patch that removes an unjustified use of an ORTE constant; the OPAL one should be used instead. Thanks, ompi_info_support.patch Description: ompi_info_support.patch
Re: [OMPI devel] Patch for the SM BTL - Remove explicit reference to ORTE data structures
Thanks, Ralph. And sorry for not including the rte/orte/rte_orte.h modification in my patch; I am not using ORTE at the moment. On Feb 22, 2013, at 12:49 PM, Ralph Castain wrote: > Hmm, well, that doesn't solve the problem either - we also have to typedef > ompi_local_rank_t. I've committed the complete fix. > > Thanks > Ralph > > > On Feb 22, 2013, at 9:15 AM, "Vallee, Geoffroy R." wrote: > >> Well apparently not… another try… sorry for the extra noise. >> >> >> >> >> On Feb 22, 2013, at 12:08 PM, "Vallee, Geoffroy R." >> wrote: >> >>> This patch will actually apply correctly, not the first one. Sorry about >>> that. >>> >>> >>> On Feb 22, 2013, at 11:57 AM, "Vallee, Geoffroy R." >>> wrote: >>> >>>> Hello, >>>> >>>> Some of the latest modifications to the SM BTL make a direct reference to >>>> ORTE instead of the equivalent at the OMPI level. >>>> >>>> The attached patch fixes that problem. >>>> >>>> Thanks,
Re: [OMPI devel] Patch for the SM BTL - Remove explicit reference to ORTE data structures
Well apparently not… another try… sorry for the extra noise. btl_sm_component_c.patch Description: btl_sm_component_c.patch On Feb 22, 2013, at 12:08 PM, "Vallee, Geoffroy R." wrote: > This patch will actually apply correctly, not the first one. Sorry about that. > > On Feb 22, 2013, at 11:57 AM, "Vallee, Geoffroy R." wrote: > >> Hello, >> >> Some of the latest modifications to the SM BTL make a direct reference to >> ORTE instead of the equivalent at the OMPI level. >> >> The attached patch fixes that problem. >> >> Thanks,
Re: [OMPI devel] Patch for the SM BTL - Remove explicit reference to ORTE data structures
This patch will actually apply correctly, not the first one. Sorry about that. btl_sm_component_c.patch Description: btl_sm_component_c.patch On Feb 22, 2013, at 11:57 AM, "Vallee, Geoffroy R." wrote: > Hello, > > Some of the latest modifications to the SM BTL make a direct reference to > ORTE instead of the equivalent at the OMPI level. > > The attached patch fixes that problem. > > Thanks,
[OMPI devel] Patch for the SM BTL - Remove explicit reference to ORTE data structures
Hello, Some of the latest modifications to the SM BTL make a direct reference to ORTE instead of the equivalent at the OMPI level. The attached patch fixes that problem. Thanks, btl_sm_component_c.patch Description: btl_sm_component_c.patch
[OMPI devel] ORCA - Another runtime supported
Hello, FYI, we just finished the implementation of an ORCA module supporting a runtime infrastructure developed at Oak Ridge National Laboratory. For this, we are currently using the version of ORCA available on the bitbucket branch: https://bitbucket.org/jjhursey/ompi-orca ORCA clearly makes the integration easier and more maintainable; we hope it will make its way back into trunk very soon. Thanks, -- Geoffroy Vallee, PhD Research Associate Oak Ridge National Laboratory