I would assume so as well. The 2.x series is not really critical for these systems, especially since 3.x does not exhibit the problem, so I have no objection to leaving it as is.
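For anyone who wants to poke at the IMB_EXT crash without building IMB, here is a minimal sketch (my own code, not the IMB source) that goes through the same call path shown in the backtrace quoted further down -- MPI_Win_create() into ompi_osc_base_select() and ompi_osc_rdma_component_query(). The buffer size and names are arbitrary choices on my part:

/* win_create_min.c: minimal MPI_Win_create exercise, mirroring what
 * IMB_window() does before the one-sided benchmarks start.  Hypothetical
 * reproducer only; sizes and names are not taken from IMB. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose a small local buffer as an RMA window; the one-sided (osc)
     * component is selected inside this call. */
    const MPI_Aint size = 4096;
    void *base = malloc((size_t) size);

    MPI_Win win;
    MPI_Win_create(base, size, 1 /* disp_unit */, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    if (rank == 0) {
        printf("MPI_Win_create succeeded\n");
    }

    MPI_Win_free(&win);
    free(base);
    MPI_Finalize();
    return 0;
}

If it really is osc/rdma tripping over MXM, forcing a different one-sided component (e.g. "mpirun -np 2 --mca osc pt2pt ./win_create_min") might be a quick way to confirm; I have not tried that on Summitdev, so take it as a suggestion only.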
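Separately, since the NEWS entry quoted below says the vader fix is only present in v2.1.5 / v3.0.1 / v3.1.2 or later, here is a tiny sketch for recording at run time which Open MPI build a job actually linked against ("mpirun --version", or plain "ompi_info", reports the same information from the command line):

/* ompi_version_check.c: print the MPI library version string so the job
 * output records which Open MPI build was actually used. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    char version[MPI_MAX_LIBRARY_VERSION_STRING];
    int len = 0;
    MPI_Get_library_version(version, &len);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        /* For Open MPI this should start with something like
         * "Open MPI v2.1.5, ...", which is easy to grep for. */
        printf("%s\n", version);
    }

    MPI_Finalize();
    return 0;
}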
> On Aug 17, 2018, at 3:48 PM, Jeff Squyres (jsquyres) via devel <devel@lists.open-mpi.org> wrote:
>
> Thanks for the testing.
>
> I'm assuming the MXM failure has been around for a while, and the correct way to fix it is to upgrade to a newer Open MPI and/or use UCX.
>
>
>> On Aug 17, 2018, at 11:01 AM, Vallee, Geoffroy R. <valle...@ornl.gov> wrote:
>>
>> FYI, that segfault problem did not occur when I tested 3.1.2rc1.
>>
>> Thanks,
>>
>>> On Aug 17, 2018, at 10:28 AM, Pavel Shamis <pasharesea...@gmail.com> wrote:
>>>
>>> It looks to me like mxm related failure ?
>>>
>>> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R. <valle...@ornl.gov> wrote:
>>> Hi,
>>>
>>> I ran some tests on Summitdev here at ORNL:
>>> - the UCX problem is solved and I get the expected results for the tests that I am running (netpipe and IMB).
>>> - without UCX:
>>>   * the performance numbers are below what would be expected but I believe at this point that the slight performance deficiency is due to other users using other parts of the system.
>>>   * I also encountered the following problem while running IMB_EXT and I now realize that I had the same problem with 2.4.1rc1 but did not catch it at the time:
>>>
>>> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
>>> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>>> ==== backtrace ====
>>>  2 0x0000000000073864 mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>>  3 0x0000000000073fa4 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>>  4 0x0000000000017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>>  5 0x00000000000d4634 ompi_osc_base_select()  ??:0
>>>  6 0x0000000000065e84 ompi_win_create()  ??:0
>>>  7 0x00000000000a2488 PMPI_Win_create()  ??:0
>>>  8 0x000000001000b28c IMB_window()  ??:0
>>>  9 0x0000000010005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x0000000010001ef8 main()  ??:0
>>> 11 0x0000000000024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x0000000000024b74 __libc_start_main()  ??:0
>>> ===================
>>> ==== backtrace ====
>>>  2 0x0000000000073864 mxm_handle_error()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>>  3 0x0000000000073fa4 mxm_error_signal_handler()  /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>>  4 0x0000000000017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>>  5 0x00000000000d4634 ompi_osc_base_select()  ??:0
>>>  6 0x0000000000065e84 ompi_win_create()  ??:0
>>>  7 0x00000000000a2488 PMPI_Win_create()  ??:0
>>>  8 0x000000001000b28c IMB_window()  ??:0
>>>  9 0x0000000010005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x0000000010001ef8 main()  ??:0
>>> 11 0x0000000000024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x0000000000024b74 __libc_start_main()  ??:0
>>> ===================
>>>
>>> FYI, the 2.x series is not important to me so it can stay as is. I will move on testing 3.1.2rc1.
>>>
>>> Thanks,
>>>
>>>
>>>> On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel <devel@lists.open-mpi.org> wrote:
>>>>
>>>> Per our discussion over the weekend and on the weekly webex yesterday, we're releasing v2.1.5. There are only two changes:
>>>>
>>>> 1. A trivial link issue for UCX.
>>>> 2. A fix for the vader BTL issue.
>>>> This is how I described it in NEWS:
>>>>
>>>> - A subtle race condition bug was discovered in the "vader" BTL
>>>>   (shared memory communications) that, in rare instances, can cause
>>>>   MPI processes to crash or incorrectly classify (or effectively drop)
>>>>   an MPI message sent via shared memory.  If you are using the "ob1"
>>>>   PML with "vader" for shared memory communication (note that vader is
>>>>   the default for shared memory communication with ob1), you need to
>>>>   upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
>>>>   following versions to fix this issue:
>>>>   - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x series
>>>>   - Open MPI v3.1.2 (expected end of August, 2018) or later
>>>>
>>>> This vader fix was warranted serious enough to generate a 2.1.5 release. This really will be the end of the 2.1.x series. Trust me; my name is Joe Isuzu.
>>>>
>>>> 2.1.5rc1 will be available from the usual location in a few minutes (the website will update in about 7 minutes):
>>>>
>>>> https://www.open-mpi.org/software/ompi/v2.1/
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>
> --
> Jeff Squyres
> jsquy...@cisco.com

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel