I would assume so as well, and the 2.x series is not really critical for these 
systems, especially since 3.x does not have the problem. I have no problem 
ignoring it.


> On Aug 17, 2018, at 3:48 PM, Jeff Squyres (jsquyres) via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> Thanks for the testing.
> 
> I'm assuming the MXM failure has been around for a while, and the correct way 
> to fix it is to upgrade to a newer Open MPI and/or use UCX.
> 
> 
>> On Aug 17, 2018, at 11:01 AM, Vallee, Geoffroy R. <valle...@ornl.gov> wrote:
>> 
>> FYI, that segfault problem did not occur when I tested 3.1.2rc1.
>> 
>> Thanks,
>> 
>>> On Aug 17, 2018, at 10:28 AM, Pavel Shamis <pasharesea...@gmail.com> wrote:
>>> 
>>> It looks to me like an MXM-related failure?
>>> 
>>> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R. <valle...@ornl.gov> 
>>> wrote:
>>> Hi,
>>> 
>>> I ran some tests on Summitdev here at ORNL:
>>> - the UCX problem is solved and I get the expected results for the tests 
>>> that I am running (netpipe and IMB).
>>> - without UCX:
>>>       * the performance numbers are below what I would expect, but at this 
>>> point I believe the slight deficiency is due to other users' activity on 
>>> other parts of the system. 
>>>       * I also encountered the following problem while running IMB_EXT; I 
>>> now realize that I had the same problem with 2.1.4rc1 but did not catch it 
>>> at the time:
>>> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
>>> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>>> ==== backtrace ====
>>> 2 0x0000000000073864 mxm_handle_error()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>> 3 0x0000000000073fa4 mxm_error_signal_handler()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>> 4 0x0000000000017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>> 5 0x00000000000d4634 ompi_osc_base_select()  ??:0
>>> 6 0x0000000000065e84 ompi_win_create()  ??:0
>>> 7 0x00000000000a2488 PMPI_Win_create()  ??:0
>>> 8 0x000000001000b28c IMB_window()  ??:0
>>> 9 0x0000000010005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x0000000010001ef8 main()  ??:0
>>> 11 0x0000000000024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x0000000000024b74 __libc_start_main()  ??:0
>>> ===================
>>> ==== backtrace ====
>>> 2 0x0000000000073864 mxm_handle_error()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>> 3 0x0000000000073fa4 mxm_error_signal_handler()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>> 4 0x0000000000017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>> 5 0x00000000000d4634 ompi_osc_base_select()  ??:0
>>> 6 0x0000000000065e84 ompi_win_create()  ??:0
>>> 7 0x00000000000a2488 PMPI_Win_create()  ??:0
>>> 8 0x000000001000b28c IMB_window()  ??:0
>>> 9 0x0000000010005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x0000000010001ef8 main()  ??:0
>>> 11 0x0000000000024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x0000000000024b74 __libc_start_main()  ??:0
>>> ===================
>>> 
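>>> For reference, the crash happens during one-sided (window) component
>>> selection: IMB_window() calls MPI_Win_create(), which runs
>>> ompi_osc_base_select(), and the segfault is raised while the "rdma"
>>> one-sided component is being queried. A minimal sketch of that code path
>>> (illustrative only, not taken from the IMB sources; the buffer size is
>>> arbitrary) looks like:
>>> 
>>>   #include <mpi.h>
>>>   #include <stdlib.h>
>>> 
>>>   int main(int argc, char **argv)
>>>   {
>>>       MPI_Init(&argc, &argv);
>>> 
>>>       int count = 1024;
>>>       int *buf = malloc(count * sizeof(int));
>>> 
>>>       /* Window creation triggers ompi_osc_base_select(), which queries
>>>        * the "rdma" one-sided component -- the frame where the backtrace
>>>        * shows the segfault. */
>>>       MPI_Win win;
>>>       MPI_Win_create(buf, count * sizeof(int), sizeof(int),
>>>                      MPI_INFO_NULL, MPI_COMM_WORLD, &win);
>>> 
>>>       MPI_Win_free(&win);
>>>       free(buf);
>>>       MPI_Finalize();
>>>       return 0;
>>>   }
>>> 
>>> Running that with two ranks on the same build should exercise the same
>>> selection path.
>>> 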
>>> FYI, the 2.x series is not important to me, so it can stay as is. I will 
>>> move on to testing 3.1.2rc1.
>>> 
>>> Thanks,
>>> 
>>> 
>>>> On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel 
>>>> <devel@lists.open-mpi.org> wrote:
>>>> 
>>>> Per our discussion over the weekend and on the weekly webex yesterday, 
>>>> we're releasing v2.1.5.  There are only two changes:
>>>> 
>>>> 1. A fix for a trivial UCX link issue.
>>>> 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
>>>> 
>>>> - A subtle race condition bug was discovered in the "vader" BTL
>>>> (shared memory communications) that, in rare instances, can cause
>>>> MPI processes to crash or incorrectly classify (or effectively drop)
>>>> an MPI message sent via shared memory.  If you are using the "ob1"
>>>> PML with "vader" for shared memory communication (note that vader is
>>>> the default for shared memory communication with ob1), you need to
>>>> upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
>>>> following versions to fix this issue:
>>>> - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
>>>>  series
>>>> - Open MPI v3.1.2 (expected end of August, 2018) or later
>>>> 
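>>>> To make the affected path concrete: any ordinary on-node point-to-point
>>>> message goes through vader when ob1 is the PML in use, so code as simple
>>>> as the following is exposed to the race (this is just an illustrative
>>>> sketch, not the actual reproducer; the message contents are arbitrary):
>>>> 
>>>>   #include <mpi.h>
>>>>   #include <stdio.h>
>>>> 
>>>>   int main(int argc, char **argv)
>>>>   {
>>>>       MPI_Init(&argc, &argv);
>>>> 
>>>>       int rank;
>>>>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>>>> 
>>>>       int msg = 42;
>>>>       if (rank == 0) {
>>>>           /* On-node sends like this transit the vader shared-memory BTL
>>>>            * under ob1. */
>>>>           MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
>>>>       } else if (rank == 1) {
>>>>           MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
>>>>                    MPI_STATUS_IGNORE);
>>>>           printf("rank 1 got %d\n", msg);
>>>>       }
>>>> 
>>>>       MPI_Finalize();
>>>>       return 0;
>>>>   }
>>>> 
>>>> Run with at least two ranks on a single node so the messages actually
>>>> stay in shared memory.
>>>> 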
>>>> This vader fix was deemed serious enough to warrant a 2.1.5 release.  This 
>>>> really will be the end of the 2.1.x series.  Trust me; my name is Joe Isuzu.
>>>> 
>>>> 2.1.5rc1 will be available from the usual location in a few minutes (the 
>>>> website will update in about 7 minutes):
>>>> 
>>>>  https://www.open-mpi.org/software/ompi/v2.1/
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> 
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
