Hi,
I ran some tests on Summitdev here at ORNL:
- the UCX problem is solved and I get the expected results for the tests that I
am running (NetPIPE and IMB).
- without UCX:
* the performance numbers are below expectations, but at this point I
believe the slight deficiency is due to other users' load on other parts of
the system.
* I also hit the following problem while running IMB_EXT, and I now
realize I had the same problem with 2.4.1rc1 but did not catch it at
the time:
[summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
[summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
==== backtrace ====
2 0x0000000000073864 mxm_handle_error()
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
3 0x0000000000073fa4 mxm_error_signal_handler()
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
4 0x0000000000017b24 ompi_osc_rdma_component_query() osc_rdma_component.c:0
5 0x00000000000d4634 ompi_osc_base_select() ??:0
6 0x0000000000065e84 ompi_win_create() ??:0
7 0x00000000000a2488 PMPI_Win_create() ??:0
8 0x000000001000b28c IMB_window() ??:0
9 0x0000000010005764 IMB_init_buffers_iter() ??:0
10 0x0000000010001ef8 main() ??:0
11 0x0000000000024980 generic_start_main.isra.0() libc-start.c:0
12 0x0000000000024b74 __libc_start_main() ??:0
===================
==== backtrace ====
2 0x0000000000073864 mxm_handle_error()
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
3 0x0000000000073fa4 mxm_error_signal_handler()
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
4 0x0000000000017b24 ompi_osc_rdma_component_query() osc_rdma_component.c:0
5 0x00000000000d4634 ompi_osc_base_select() ??:0
6 0x0000000000065e84 ompi_win_create() ??:0
7 0x00000000000a2488 PMPI_Win_create() ??:0
8 0x000000001000b28c IMB_window() ??:0
9 0x0000000010005764 IMB_init_buffers_iter() ??:0
10 0x0000000010001ef8 main() ??:0
11 0x0000000000024980 generic_start_main.isra.0() libc-start.c:0
12 0x0000000000024b74 __libc_start_main() ??:0
===================
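For reference, the crash occurs inside ompi_osc_rdma_component_query() during one-sided window creation. A minimal sketch that exercises the same MPI_Win_create path as the benchmark (buffer size and names are illustrative, not taken from IMB; build with mpicc and run under mpirun):

```c
/* Minimal exercise of the call path in the backtrace:
 * MPI_Win_create -> ompi_osc_base_select -> osc component query.
 * The 1 MiB window size is arbitrary. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t len = 1 << 20;          /* arbitrary window size */
    void *buf = malloc(len);

    MPI_Win win;
    /* This call triggers osc component selection, which is where
     * the segfault is reported in the backtrace above. */
    MPI_Win_create(buf, (MPI_Aint)len, 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_free(&win);
    free(buf);

    if (rank == 0) printf("window created and freed\n");
    MPI_Finalize();
    return 0;
}
```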
FYI, the 2.x series is not important to me, so it can stay as is. I will move
on to testing 3.1.2rc1.
Thanks,
> On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel
> <[email protected]> wrote:
>
> Per our discussion over the weekend and on the weekly webex yesterday, we're
> releasing v2.1.5. There are only two changes:
>
> 1. A trivial link issue for UCX.
> 2. A fix for the vader BTL issue. This is how I described it in NEWS:
>
> - A subtle race condition bug was discovered in the "vader" BTL
> (shared memory communications) that, in rare instances, can cause
> MPI processes to crash or incorrectly classify (or effectively drop)
> an MPI message sent via shared memory. If you are using the "ob1"
> PML with "vader" for shared memory communication (note that vader is
> the default for shared memory communication with ob1), you need to
> upgrade to v2.1.5 to fix this issue. You may also upgrade to the
> following versions to fix this issue:
> - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
> series
> - Open MPI v3.1.2 (expected end of August, 2018) or later
>
> This vader fix was deemed serious enough to warrant a 2.1.5 release.
> This really will be the end of the 2.1.x series. Trust me; my name is Joe
> Isuzu.
>
> 2.1.5rc1 will be available from the usual location in a few minutes (the
> website will update in about 7 minutes):
>
> https://www.open-mpi.org/software/ompi/v2.1/
>
> --
> Jeff Squyres
> [email protected]
>
> _______________________________________________
> devel mailing list
> [email protected]
> https://lists.open-mpi.org/mailman/listinfo/devel
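Since the quoted NEWS entry turns on which PML/BTL pair is in use, here is how that selection can be made explicit on the mpirun command line (a sketch; the benchmark binary name is a placeholder, and vader is already the default shared-memory BTL under ob1, so the flags only make the default selection explicit):

```shell
# Pin the ob1 PML and the vader shared-memory BTL; "self" is
# required for loopback sends. ./IMB-EXT is a placeholder binary.
mpirun --mca pml ob1 --mca btl vader,self -np 2 ./IMB-EXT
```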