Re: [OMPI devel] v2.1.5rc1 is out

2018-08-17 Thread Vallee, Geoffroy R.
I would assume so as well, and the 2.x series is not really critical for these 
systems, especially since 3.x does not exhibit the problem. I am fine with 
ignoring it.


> On Aug 17, 2018, at 3:48 PM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> Thanks for the testing.
> 
> I'm assuming the MXM failure has been around for a while, and the correct way 
> to fix it is to upgrade to a newer Open MPI and/or use UCX.
> 
> 
>> On Aug 17, 2018, at 11:01 AM, Vallee, Geoffroy R.  wrote:
>> 
>> FYI, that segfault problem did not occur when I tested 3.1.2rc1.
>> 
>> Thanks,
>> 
>>> On Aug 17, 2018, at 10:28 AM, Pavel Shamis  wrote:
>>> 
>>> It looks to me like an MXM-related failure?
>>> 
>>> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R.  
>>> wrote:
>>> Hi,
>>> 
>>> I ran some tests on Summitdev here at ORNL:
>>> - the UCX problem is solved and I get the expected results for the tests 
>>> that I am running (netpipe and IMB).
>>> - without UCX:
>>>   * the performance numbers are below what would be expected but I 
>>> believe at this point that the slight performance deficiency is due to 
>>> other users using other parts of the system. 
>>>   * I also encountered the following problem while running IMB_EXT and 
>>> I now realize that I had the same problem with 2.1.4rc1 but did not catch 
>>> it at the time:
>>> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
>>> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>>>  backtrace 
>>> 2 0x00073864 mxm_handle_error()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>> 3 0x00073fa4 mxm_error_signal_handler()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>> 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>> 5 0x000d4634 ompi_osc_base_select()  ??:0
>>> 6 0x00065e84 ompi_win_create()  ??:0
>>> 7 0x000a2488 PMPI_Win_create()  ??:0
>>> 8 0x1000b28c IMB_window()  ??:0
>>> 9 0x10005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x10001ef8 main()  ??:0
>>> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x00024b74 __libc_start_main()  ??:0
>>> ===
>>>  backtrace 
>>> 2 0x00073864 mxm_handle_error()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>>> 3 0x00073fa4 mxm_error_signal_handler()  
>>> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>>> 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>>> 5 0x000d4634 ompi_osc_base_select()  ??:0
>>> 6 0x00065e84 ompi_win_create()  ??:0
>>> 7 0x000a2488 PMPI_Win_create()  ??:0
>>> 8 0x1000b28c IMB_window()  ??:0
>>> 9 0x10005764 IMB_init_buffers_iter()  ??:0
>>> 10 0x10001ef8 main()  ??:0
>>> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
>>> 12 0x00024b74 __libc_start_main()  ??:0
>>> ===
>>> 
>>> FYI, the 2.x series is not important to me so it can stay as is. I will 
>>> move on to testing 3.1.2rc1.
>>> 
>>> Thanks,
>>> 
>>> 
>>>> On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel 
>>>>  wrote:
>>>> 
>>>> Per our discussion over the weekend and on the weekly webex yesterday, 
>>>> we're releasing v2.1.5.  There are only two changes:
>>>> 
>>>> 1. A trivial link issue for UCX.
>>>> 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
>>>> 
>>>> - A subtle race condition bug was discovered in the "vader" BTL
>>>> (shared memory communications) that, in rare instances, can cause
>>>> MPI processes to crash or incorrectly classify (or effectively drop)
>>>> an MPI message sent via shared memory.  If you are using the "ob1"
>>>> PML with "vader" for shared memory communication (note that vader is
>>>> the default for shared memory communication with ob1), you need to
>>>> upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
>>>> following versions to fix this issue:
>>>> - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
>>>>   series
>>>> - Open MPI v3.1.2 (expected end of August, 2018) or later

Re: [OMPI devel] v3.1.2rc1 is posted

2018-08-17 Thread Vallee, Geoffroy R.
Hi,

I tested the RC on Summitdev at ORNL and everything is looking fine.

Thanks,


> On Aug 15, 2018, at 6:16 PM, Barrett, Brian via devel 
>  wrote:
> 
> The first release candidate for the 3.1.2 release is posted at 
> https://www.open-mpi.org/software/ompi/v3.1/
> 
> Major changes include fixing the race condition in vader (the same one that 
> caused v2.1.5rc1 to be posted today) as well as:
> 
> - Assorted Portals 4.0 bug fixes.
> - Fix for possible data corruption in MPI_BSEND.
> - Move shared memory file for vader btl into /dev/shm on Linux.
> - Fix for MPI_ISCATTER/MPI_ISCATTERV Fortran interfaces with MPI_IN_PLACE.
> - Upgrade PMIx to v2.1.3.
> - Numerous One-sided bug fixes.
> - Fix for race condition in uGNI BTL.
> - Improve handling of large number of interfaces with TCP BTL.
> - Numerous UCX bug fixes.
> 
> 
> Our goal is to release 3.1.2 around the same time as 2.1.5 (hopefully end of 
> this week), so any testing is appreciated.
> 
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] v2.1.5rc1 is out

2018-08-17 Thread Vallee, Geoffroy R.
FYI, that segfault problem did not occur when I tested 3.1.2rc1.

Thanks,

> On Aug 17, 2018, at 10:28 AM, Pavel Shamis  wrote:
> 
> It looks to me like an MXM-related failure?
> 
> On Thu, Aug 16, 2018 at 1:51 PM Vallee, Geoffroy R.  wrote:
> Hi,
> 
> I ran some tests on Summitdev here at ORNL:
> - the UCX problem is solved and I get the expected results for the tests that 
> I am running (netpipe and IMB).
> - without UCX:
> * the performance numbers are below what would be expected but I 
> believe at this point that the slight performance deficiency is due to other 
> users using other parts of the system. 
> * I also encountered the following problem while running IMB_EXT and 
> I now realize that I had the same problem with 2.1.4rc1 but did not catch it 
> at the time:
> [summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
> [summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
>  backtrace 
>  2 0x00073864 mxm_handle_error()  
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()  
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
>  backtrace 
>  2 0x00073864 mxm_handle_error()  
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
>  3 0x00073fa4 mxm_error_signal_handler()  
> /var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
>  4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
>  5 0x000d4634 ompi_osc_base_select()  ??:0
>  6 0x00065e84 ompi_win_create()  ??:0
>  7 0x000a2488 PMPI_Win_create()  ??:0
>  8 0x1000b28c IMB_window()  ??:0
>  9 0x10005764 IMB_init_buffers_iter()  ??:0
> 10 0x10001ef8 main()  ??:0
> 11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
> 12 0x00024b74 __libc_start_main()  ??:0
> ===
> 
> FYI, the 2.x series is not important to me so it can stay as is. I will move 
> on to testing 3.1.2rc1.
> 
> Thanks,
> 
> 
> > On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel 
> >  wrote:
> > 
> > Per our discussion over the weekend and on the weekly webex yesterday, 
> > we're releasing v2.1.5.  There are only two changes:
> > 
> > 1. A trivial link issue for UCX.
> > 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
> > 
> > - A subtle race condition bug was discovered in the "vader" BTL
> >  (shared memory communications) that, in rare instances, can cause
> >  MPI processes to crash or incorrectly classify (or effectively drop)
> >  an MPI message sent via shared memory.  If you are using the "ob1"
> >  PML with "vader" for shared memory communication (note that vader is
> >  the default for shared memory communication with ob1), you need to
> >  upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
> >  following versions to fix this issue:
> >  - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
> >series
> >  - Open MPI v3.1.2 (expected end of August, 2018) or later
> > 
> > This vader fix was deemed serious enough to warrant a 2.1.5 release.  
> > This really will be the end of the 2.1.x series.  Trust me; my name is Joe 
> > Isuzu.
> > 
> > 2.1.5rc1 will be available from the usual location in a few minutes (the 
> > website will update in about 7 minutes):
> > 
> >https://www.open-mpi.org/software/ompi/v2.1/
> > 
> > -- 
> > Jeff Squyres
> > jsquy...@cisco.com
> > 
> > ___
> > devel mailing list
> > devel@lists.open-mpi.org
> > https://lists.open-mpi.org/mailman/listinfo/devel
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] v2.1.5rc1 is out

2018-08-16 Thread Vallee, Geoffroy R.
Hi,

I ran some tests on Summitdev here at ORNL:
- the UCX problem is solved and I get the expected results for the tests that I 
am running (netpipe and IMB).
- without UCX:
* the performance numbers are below what would be expected but I 
believe at this point that the slight performance deficiency is due to other 
users using other parts of the system. 
* I also encountered the following problem while running IMB_EXT and I 
now realize that I had the same problem with 2.1.4rc1 but did not catch it at 
the time:
[summitdev-login1:112517:0] Caught signal 11 (Segmentation fault)
[summitdev-r0c2n13:91094:0] Caught signal 11 (Segmentation fault)
 backtrace 
 2 0x00073864 mxm_handle_error()  
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
 3 0x00073fa4 mxm_error_signal_handler()  
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
 5 0x000d4634 ompi_osc_base_select()  ??:0
 6 0x00065e84 ompi_win_create()  ??:0
 7 0x000a2488 PMPI_Win_create()  ??:0
 8 0x1000b28c IMB_window()  ??:0
 9 0x10005764 IMB_init_buffers_iter()  ??:0
10 0x10001ef8 main()  ??:0
11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
12 0x00024b74 __libc_start_main()  ??:0
===
 backtrace 
 2 0x00073864 mxm_handle_error()  
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:641
 3 0x00073fa4 mxm_error_signal_handler()  
/var/tmp/OFED_topdir/BUILD/mxm-3.6.3102/src/mxm/util/debug/debug.c:616
 4 0x00017b24 ompi_osc_rdma_component_query()  osc_rdma_component.c:0
 5 0x000d4634 ompi_osc_base_select()  ??:0
 6 0x00065e84 ompi_win_create()  ??:0
 7 0x000a2488 PMPI_Win_create()  ??:0
 8 0x1000b28c IMB_window()  ??:0
 9 0x10005764 IMB_init_buffers_iter()  ??:0
10 0x10001ef8 main()  ??:0
11 0x00024980 generic_start_main.isra.0()  libc-start.c:0
12 0x00024b74 __libc_start_main()  ??:0
===
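
One hedged way to narrow this down would be to exclude the components that 
appear in the backtrace and re-run just that benchmark (the exact IMB-EXT 
invocation below is an assumption):

# Re-run only the Window benchmark while excluding the MXM MTL and the
# "rdma" one-sided component seen in the backtrace, to confirm the trigger.
$ mpirun -np 2 --mca mtl ^mxm --mca osc ^rdma ./IMB-EXT Window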

FYI, the 2.x series is not important to me so it can stay as is. I will move on 
to testing 3.1.2rc1.

Thanks,


> On Aug 15, 2018, at 6:07 PM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> Per our discussion over the weekend and on the weekly webex yesterday, we're 
> releasing v2.1.5.  There are only two changes:
> 
> 1. A trivial link issue for UCX.
> 2. A fix for the vader BTL issue.  This is how I described it in NEWS:
> 
> - A subtle race condition bug was discovered in the "vader" BTL
>  (shared memory communications) that, in rare instances, can cause
>  MPI processes to crash or incorrectly classify (or effectively drop)
>  an MPI message sent via shared memory.  If you are using the "ob1"
>  PML with "vader" for shared memory communication (note that vader is
>  the default for shared memory communication with ob1), you need to
>  upgrade to v2.1.5 to fix this issue.  You may also upgrade to the
>  following versions to fix this issue:
>  - Open MPI v3.0.1 (released March, 2018) or later in the v3.0.x
>series
>  - Open MPI v3.1.2 (expected end of August, 2018) or later
> 
> This vader fix was deemed serious enough to warrant a 2.1.5 release.  
> This really will be the end of the 2.1.x series.  Trust me; my name is Joe 
> Isuzu.
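
As a hedged illustration of checking whether a run really uses the ob1/vader 
combination described in the NEWS entry above (the application name and the 
grep pattern on the verbose output are assumptions):

# Either force the combination explicitly, or turn up the selection verbosity
# and look at which PML/BTL components get picked.
$ mpirun -np 2 --mca pml ob1 --mca btl vader,self ./a.out
$ mpirun -np 2 --mca pml_base_verbose 10 --mca btl_base_verbose 10 ./a.out 2>&1 | grep -i -e select -e vader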
> 
> 2.1.5rc1 will be available from the usual location in a few minutes (the 
> website will update in about 7 minutes):
> 
>https://www.open-mpi.org/software/ompi/v2.1/
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


Re: [OMPI devel] Open MPI v2.1.4rc1

2018-08-09 Thread Vallee, Geoffroy R.
Hi,

I tested on Summitdev here at ORNL. Here are my comments (I only have a 
limited set of data for Summitdev, so my feedback is somewhat limited):
- netpipe/mpi is showing slightly lower bandwidth than the 3.x series (I do 
not believe it is a problem).
- I am facing a problem with UCX; it is unclear to me whether it is relevant, 
since I am using UCX master and I do not know whether it is expected to work 
with OMPI v2.1.x. Note that I use the same tool for testing all releases of 
Open MPI and I never hit this problem before, keeping in mind that I have only 
tested the 3.x series so far.

make[2]: Entering directory 
`/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi/mca/pml/ucx'
/bin/sh ../../../../libtool  --tag=CC   --mode=link gcc -std=gnu99  -O3 
-DNDEBUG -finline-functions -fno-strict-aliasing -pthread -module 
-avoid-version  -o mca_pml_ucx.la -rpath 
/ccs/home/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_install/lib/openmpi
 pml_ucx.lo pml_ucx_request.lo pml_ucx_datatype.lo pml_ucx_component.lo -lucp  
-lrt -lm -lutil  
libtool: link: gcc -std=gnu99 -shared  -fPIC -DPIC  .libs/pml_ucx.o 
.libs/pml_ucx_request.o .libs/pml_ucx_datatype.o .libs/pml_ucx_component.o   
-lucp -lrt -lm -lutil  -O3 -pthread   -pthread -Wl,-soname -Wl,mca_pml_ucx.so 
-o .libs/mca_pml_ucx.so
/usr/bin/ld: cannot find -lucp
collect2: error: ld returned 1 exit status
make[2]: *** [mca_pml_ucx.la] Error 1
make[2]: Leaving directory 
`/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi/mca/pml/ucx'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
`/autofs/nccs-svm1_home1/gvh/.ompi-release-tester/scratch/summitdev/2.1.4rc1/scratch/UCX/ompi_build/ompi'
make: *** [all-recursive] Error 1
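
A hedged sketch of the usual checks for a "cannot find -lucp" failure (the UCX 
install prefix below is a placeholder):

# Verify that the UCX master install actually provides libucp, then point
# Open MPI's configure at that prefix so the linker can resolve -lucp.
$ ls $UCX_INSTALL/lib/libucp.so*
$ ./configure --with-ucx=$UCX_INSTALL ... && make -j 8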

My 2 cents,

> On Aug 6, 2018, at 5:04 PM, Jeff Squyres (jsquyres) via devel 
>  wrote:
> 
> Open MPI v2.1.4rc1 has been pushed.  It is likely going to be the last in the 
> v2.1.x series (since v4.0.0 is now visible on the horizon).  It is just a 
> bunch of bug fixes that have accumulated since v2.1.3; nothing huge.  We'll 
> encourage users who are still using the v2.1.x series to upgrade to this 
> release; it should be a non-event for anyone who has already upgraded to the 
> v3.0.x or v3.1.x series.
> 
>https://www.open-mpi.org/software/ompi/v2.1/
> 
> If no serious-enough issues are found, we plan to release 2.1.4 this Friday, 
> August 10, 2018.
> 
> Please test!
> 
> Bug fixes/minor improvements:
> - Disable the POWER 7/BE block in configure.  Note that POWER 7/BE is
>  still not a supported platform, but it is no longer automatically
>  disabled.  See
>  https://github.com/open-mpi/ompi/issues/4349#issuecomment-374970982
>  for more information.
> - Fix bug with request-based one-sided MPI operations when using the
>  "rdma" component.
> - Fix issue with large data structure in the TCP BTL causing problems
>  in some environments.  Thanks to @lgarithm for reporting the issue.
> - Minor Cygwin build fixes.
> - Minor fixes for the openib BTL:
>  - Support for the QLogic RoCE HCA
>  - Support for the Broadcom Cumulus RoCE HCA
>  - Enable support for HDR link speeds
> - Fix MPI_FINALIZED hang if invoked from an attribute destructor
>  during the MPI_COMM_SELF destruction in MPI_FINALIZE.  Thanks to
>  @AndrewGaspar for reporting the issue.
> - Java fixes:
>  - Modernize Java framework detection, especially on OS X/MacOS.
>Thanks to Bryce Glover for reporting and submitting the fixes.
>  - Prefer "javac -h" to "javah" to support newer Java frameworks.
> - Fortran fixes:
>  - Use conformant dummy parameter names for Fortran bindings.  Thanks
>to Themos Tsikas for reporting and submitting the fixes.
>  - Build the MPI_SIZEOF() interfaces in the "TKR"-style "mpi" module
>whenever possible.  Thanks to Themos Tsikas for reporting the
>issue.
>  - Fix array of argv handling for the Fortran bindings of
>MPI_COMM_SPAWN_MULTIPLE (and its associated man page).
>  - Make NAG Fortran compiler support more robust in configure.
> - Disable the "pt2pt" one-sided MPI component when MPI_THREAD_MULTIPLE
>  is used.  This component is simply not safe in MPI_THREAD_MULTIPLE
>  scenarios, and will not be fixed in the v2.1.x series.
> - Make the "external" hwloc component fail gracefully if it is tries
>  to use an hwloc v2.x.y installation.  hwloc v2.x.y will not be
>  supported in the Open MPI v2.1.x series.
> - Fix "vader" shared memory support for messages larger than 2GB.
>  Thanks to Heiko Bauke for the bug report.
> - Configure fixes for external PMI directory detection.  Thanks to
>  Davide Vanzo for the report.
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list

Re: [OMPI devel] v3.1.1rc2 posted

2018-07-02 Thread Vallee, Geoffroy R.
Hi,

I do not see a 3.1.1rc2, only a final 3.1.1; is that expected? Anyway, I 
tested the 3.1.1 tarball on 8 Summit nodes with NetPIPE and IMB. I did not see 
any problems, and the performance numbers look good.
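
For reference, a hedged sketch of the kind of launch used for these checks 
(binary names, rank counts, and mapping options are assumptions):

# Two-rank NetPIPE run spread across nodes, followed by an IMB-MPI1 PingPong run.
$ mpirun -np 2 --map-by node ./NPmpi
$ mpirun -np 16 --map-by node ./IMB-MPI1 PingPong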

Thanks




From: Barrett, Brian via devel 
Date: July 1, 2018 at 6:31:26 PM EDT
To: Open MPI Developers 
Cc: Barrett, Brian 
Subject: [OMPI devel] v3.1.1rc2 posted


v3.1.1rc2 is posted at the usual place: 
https://www.open-mpi.org/software/ompi/v3.1/

Primary changes are some important UCX bug fixes and a forward compatibility 
fix in PMIx.

We’re targeting a release on Friday, please test and send results before then.

Thanks,

Brian
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] Open MPI 3.1.1rc1 posted

2018-07-01 Thread Vallee, Geoffroy R.
Hi,

Sorry for the slow feedback, but hopefully I now have what I need to provide 
feedback in a more timely manner...

I tested the RC on Summitdev at ORNL 
(https://www.olcf.ornl.gov/for-users/system-user-guides/summitdev-quickstart-guide/)
 by running a simple test (I will be running more tests for RCs in the near 
future) and everything seems to be fine.

Thanks,

> On Jun 14, 2018, at 8:05 PM, Barrett, Brian via devel 
>  wrote:
> 
> The first release candidate for Open MPI 3.1.1 is posted at 
> https://www.open-mpi.org/software/ompi/v3.1/.  We’re a bit behind on getting 
> it out the door, so appreciate any testing feedback you have.
> 
> Brian
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel

Re: [OMPI devel] About supporting HWLOC 2.0.x

2018-05-23 Thread Vallee, Geoffroy R.
I totally missed that PR before I sent my email, sorry. It pretty much covers 
all the modifications I made. :) Let me know if I can help in any way.

Thanks,

> On May 22, 2018, at 11:49 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Geoffroy -- check out https://github.com/open-mpi/ompi/pull/4677.
> 
> If all those issues are now moot, great.  I really haven't followed up much 
> since I made the initial PR; I'm happy to have someone else take it over...
> 
> 
>> On May 22, 2018, at 11:46 AM, Vallee, Geoffroy R.  wrote:
>> 
>> Hi,
>> 
>> HWLOC 2.0.x support was brought up during the call. FYI, I am currently 
>> using (and still testing) hwloc 2.0.1 as an external library with master and 
>> I did not face any major problem; I only had to fix minor things, mainly for 
>> putting the HWLOC topology in a shared memory segment. Let me know if you 
>> want me to help with the effort of supporting HWLOC 2.0.x.
>> 
>> Thanks,
>> ___
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://lists.open-mpi.org/mailman/listinfo/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> 
> ___
> devel mailing list
> devel@lists.open-mpi.org
> https://lists.open-mpi.org/mailman/listinfo/devel
> 

___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


[OMPI devel] About supporting HWLOC 2.0.x

2018-05-22 Thread Vallee, Geoffroy R.
Hi,

HWLOC 2.0.x support was brought up during the call. FYI, I am currently using 
(and still testing) hwloc 2.0.1 as an external library with master and I did 
not face any major problem; I only had to fix minor things, mainly for putting 
the HWLOC topology in a shared memory segment. Let me know if you want me to 
help with the effort of supporting HWLOC 2.0.x.
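
A hedged sketch of the external-hwloc build being described (the hwloc install 
prefix is a placeholder):

# Build master against an external hwloc 2.0.1 install, then check which
# hwloc component ompi_info reports.
$ ./configure --with-hwloc=/opt/hwloc-2.0.1 ...
$ make -j 8 && make install
$ ompi_info | grep -i hwloc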

Thanks,
___
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel


[OMPI devel] v3 branch - Problem with LSF

2017-05-05 Thread Vallee, Geoffroy R.
Hi,

I am running some tests on a PPC platform that is using LSF and I see the 
following problem every time I launch a job that runs on 2 nodes or more:

[crest1:49998] *** Process received signal ***
[crest1:49998] Signal: Segmentation fault (11)
[crest1:49998] Signal code: Address not mapped (1)
[crest1:49998] Failing at address: 0x10061636d2d
[crest1:49998] [ 0] [0x10050478]
[crest1:49998] [ 1] 
/opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(+0x0)[0x109c]
[crest1:49998] [ 2] 
/opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/liblsf.so(straddr_isIPv4+0x44)[0x10e31b64]
[crest1:49998] [ 3] 
/opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_pjob_array2LIST+0x114)[0x10be79b4]
[crest1:49998] [ 4] 
/opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_pjob_constructList+0xfc)[0x10becdbc]
[crest1:49998] [ 5] 
/opt/lsf/9.1/linux3.10-glibc2.17-ppc64le/lib/libbat.so(lsb_launch+0x184)[0x10bed9c4]
[crest1:49998] [ 6] 
/ccs/home/gvh/install/crest/ompi3_llvm/lib/openmpi/mca_plm_lsf.so(+0x2660)[0x10992660]
[crest1:49998] [ 7] 
/ccs/home/gvh/install/crest/ompi3_llvm/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x940)[0x101f7730]
[crest1:49998] [ 8] 
/ccs/home/gvh/install/crest/ompi3_llvm/bin/mpiexec[0x100013e4]
[crest1:49998] [ 9] 
/ccs/home/gvh/install/crest/ompi3_llvm/bin/mpiexec[0x1f10]
[crest1:49998] [10] /lib64/power8/libc.so.6(+0x24580)[0x104f4580]
[crest1:49998] [11] 
/lib64/power8/libc.so.6(__libc_start_main+0xc4)[0x104f4774]
[crest1:49998] *** End of error message ***

I do not experience that problem with master, and the only difference in the 
LSF support between master and the v3 branch is:

https://github.com/open-mpi/ompi/commit/92c996487c589ef8558a087ce2a9923dacdf0b99

If I can confirm that this change fixes the problem with the v3 branch, would 
you guys be willing to bring it into the v3 branch?
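
A hedged sketch of how that confirmation could be done locally (the branch 
name and build options are assumptions):

# Cherry-pick the master commit referenced above onto a local checkout of the
# v3 branch, rebuild with LSF support, and re-test the 2-node launch.
$ git checkout v3.0.x
$ git cherry-pick 92c996487c589ef8558a087ce2a9923dacdf0b99
$ ./autogen.pl && ./configure --with-lsf && make -j 8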

Thanks,
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel

Re: [OMPI devel] openmpi-2.0.0 - problems with ppc64, PGI and atomics

2016-09-07 Thread Vallee, Geoffroy R.
I just tried the fix and I can confirm that it fixes the problem. :)

Thanks!!!

> On Sep 2, 2016, at 6:18 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Issue filed at https://github.com/open-mpi/ompi/issues/2044.
> 
> I asked Nathan and Sylvain to have a look.
> 
> 
>> On Sep 1, 2016, at 9:20 PM, Paul Hargrove  wrote:
>> 
>> I failed to get PGI 16.x working at all (licence issue, I think).
>> So, I can neither confirm nor refute Geoffroy's reported problems.
>> 
>> -Paul
>> 
>> On Thu, Sep 1, 2016 at 6:15 PM, Vallee, Geoffroy R.  
>> wrote:
>> Interesting, I am having the problem with both 16.5 and 16.7.
>> 
>> My 2 cents,
>> 
>>> On Sep 1, 2016, at 8:25 PM, Paul Hargrove  wrote:
>>> 
>>> FWIW I have not seen problems when testing the 2.0.1rc2 w/ PGI versions 
>>> 12.10, 13.9, 14.3 or 15.9.
>>> 
>>> I am going to test 2.0.2.rc3 ASAP and try to get PGI 16.4 coverage added in
>>> 
>>> -Paul
>>> 
>>> On Thu, Sep 1, 2016 at 12:48 PM, Jeff Squyres (jsquyres) 
>>>  wrote:
>>> Please send all the information on the build support page and open an issue 
>>> at github.  Thanks.
>>> 
>>> 
>>>> On Sep 1, 2016, at 3:41 PM, Vallee, Geoffroy R.  wrote:
>>>> 
>>>> This is indeed a little better but still creating a problem:
>>>> 
>>>> CCLD opal_wrapper
>>>> ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function 
>>>> `_opal_progress_unregister':
>>>> /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:459:
>>>>  undefined reference to `opal_atomic_swap_64'
>>>> ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function 
>>>> `_opal_progress_register':
>>>> /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:398:
>>>>  undefined reference to `opal_atomic_swap_64'
>>>> make[2]: *** [opal_wrapper] Error 2
>>>> make[2]: Leaving directory 
>>>> `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/tools/wrappers'
>>>> make[1]: *** [all-recursive] Error 1
>>>> make[1]: Leaving directory 
>>>> `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal'
>>>> make: *** [all-recursive] Error 1
>>>> 
>>>> $ nm libopen-pal.a  | grep atomic
>>>>U opal_atomic_cmpset_64
>>>> 0ab0 t opal_atomic_cmpset_ptr
>>>>U opal_atomic_wmb
>>>> 0950 t opal_lifo_push_atomic
>>>>U opal_atomic_cmpset_acq_32
>>>> 03d0 t opal_atomic_lock
>>>> 0450 t opal_atomic_unlock
>>>>U opal_atomic_wmb
>>>>U opal_atomic_ll_64
>>>>U opal_atomic_sc_64
>>>>U opal_atomic_wmb
>>>> 1010 t opal_lifo_pop_atomic
>>>>U opal_atomic_cmpset_acq_32
>>>> 04b0 t opal_atomic_init
>>>> 04e0 t opal_atomic_lock
>>>>U opal_atomic_mb
>>>> 0560 t opal_atomic_unlock
>>>>U opal_atomic_wmb
>>>>U opal_atomic_add_32
>>>>U opal_atomic_cmpset_acq_32
>>>> 0820 t opal_atomic_init
>>>> 0850 t opal_atomic_lock
>>>>U opal_atomic_sub_32
>>>>U opal_atomic_swap_64
>>>> 08d0 t opal_atomic_unlock
>>>>U opal_atomic_wmb
>>>> 0130 t opal_atomic_init
>>>> atomic-asm.o:
>>>> 0138 T opal_atomic_add_32
>>>> 0018 T opal_atomic_cmpset_32
>>>> 00c4 T opal_atomic_cmpset_64
>>>> 003c T opal_atomic_cmpset_acq_32
>>>> 00e8 T opal_atomic_cmpset_acq_64
>>>> 0070 T opal_atomic_cmpset_rel_32
>>>> 0110 T opal_atomic_cmpset_rel_64
>>>>  T opal_atomic_mb
>>>> 0008 T opal_atomic_rmb
>>>> 0150 T opal_atomic_sub_32
>>>> 0010 T opal_atomic_wmb
>>>> 2280 t mca_base_pvar_is_atomic
>>>>U opal_atomic_ll_64
>>>>U opal_atomic_sc_64
>>>>U opal_atomic_wmb
>>>> 0

Re: [OMPI devel] openmpi-2.0.0 - problems with ppc64, PGI and atomics

2016-09-01 Thread Vallee, Geoffroy R.
Interesting, I am having the problem with both 16.5 and 16.7.

My 2 cents,

> On Sep 1, 2016, at 8:25 PM, Paul Hargrove  wrote:
> 
> FWIW I have not seen problems when testing the 2.0.1rc2 w/ PGI versions 
> 12.10, 13.9, 14.3 or 15.9.
> 
> I am going to test 2.0.2.rc3 ASAP and try to get PGI 16.4 coverage added in
> 
> -Paul
> 
> On Thu, Sep 1, 2016 at 12:48 PM, Jeff Squyres (jsquyres)  
> wrote:
> Please send all the information on the build support page and open an issue 
> at github.  Thanks.
> 
> 
> > On Sep 1, 2016, at 3:41 PM, Vallee, Geoffroy R.  wrote:
> >
> > This is indeed a little better but still creating a problem:
> >
> >  CCLD opal_wrapper
> > ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function 
> > `_opal_progress_unregister':
> > /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:459:
> >  undefined reference to `opal_atomic_swap_64'
> > ../../../opal/.libs/libopen-pal.a(opal_progress.o): In function 
> > `_opal_progress_register':
> > /autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:398:
> >  undefined reference to `opal_atomic_swap_64'
> > make[2]: *** [opal_wrapper] Error 2
> > make[2]: Leaving directory 
> > `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/tools/wrappers'
> > make[1]: *** [all-recursive] Error 1
> > make[1]: Leaving directory 
> > `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal'
> > make: *** [all-recursive] Error 1
> >
> > $ nm libopen-pal.a  | grep atomic
> > U opal_atomic_cmpset_64
> > 0ab0 t opal_atomic_cmpset_ptr
> > U opal_atomic_wmb
> > 0950 t opal_lifo_push_atomic
> > U opal_atomic_cmpset_acq_32
> > 03d0 t opal_atomic_lock
> > 0450 t opal_atomic_unlock
> > U opal_atomic_wmb
> > U opal_atomic_ll_64
> > U opal_atomic_sc_64
> > U opal_atomic_wmb
> > 1010 t opal_lifo_pop_atomic
> > U opal_atomic_cmpset_acq_32
> > 04b0 t opal_atomic_init
> > 04e0 t opal_atomic_lock
> > U opal_atomic_mb
> > 0560 t opal_atomic_unlock
> > U opal_atomic_wmb
> > U opal_atomic_add_32
> > U opal_atomic_cmpset_acq_32
> > 0820 t opal_atomic_init
> > 0850 t opal_atomic_lock
> > U opal_atomic_sub_32
> > U opal_atomic_swap_64
> > 08d0 t opal_atomic_unlock
> > U opal_atomic_wmb
> > 0130 t opal_atomic_init
> > atomic-asm.o:
> > 0138 T opal_atomic_add_32
> > 0018 T opal_atomic_cmpset_32
> > 00c4 T opal_atomic_cmpset_64
> > 003c T opal_atomic_cmpset_acq_32
> > 00e8 T opal_atomic_cmpset_acq_64
> > 0070 T opal_atomic_cmpset_rel_32
> > 0110 T opal_atomic_cmpset_rel_64
> >  T opal_atomic_mb
> > 0008 T opal_atomic_rmb
> > 0150 T opal_atomic_sub_32
> > 0010 T opal_atomic_wmb
> > 2280 t mca_base_pvar_is_atomic
> > U opal_atomic_ll_64
> > U opal_atomic_sc_64
> > U opal_atomic_wmb
> > 0900 t opal_lifo_pop_atomic
> >
> >> On Sep 1, 2016, at 3:16 PM, Jeff Squyres (jsquyres)  
> >> wrote:
> >>
> >> Can you try the latest v2.0.1 nightly snapshot tarball?
> >>
> >>
> >>> On Sep 1, 2016, at 2:56 PM, Vallee, Geoffroy R.  wrote:
> >>>
> >>> Hello,
> >>>
> >>> I get the following problem when we compile OpenMPI-2.0.0 (it seems to be 
> >>> specific to 2.x; the problem did not appear with 1.10.x) with PGI:
> >>>
> >>> CCLD opal_wrapper
> >>> ../../../opal/.libs/libopen-pal.so: undefined reference to 
> >>> `opal_atomic_sc_64'
> >>> ../../../opal/.libs/libopen-pal.so: undefined reference to 
> >>> `opal_atomic_ll_64'
> >>> ../../../opal/.libs/libopen-pal.so: undefined reference to 
> >>> `opal_atomic_swap_64'
> >>> make[1]: *** [opal_wrapper] Error 2
> >>>
>>>> It is a little difficult for me to pinpoint the exact problem, but I can 
>>>> see the following:
> >>>
> >>> $ nm ./.libs/

Re: [OMPI devel] openmpi-2.0.0 - problems with ppc64, PGI and atomics

2016-09-01 Thread Vallee, Geoffroy R.
This is indeed a little better but still creating a problem:

  CCLD opal_wrapper
../../../opal/.libs/libopen-pal.a(opal_progress.o): In function 
`_opal_progress_unregister':
/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:459: 
undefined reference to `opal_atomic_swap_64'
../../../opal/.libs/libopen-pal.a(opal_progress.o): In function 
`_opal_progress_register':
/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/runtime/opal_progress.c:398: 
undefined reference to `opal_atomic_swap_64'
make[2]: *** [opal_wrapper] Error 2
make[2]: Leaving directory 
`/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal/tools/wrappers'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/autofs/nccs-svm1_sw/gvh/src/openmpi-2.0.1rc2/opal'
make: *** [all-recursive] Error 1

$ nm libopen-pal.a  | grep atomic
 U opal_atomic_cmpset_64
0ab0 t opal_atomic_cmpset_ptr
 U opal_atomic_wmb
0950 t opal_lifo_push_atomic
 U opal_atomic_cmpset_acq_32
03d0 t opal_atomic_lock
0450 t opal_atomic_unlock
 U opal_atomic_wmb
 U opal_atomic_ll_64
 U opal_atomic_sc_64
 U opal_atomic_wmb
1010 t opal_lifo_pop_atomic
 U opal_atomic_cmpset_acq_32
04b0 t opal_atomic_init
04e0 t opal_atomic_lock
 U opal_atomic_mb
0560 t opal_atomic_unlock
 U opal_atomic_wmb
 U opal_atomic_add_32
 U opal_atomic_cmpset_acq_32
0820 t opal_atomic_init
0850 t opal_atomic_lock
 U opal_atomic_sub_32
 U opal_atomic_swap_64
08d0 t opal_atomic_unlock
 U opal_atomic_wmb
0130 t opal_atomic_init
atomic-asm.o:
0138 T opal_atomic_add_32
0018 T opal_atomic_cmpset_32
00c4 T opal_atomic_cmpset_64
003c T opal_atomic_cmpset_acq_32
00e8 T opal_atomic_cmpset_acq_64
0070 T opal_atomic_cmpset_rel_32
0110 T opal_atomic_cmpset_rel_64
 T opal_atomic_mb
0008 T opal_atomic_rmb
0150 T opal_atomic_sub_32
0010 T opal_atomic_wmb
2280 t mca_base_pvar_is_atomic
 U opal_atomic_ll_64
 U opal_atomic_sc_64
 U opal_atomic_wmb
0900 t opal_lifo_pop_atomic

> On Sep 1, 2016, at 3:16 PM, Jeff Squyres (jsquyres)  
> wrote:
> 
> Can you try the latest v2.0.1 nightly snapshot tarball?
> 
> 
>> On Sep 1, 2016, at 2:56 PM, Vallee, Geoffroy R.  wrote:
>> 
>> Hello,
>> 
>> I get the following problem when we compile OpenMPI-2.0.0 (it seems to be 
>> specific to 2.x; the problem did not appear with 1.10.x) with PGI:
>> 
>> CCLD opal_wrapper
>> ../../../opal/.libs/libopen-pal.so: undefined reference to 
>> `opal_atomic_sc_64'
>> ../../../opal/.libs/libopen-pal.so: undefined reference to 
>> `opal_atomic_ll_64'
>> ../../../opal/.libs/libopen-pal.so: undefined reference to 
>> `opal_atomic_swap_64'
>> make[1]: *** [opal_wrapper] Error 2
>> 
> > It is a little difficult for me to pinpoint the exact problem, but I can 
> > see the following:
>> 
>> $ nm ./.libs/libopen-pal.so | grep atomic
>> 00026320 t 0017.plt_call.opal_atomic_add_32
>> 00026250 t 0017.plt_call.opal_atomic_cmpset_32
>> 00026780 t 0017.plt_call.opal_atomic_cmpset_64
>> 000280c0 t 0017.plt_call.opal_atomic_cmpset_acq_32
>> 00028ae0 t 0017.plt_call.opal_atomic_ll_64
>> 00027fe0 t 0017.plt_call.opal_atomic_mb
>> 00027d50 t 0017.plt_call.opal_atomic_rmb
>> 00028500 t 0017.plt_call.opal_atomic_sc_64
>> 00027670 t 0017.plt_call.opal_atomic_sub_32
>> 00026da0 t 0017.plt_call.opal_atomic_swap_64
>> 00027050 t 0017.plt_call.opal_atomic_wmb
>> 0005e6a0 t mca_base_pvar_is_atomic
>> 0004715c T opal_atomic_add_32
>> 0004703c T opal_atomic_cmpset_32
>> 000470e8 T opal_atomic_cmpset_64
>> 00047060 T opal_atomic_cmpset_acq_32
>> 0004710c T opal_atomic_cmpset_acq_64
>> 0002a610 t opal_atomic_cmpset_ptr
>> 00047094 T opal_atomic_cmpset_rel_32
>> 00047134 T opal_atomic_cmpset_rel_64
>> 00032cc0 t opal_atomic_init
>> 00033980 t opal_atomic_init
>> 000396a0 t opal_atomic_init
>>U opal_atomic_ll_64
>> 0002e460 t opal_atomic_lock
>> 00032cf0 t opal_atomic_lock
>> 000

[OMPI devel] openmpi-2.0.0 - problems with ppc64, PGI and atomics

2016-09-01 Thread Vallee, Geoffroy R.
Hello,

I get the following problem when we compile OpenMPI-2.0.0 (it seems to be 
specific to 2.x; the problem did not appear with 1.10.x) with PGI:

CCLD opal_wrapper
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_sc_64'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_ll_64'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_atomic_swap_64'
make[1]: *** [opal_wrapper] Error 2

It is a little difficult for me to pinpoint the exact problem, but I can see 
the following:

$ nm ./.libs/libopen-pal.so | grep atomic
00026320 t 0017.plt_call.opal_atomic_add_32
00026250 t 0017.plt_call.opal_atomic_cmpset_32
00026780 t 0017.plt_call.opal_atomic_cmpset_64
000280c0 t 0017.plt_call.opal_atomic_cmpset_acq_32
00028ae0 t 0017.plt_call.opal_atomic_ll_64
00027fe0 t 0017.plt_call.opal_atomic_mb
00027d50 t 0017.plt_call.opal_atomic_rmb
00028500 t 0017.plt_call.opal_atomic_sc_64
00027670 t 0017.plt_call.opal_atomic_sub_32
00026da0 t 0017.plt_call.opal_atomic_swap_64
00027050 t 0017.plt_call.opal_atomic_wmb
0005e6a0 t mca_base_pvar_is_atomic
0004715c T opal_atomic_add_32
0004703c T opal_atomic_cmpset_32
000470e8 T opal_atomic_cmpset_64
00047060 T opal_atomic_cmpset_acq_32
0004710c T opal_atomic_cmpset_acq_64
0002a610 t opal_atomic_cmpset_ptr
00047094 T opal_atomic_cmpset_rel_32
00047134 T opal_atomic_cmpset_rel_64
00032cc0 t opal_atomic_init
00033980 t opal_atomic_init
000396a0 t opal_atomic_init
 U opal_atomic_ll_64
0002e460 t opal_atomic_lock
00032cf0 t opal_atomic_lock
000339b0 t opal_atomic_lock
00047024 T opal_atomic_mb
0004702c T opal_atomic_rmb
 U opal_atomic_sc_64
00047174 T opal_atomic_sub_32
 U opal_atomic_swap_64
0002e4e0 t opal_atomic_unlock
00032d70 t opal_atomic_unlock
00033a30 t opal_atomic_unlock
00047034 T opal_atomic_wmb
000324d0 t opal_lifo_pop_atomic
000cc260 t opal_lifo_pop_atomic
0002a490 t opal_lifo_push_atomic

Any idea of how to fix the problem?

Thanks,
___
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel


Re: [OMPI devel] Open MPI face-to-face devel meeting: Jan/Feb 2016

2015-10-08 Thread Vallee, Geoffroy R.
I don't know whether it would make sense to send someone (or whether someone 
is already supposed to go), but the next Open MPI developer meeting is being 
planned, and since we have so much going on with Open MPI, I thought it would 
make sense to forward this email.

Thanks,

From: "Jeff Squyres (jsquyres)"
Sent: Thursday, October 8, 2015 3:47 PM
To: Open MPI Developers List
Subject: [OMPI devel] Open MPI face-to-face devel meeting: Jan/Feb 2016


Developers --

It's time to schedule our next face-to-face meeting.  IBM has graciously 
offered the use of their facilities in Dallas, TX.  Apparently hotels and the 
IBM facilities are within a taxi ride of the Dallas airport (i.e., much closer 
than the Cisco facilities).

Right now, the facilities are fairly open through Jan and Feb, but they book up 
fast.  So please answer this Doodle by the weekly webex next Tuesday (13 Oct 
2015) so that we can pick a week:

http://doodle.com/poll/fzr9vebqpsh37ii6

I (pseudo-)arbitrarily picked Tue-Thu meeting days, assuming that people would 
fly in on Monday, and we could start first thing on Tuesday morning.  And then 
finish up by early afternoon Thursday so people could possibly fly out Thursday 
afternoon (or Friday, if that's not possible).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/

___
devel mailing list
de...@open-mpi.org
Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
Link to this post: 
http://www.open-mpi.org/community/lists/devel/2015/10/18150.php



Re: [OMPI devel] [OMPI svn] svn:open-mpi r31577 - trunk/ompi/mca/rte/base

2014-05-01 Thread Vallee, Geoffroy R.
Too bad all this happened so fast; otherwise ORNL would have at least 
participated in the call to understand what is going to happen (since we have 
an RTE module that we maintain). Any chance we could have a summary?

Thanks,


On May 1, 2014, at 2:40 PM, Ralph Castain  wrote:

> Just to report back to the list: the three of us discussed this at some 
> length, and decided we like George's proposed solution. Looks like a good 
> clean approach that provides flexibility for the future. So we will introduce 
> it when the BTLs move down to OPAL as (a) George already has it implemented 
> there, and (b) we don't really need it before then.
> 
> Thanks George!
> Ralph
> 
> 
> On May 1, 2014, at 9:40 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> Done!
>> 
>> On May 1, 2014, at 11:22 AM, George Bosilca  wrote:
>> 
>>> Apparently we are good today at 2PM EST. Fire-up the webex ;)
>>> 
>>> George.
>>> 
>>> On May 1, 2014, at 10:35 , Jeff Squyres (jsquyres)  
>>> wrote:
>>> 
 http://doodle.com/hhm4yyr76ipcxgk2
 
 
 On May 1, 2014, at 10:25 AM, Ralph Castain 
 wrote:
 
> sure - might be faster that way :-)
> 
> On May 1, 2014, at 6:59 AM, Jeff Squyres (jsquyres)  
> wrote:
> 
>> Want to have a phone call/webex to discuss?
>> 
>> 
>> On May 1, 2014, at 9:43 AM, Ralph Castain  wrote:
>> 
>>> The problem we'll have with BTLs in opal is going to revolve around 
>>> that ompi_process_name_t and will occur in a number of places. I've 
>>> been trying to grok George's statement about accessors and can't figure 
>>> out a clean way to make that work IF every RTE gets to define the 
>>> process name a different way.
>>> 
>>> For example, suppose I define ompi_process_name_t to be a string. I can 
>>> hash the string down to an opal_identifier_t, but that is a 
>>> structureless 64-bit value - there is no concept of a jobid or vpid in 
>>> it. So if you now want to extract a jobid for that identifier, the only 
>>> way you can do it is to "up-call" back to the RTE to parse it.
>>> 
>>> This means that every RTE would have to initialize OPAL with a 
>>> registration of its opal_identifier parser function(s), which seems 
>>> like a really ugly solution.
>>> 
>>> Maybe it is time to shift the process identifier down to the opal 
>>> layer? If we define opal_identifier_t to include the required 
>>> jobid/vpid, perhaps adding a void* so someone can put whatever they 
>>> want in it?
>>> 
>>> Note that I'm not wild about extending the identifier size beyond 
>>> 64-bits as the memory footprint issue is growing in concern, and I 
>>> still haven't seen any real use-case proposed for extending it.
>>> 
>>> 
>>> On May 1, 2014, at 3:41 AM, Jeff Squyres (jsquyres) 
>>>  wrote:
>>> 
 On Apr 30, 2014, at 10:01 PM, George Bosilca  
 wrote:
 
> Why do you need the ompi_process_name_t? Isn’t the opal_identifier_t 
> enough to dig for the info of the peer into the opal_db?
 
 
 At the moment, I use the ompi_process_name_t for RML sends/receives in 
 the usnic BTL.  I know this will have to change when the BTLs move 
 down to OPAL (when is that going to happen, BTW?).  So my future use 
 case may be somewhat moot.
 
 More detail
 ===
 
 "Why does the usnic BTL use RML sends/receives?", you ask.
 
 The reason is rooted in the fact that the usnic BTL uses an 
 unreliable, connectionless transport under the covers.  We had some 
 customers have network misconfigurations that resulted in usnic 
 traffic not flowing properly (e.g., MTU mismatches in the network).  
 But since we don't have a connection-oriented underlying API that will 
 eventually timeout/fail to connect/etc. when there's a problem with 
 the network configuration, we added a "connection validation" service 
 in the usnic BTL that fires up in a thread in the local rank 0 on each 
 server.  This thread provides service to all the MPI processes on its 
 server.
 
 In short: the service thread sends UDP pings and ACKs to peer service 
 threads on other servers (upon demand/upon first send between servers) 
 to verify network connectivity.  If the pings eventually fail/timeout 
 (i.e., don't get ACKs back), the service thread does a show_help and 
 kills the job.
 
 There's more details, but that's the gist of it.
 
 This basically gives us the ability to highlight problems in the 
 network and kill the MPI job rather than spin infinitely while trying 
 to deliver MPI/BTL messages to a peer that will never get there.
 
 Since this is really a server-to-server network connectiv

[OMPI devel] Direct references to ORTE from OMPI

2013-09-30 Thread Vallee, Geoffroy R.
Hi,

There are a few direct references to ORTE symbols in the current OMPI layer 
instead of references through the RTE layer. The attached patches fix the 
problem.
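
A hedged sketch of how such references can be spotted (excluding the 
ORTE-specific RTE component path is an assumption about where direct ORTE use 
is legitimate):

# Search the OMPI layer for direct orte_ symbols outside the rte/orte component,
# from the top of the source tree.
$ grep -rn --include='*.c' --include='*.h' 'orte_' ompi/ | grep -v 'ompi/mca/rte/orte'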

Thanks,



proc_c.patch
Description: proc_c.patch


comm_c.patch
Description: comm_c.patch


[OMPI devel] Problem with multiple identical entries in ~/.openmpi/mca-params.conf

2013-09-20 Thread Vallee, Geoffroy R.
Hi,

I found a very unexpected behavior with r29217:

% cat ~/.openmpi/mca-params.conf
#pml_base_verbose=0
pml_base_verbose=0

% mpicc -o helloworld helloworld.c

Then if I update mca-params.conf to have two identical entries, I get 
segfaults:

% cat ~/.openmpi/mca-params.conf   
pml_base_verbose=0
pml_base_verbose=0

% mpicc -o helloworld helloworld.c 
[node0:23157] *** Process received signal ***
[node0:23157] Signal: Segmentation fault (11)
[node0:23157] Signal code: Address not mapped (1)
[node0:23157] Failing at address: 0x7f4812770100
^C

Note that the compilation hangs. Also note that I have the exact same problem 
when running an MPI application that was successfully compiled:

% cat ~/.openmpi/mca-params.conf   
pml_base_verbose=0
#pml_base_verbose=0

% mpirun -np 2 ./helloworld
Hello, World (node0)
Hello, World (node0)

% mpirun -np 2 ./helloworld 
Hello, World (node0)
Hello, World (node0)
[node0:23201] *** Process received signal ***
[node0:23201] Signal: Segmentation fault (11)
[node0:23201] Signal code: Address not mapped (1)
[node0:23201] Failing at address: 0x7f5a8f632c80
[node0:23202] *** Process received signal ***
[node0:23202] Signal: Segmentation fault (11)
[node0:23202] Signal code: Address not mapped (1)
[node0:23202] Failing at address: 0x7f1436605650
^C[node0:23199] *** Process received signal ***
[node0:23199] Signal: Segmentation fault (11)
[node0:23199] Signal code: Address not mapped (1)
[node0:23199] Failing at address: 0x7f9917dd55f0

The problem occurs during opal_finalize() when MCA tries to clean up some 
variables. Sorry, I did not have the time to get a full trace.
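
A hedged diagnostic sketch (the ompi_info output format varies across 
versions) for checking how the parameter ends up being registered from the 
file:

# Ask ompi_info for the current value of pml_base_verbose and where it was set.
$ ompi_info --param pml base --level 9 | grep -A 2 pml_base_verbose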

Best regards,




[OMPI devel] Patch for unnecessary use of a ORTE constant

2013-04-25 Thread Vallee, Geoffroy R.
Hi,

Here is a small patch that removes the use of an ORTE constant that is not 
justified; the OPAL one should be used instead.

Thanks,



ompi_info_support.patch
Description: ompi_info_support.patch


Re: [OMPI devel] Patch for the SM BTL - Remove explicit reference to ORTE data structures

2013-02-22 Thread Vallee, Geoffroy R.
Thanks, Ralph. And sorry for not including the rte/orte/rte_orte.h modification 
in my patch; I am not using ORTE at the moment.


On Feb 22, 2013, at 12:49 PM, Ralph Castain  wrote:

> Hmm... well, that doesn't solve the problem either - we also have to typedef 
> ompi_local_rank_t. I've committed the complete fix.
> 
> Thanks
> Ralph
> 
> 
> On Feb 22, 2013, at 9:15 AM, "Vallee, Geoffroy R."  wrote:
> 
>> Well apparently not… another try… sorry for the extra noise.
>> 
>> 
>> 
>> 
>> On Feb 22, 2013, at 12:08 PM, "Vallee, Geoffroy R."  
>> wrote:
>> 
>>> This patch will actually apply correctly, not the first one. Sorry about 
>>> that.
>>> 
>>> 
>>> On Feb 22, 2013, at 11:57 AM, "Vallee, Geoffroy R."  
>>> wrote:
>>> 
>>>> Hello,
>>>> 
>>>> Some of the latest modifications to the SM BTL make a direct reference to 
>>>> ORTE instead of the equivalent at the OMPI level.
>>>> 
>>>> The attached patch fixes that problem.
>>>> 
>>>> Thanks,
>>>> 
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] Patch for the SM BTL - Remove explicit reference to ORTE data structures

2013-02-22 Thread Vallee, Geoffroy R.
Well apparently not… another try… sorry for the extra noise.



btl_sm_component_c.patch
Description: btl_sm_component_c.patch



On Feb 22, 2013, at 12:08 PM, "Vallee, Geoffroy R."  wrote:

> This patch will actually apply correctly, not the first one. Sorry about that.
> 
> 
> On Feb 22, 2013, at 11:57 AM, "Vallee, Geoffroy R."  wrote:
> 
>> Hello,
>> 
>> Some of the latest modifications to the SM BTL make a direct reference to 
>> ORTE instead of the equivalent at the OMPI level.
>> 
>> The attached patch fixes that problem.
>> 
>> Thanks,
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Patch for the SM BTL - Remove explicit reference to ORTE data structures

2013-02-22 Thread Vallee, Geoffroy R.
This patch will actually apply correctly, not the first one. Sorry about that.



btl_sm_component_c.patch
Description: btl_sm_component_c.patch

On Feb 22, 2013, at 11:57 AM, "Vallee, Geoffroy R."  wrote:

> Hello,
> 
> Some of the latest modifications to the SM BTL make a direct reference to 
> ORTE instead of the equivalent at the OMPI level.
> 
> The attached patch fixes that problem.
> 
> Thanks,
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] Patch for the SM BTL - Remove explicit reference to ORTE data structures

2013-02-22 Thread Vallee, Geoffroy R.
Hello,

Some of the latest modifications to the SM BTL make a direct reference to ORTE 
instead of the equivalent at the OMPI level.

The attached patch fixes that problem.

Thanks,



btl_sm_component_c.patch
Description: btl_sm_component_c.patch


[OMPI devel] ORCA - Another runtime supported

2012-08-22 Thread Vallee, Geoffroy R.
Hello,

FYI, we just finished the implementation of an ORCA module to support a 
runtime infrastructure developed at Oak Ridge National Laboratory.
For this, we are currently using the version of ORCA available on the bitbucket 
branch: https://bitbucket.org/jjhursey/ompi-orca

ORCA clearly makes the integration easier and more maintainable; we hope it 
will make its way back into trunk very soon.

Thanks,
-- 
Geoffroy Vallee, PhD
Research Associate
Oak Ridge National Laboratory