Re: [OMPI devel] Question about 'progress function'

2016-05-06 Thread dpchoudh .
George

Thanks for your help. But what should the progress function return, so that
the event is signalled? Right now I am returning a 1 when data has been
transmitted and 0 otherwise, but that does not seem to work. Also, please
keep in mind that the transport I am working on supports unreliable
datagrams only, so there is no ack from the recipient to wait for.

Thanks again
Durga

The surgeon general advises you to eat right, exercise regularly and quit
ageing.

On Thu, May 5, 2016 at 11:33 PM, George Bosilca  wrote:

> Durga,
>
> TCP doesn't need a specialized progress function because we are tied
> directly with libevent. In your case you should provide a BTL progress
> function, a function that will be called regularly at the end of the
> libevent base loop.
>
>   George.
>
>
> On Thu, May 5, 2016 at 11:30 PM, dpchoudh .  wrote:
>
>> Hi all
>>
>> Apologies for a 101 level question again, but here it is:
>>
>> A new BTL layer I am implementing hangs in MPI_Send(). Please keep in
>> mind that at this stage, I am simply desperate to make MPI data move
>> through this fabric in any way possible, so I have thrown all good
>> programming practice out of the window and in the process might have added
>> bugs.
>>
>> The test code basically has a single call to MPI_Send() with 8 bytes of
>> data, the smallest amount the HCA can DMA. I have a very simple
>> mca_btl_component_progress() method that returns 0 if called before
>> mca_btl_endpoint_send() and returns 1 if called after. I use a static
>> variable to keep track of whether endpoint_send() has been called.
>>
>> With this, the MPI process hangs with the following stack:
>>
>> (gdb) bt
>> #0  0x7f7518c60b7d in poll () from /lib64/libc.so.6
>> #1  0x7f75183e79f6 in poll_dispatch (base=0x19cf480,
>> tv=0x7f75177efe80) at poll.c:165
>> #2  0x7f75183df690 in opal_libevent2022_event_base_loop
>> (base=0x19cf480, flags=1) at event.c:1630
>> #3  0x7f75183613d4 in progress_engine (obj=0x19cedd8) at
>> runtime/opal_progress_threads.c:105
>> #4  0x7f7518f3ddf5 in start_thread () from /lib64/libpthread.so.0
>> #5  0x7f7518c6b1ad in clone () from /lib64/libc.so.6
>>
>> I am using code from master branch for this work.
>>
>> Obviously I am not doing the progress handling right, and I don't even
>> understand how it should work, as the TCP btl does not even provide a
>> component progress function.
>>
>> Any relevant pointer on how this should be done is highly appreciated.
>>
>> Thanks
>> Durga
>>
>>
>> The surgeon general advises you to eat right, exercise regularly and quit
>> ageing.
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2016/05/18919.php
>>


[OMPI devel] [2.0.0rc2] Solaris Studio 12.5-beta build failure (libtool, w/ patch)

2016-05-06 Thread Paul Hargrove
Disclaimer first:
Yes, I am testing a *beta* compiler but this is NOT about a compiler bug.
I leave it to the judgment of others whether my findings warrant any
action.


I am testing the 2.0.0rc2 tarball with the Oracle Solaris Studio 12.5-beta
for Linux.

With Studio 12.4 all is fine on the same system with identical configure
options.
The configure command options in both cases (with different $PATH):

--prefix=[...] --enable-debug --enable-mpi-cxx CC=cc CXX=CC FC=f95


With 12.5-beta (but not 12.4) I see the following failure while building
the fortran bindings:

/bin/sh ../../../../libtool  --tag=FC   --mode=link f95
-I../../../../ompi/include
-I/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u5b/openmpi-2.0.0rc2/ompi/include
-I../../../..
-I/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u5b/openmpi-2.0.0rc2
 -g -version-info 20:0:0   -o libmpi_usempi_ignore_tkr.la -rpath
/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u5b/INST/lib
mpi-ignore-tkr.lo
/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u5b/BLD/opal/
libopen-pal.la -lrt -lm -lutil

libtool: link: f95 -shared   .libs/mpi-ignore-tkr.o-rpath
/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u5b/BLD/opal/.libs
-rpath
/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u5b/INST/lib
/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u5b/BLD/opal/.libs/libopen-pal.so
-ldl -lnuma -lrt -lm -lutil  -g   -mt -soname
libmpi_usempi_ignore_tkr.so.20 -o .libs/libmpi_usempi_ignore_tkr.so.20.0.0

f90: Warning: Option -shared passed to ld, if ld is invoked, ignored
otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored otherwise
f90: Warning: Option -path passed to ld, if ld is invoked, ignored otherwise
f90: Warning: Option -soname passed to ld, if ld is invoked, ignored
otherwise
/usr/bin/ld: unrecognized option '-path'
/usr/bin/ld: use the --help option for usage information
make[2]: *** [libmpi_usempi_ignore_tkr.la] Error 2


My first observation is that f95 command line produced by libtool doesn't
look at all like the same step when using the 12.4 compiler:

libtool: link: f95 -G  -KPIC  .libs/mpi-ignore-tkr.o   -Qoption ld -rpath
-Qoption ld
/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u4/BLD/opal/.libs
-Qoption ld -rpath -Qoption ld
/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u4/INST/lib
/scratch/phargrov/OMPI/openmpi-2.0.0rc2-linux-x86_64-ss12u4/BLD/opal/.libs/libopen-pal.so
-ldl -lnuma -lrt -lm -lutil  -g   -mt -Qoption ld -soname -Qoption ld
libmpi_usempi_ignore_tkr.so.20 -o .libs/libmpi_usempi_ignore_tkr.so.20.0.0


So, my initial diagnosis was that this is most likely a libtool bug.
BUT, I also know that autogen.pl contains logic for "Patching configure for
Sun Studio Fortran version strings".
So, I recognized that Open MPI's autogen.pl might be where the fix belongs.

Continuing on that line of thought I discovered that the version strings for
the Fortran compiler have changed between 12.4 and 12.5-beta:

$ /opt/oracle/solarisstudio12.4/bin/f95 -V
f90: Sun Fortran 95 8.7 Linux_i386 2014/10/20
$ /opt/oracle/solarisstudio12.5-beta/bin/f95 -V
f90: Studio 12.5 Fortran 95 8.8 Linux_i386 Beta 2015/11/17


Notice that "Sun" is gone, and thus the patterns used by libtool no longer
match!
So, this *is* a libtool "issue", but can hardly be blamed on libtool.

FWIW: the C and C++ compiler still say "Sun" and match the patterns used by
libtool:

$ /opt/oracle/solarisstudio12.5-beta/bin/cc -V
cc: Studio 12.5 Sun C 5.14 Linux_i386 Beta 2015/11/17
$ /opt/oracle/solarisstudio12.5-beta/bin/CC -V
CC: Studio 12.5 Sun C++ 5.14 Linux_i386 Beta 2015/11/17
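
The mismatch can be reproduced outside of configure by checking the banners above against the glob patterns named in the patch below. This is only a sketch: Python's fnmatch stands in for the shell `case` matching that configure performs, and the names BANNERS, OLD_FORTRAN, NEW_FORTRAN and matches_any are made up for illustration.

```python
from fnmatch import fnmatchcase

# Version banners copied from the `-V` output quoted in this message.
BANNERS = {
    "f95 12.4": "f90: Sun Fortran 95 8.7 Linux_i386 2014/10/20",
    "f95 12.5-beta": "f90: Studio 12.5 Fortran 95 8.8 Linux_i386 Beta 2015/11/17",
    "cc 12.5-beta": "cc: Studio 12.5 Sun C 5.14 Linux_i386 Beta 2015/11/17",
    "CC 12.5-beta": "CC: Studio 12.5 Sun C++ 5.14 Linux_i386 Beta 2015/11/17",
}

# Pattern matched before the patch, and the alternatives after it.
OLD_FORTRAN = ["*Sun*Fortran*"]
NEW_FORTRAN = ["*Sun*Fortran*", "*Studio*Fortran*"]


def matches_any(banner, patterns):
    """Return True if the banner matches at least one glob pattern."""
    return any(fnmatchcase(banner, pattern) for pattern in patterns)


if __name__ == "__main__":
    for name, banner in BANNERS.items():
        print(name,
              "old:", matches_any(banner, OLD_FORTRAN),
              "new:", matches_any(banner, NEW_FORTRAN))
```

Under these assumptions the 12.5-beta f95 banner is only recognized once `*Studio*Fortran*` is added, while the cc and CC banners are not caught by either Fortran pattern.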


I checked the libtool git repo, and there is no upstream fix for this (Ralf
W., are you still on this list?)
So, for all that wind-up this posting comes down to a small addition to
autogen.pl:

--- a/autogen.pl
+++ b/autogen.pl
@@ -977,6 +977,12 @@ sub patch_autotools_output {
 $c =~ s/$search_string/$replace_string/;
 }

+# Oracle has apparently begun (as of 12.5-beta) removing the "Sun" branding.
+# So this patch (cumulative over the previous one) is required.
+push(@verbose_out, $indent_str . "Patching configure for Oracle Studio Fortran version strings\n");
+$c =~ s/\*Sun\*Fortran\*\)/*Sun*Fortran* | *Studio*Fortran*)/g;
+$c =~ s/\*Sun\\ F\*\)(.*\n\s+tmp_sharedflag=)/*Sun\\ F* | *Studio*Fortran*)$1/g;
+
 # See http://git.savannah.gnu.org/cgit/libtool.git/commit/?id=v2.2.6-201-g519bf91 for details
 # Note that this issue was fixed in LT 2.2.8, however most distros are still using 2.2.6b
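
For readers who want to try the two substitutions without running autogen.pl, here is a rough Python rendering of them. It is a sketch under assumptions: patch_configure is a made-up name, and SAMPLE is a hypothetical fragment shaped like a configure `case` arm, not text copied from a real configure script.

```python
import re


def patch_configure(text):
    """Apply Python equivalents of the two Perl substitutions in the patch."""
    # Add *Studio*Fortran* next to the existing *Sun*Fortran* pattern.
    text = re.sub(r"\*Sun\*Fortran\*\)",
                  lambda m: "*Sun*Fortran* | *Studio*Fortran*)",
                  text)
    # Extend the `*Sun\ F*)` arm, but only where the next line sets
    # tmp_sharedflag, as the Perl expression requires.
    text = re.sub(r"\*Sun\\ F\*\)(.*\n\s+tmp_sharedflag=)",
                  lambda m: "*Sun\\ F* | *Studio*Fortran*)" + m.group(1),
                  text)
    return text


# Hypothetical input, for illustration only.
SAMPLE = (
    "case `$CC -V 2>&1 | sed 5q` in\n"
    "*Sun\\ F*)\t\t# Sun Fortran 8.3\n"
    "  tmp_sharedflag='-G' ;;\n"
    "esac\n"
    "case `$CC -V 2>&1 | sed 5q` in\n"
    "*Sun\\ F* | *Sun*Fortran*)\n"
    "  lt_prog_compiler_pic='-KPIC' ;;\n"
    "esac\n"
)

if __name__ == "__main__":
    print(patch_configure(SAMPLE))
```

On this sample, applying the function a second time changes nothing, since neither search pattern matches the already patched arms.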


-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] Question about 'progress function'

2016-05-06 Thread Nathan Hjelm

The return code of your progress function should be related to the
activity (send, recv, put, get, etc completion) on your network. The
return is not really used right now but may be meaningful in the
future.

Your BTL signals progress through two mechanisms:

 1) Send completion is indicated by either your btl_send() function
 returning 1 (this indicates no calls to btl_progress() are needed and
 that the user buffer is no longer needed), your btl_sendi() function
 returning OPAL_SUCCESS, or you calling the send fragment's callback
 function. btl_send() is the minimum function needed but btl_sendi() can
 provide a faster path to putting a fragment on a network.

 2) Receive completion is indicated by calling a callback associated
 with a fragment's tag. This tag is supplied to btl_send() and
 btl_sendi() and is usually sent with the fragment data (usually inline
 with the data). A typical progress function polls the network and, on
 finding an incoming fragment, extracts the btl tag and calls the
 associated callback.

It is usually helpful to look at how other BTLs work but you can also
find quite a bit of information in opal/mca/btl/btl.h.
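
The two mechanisms above can be modelled in a few lines. This is a toy model in Python, not the Open MPI API: the class ToyBTL, its methods, and the loopback "wire" are all invented for illustration, and a real BTL would implement the C interfaces declared in opal/mca/btl/btl.h.

```python
from collections import deque


class ToyBTL:
    """Toy stand-in for a BTL module (hypothetical, loopback only)."""

    def __init__(self):
        self.recv_callbacks = {}      # tag -> callback(tag, payload)
        self.pending_sends = deque()  # (fragment, completion callback)
        self.wire = deque()           # fragments waiting to be "received"

    def register(self, tag, callback):
        """Associate a receive callback with a tag."""
        self.recv_callbacks[tag] = callback

    def send(self, tag, payload, on_complete, inline=True):
        """Return 1 if the send completed immediately, 0 if it is deferred."""
        frag = (tag, bytes(payload))  # copy, so the caller's buffer is free
        if inline:
            self.wire.append(frag)
            return 1                  # complete: no progress call required
        self.pending_sends.append((frag, on_complete))
        return 0                      # completion is reported via on_complete

    def progress(self):
        """Finish deferred sends, deliver incoming fragments, count events."""
        events = 0
        while self.pending_sends:
            frag, on_complete = self.pending_sends.popleft()
            self.wire.append(frag)
            on_complete(frag[0])      # send completion callback
            events += 1
        while self.wire:
            tag, payload = self.wire.popleft()
            self.recv_callbacks[tag](tag, payload)  # dispatch on the tag
            events += 1
        return events
```

In this model a deferred send only completes when progress() runs and invokes its callback, which matches the point that returning a value from the progress function does not by itself signal completion.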

-Nathan

On Fri, May 06, 2016 at 12:01:05AM -0400, dpchoudh . wrote:
>George
> 
>Thanks for your help. But what should the progress function return, so
>that the event is signalled? Right now I am returning a 1 when data has
>been transmitted and 0 otherwise, but that does not seem to work. Also,
>please keep in mind that the transport I am working on supports unreliable
>datagrams only, so there is no ack from the recipient to wait for.
> 
>Thanks again
>Durga


Re: [OMPI devel] [2.0.0rc2] build failures on OpenBSD-5.7 (romio)

2016-05-06 Thread Gilles Gouaillardet

Paul,


can you please give the following patch a try?
https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/1643.patch



Cheers,


Gilles


On 5/3/2016 2:21 PM, Paul Hargrove wrote:
This is NOT a new issue, but I wanted to mention it explicitly once
again since no progress has been made since I first reported the
problem in 1.8.6rc1 (about 1 year ago).


Though the directory name and line numbers have changed, the error 
reported in 
https://www.open-mpi.org/community/lists/devel/2015/05/17449.php is 
still present in the 2.0.0rc2 tarball, and prevents building on 
OpenBSD-5.7 unless one configures with --disable-io-romio:


libtool: compile:  gcc -std=gnu99 -DHAVE_CONFIG_H -I. 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio 
-I./adio/include -DOMPI_BUILDING=1 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/../../../../.. 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/../../../../../opal/include 
-I./../../../../../opal/include -I./../../../../../ompi/include 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/include 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/adio/include 
-I./include 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/include 
-I./mpi-io 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/mpi-io 
-I./adio/include 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/adio/include 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/opal/mca/hwloc/hwloc1112/hwloc/include 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/BLD/opal/mca/hwloc/hwloc1112/hwloc/include 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/opal/mca/event/libevent2022/libevent 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/opal/mca/event/libevent2022/libevent/include 
-I/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/BLD/opal/mca/event/libevent2022/libevent/include 
-g -finline-functions -fno-strict-aliasing -pthread -D__EXTENSIONS__ 
-DHAVE_ROMIOCONF_H -I./include -MT adio/common/ad_fstype.lo -MD -MP 
-MF adio/common/.deps/ad_fstype.Tpo -c 
/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c 
 -fPIC -DPIC -o adio/common/.libs/ad_fstype.o
/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c: 
In function 'ADIO_FileSysType_fncall':
/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c:358: 
error: 'struct statfs' has no member named 'f_type'
/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c:382: 
error: 'struct statfs' has no member named 'f_type'
/home/phargrov/OMPI/openmpi-2.0.0rc2-openbsd5-amd64/openmpi-2.0.0rc2/ompi/mca/io/romio314/romio/adio/common/ad_fstype.c:403: 
error: 'struct statfs' has no member named 'f_type'
*** Error 1 in ompi/mca/io/romio314/romio (Makefile:3548 
'adio/common/ad_fstype.lo')

*** Error 1 in ompi/mca/io/romio314/romio (Makefile:4409 'all-recursive')
*** Error 1 in ompi/mca/io/romio314 (Makefile:1954 'all-recursive')
*** Error 1 in ompi (Makefile:3352 'all-recursive')


-Paul





Re: [OMPI devel] [2.0.0rc2] build failures on OpenBSD-5.7 (romio)

2016-05-06 Thread Paul Hargrove
Gilles,

I am testing and will follow up in the PR.

-Paul

On Thu, May 5, 2016 at 11:02 PM, Gilles Gouaillardet 
wrote:

> Paul,
>
>
> can you please give a try to
> https://patch-diff.githubusercontent.com/raw/open-mpi/ompi/pull/1643.patch
> ?
>
>
> Cheers,
>
>
> Gilles
>


Re: [OMPI devel] [2.0.0rc2] xlc-13.1.0 ICE (hwloc)

2016-05-06 Thread Josh Hursey
Brice:
  Can you take a look at Paul's patch here:
  https://www.open-mpi.org/community/lists/devel/2016/05/18918.php

Thanks,
Josh

On Thu, May 5, 2016 at 4:28 PM, Jeff Squyres (jsquyres) 
wrote:

> On May 5, 2016, at 5:27 PM, Josh Hursey  wrote:
> >
> > Since this also happens with hwloc 1.11.3 standalone maybe hwloc folks
> > can take point on further investigation?
>
> I think Brice would love your assistance in figuring this out, since I'm
> guessing he doesn't have access to these platforms, either.  :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/


[OMPI devel] [v2.x] more "patcher" issues

2016-05-06 Thread Paul Hargrove
I am testing a tarball built from v2.x-dev-1410-g81e0924
This includes pull request #1128 in which Nathan addressed multiple
"patcher" issues.

However, I see the crash below in dlopen_test on a LITTLE-ENDIAN Power8
system.
This is happening when built with xlc "V13.1.2 (5725-C73, 5765-J08)", but not
with gcc on the same system.
So, I cannot conclusively assign blame to OpenMPI.

-Paul

Program terminated with signal SIGSEGV, Segmentation fault.

(gdb) where
#0  0x in ?? ()
#1  0x3fff897adb38 in intercept_munmap (start=0x3fff8967,
length=65536)
at
/home/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64el-xlc/openmpi-gitclone/opal/mca/memory/patcher/memory_patcher_component.c:155
#2  0x3fff8933bc80 in __GI__IO_setb () from /lib64/libc.so.6
#3  0x3fff89339528 in __GI__IO_file_close_it () from /lib64/libc.so.6
#4  0x3fff89327f74 in fclose@@GLIBC_2.17 () from /lib64/libc.so.6
#5  0x1f7c in do_test ()
at
/home/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64el-xlc/openmpi-gitclone/ompi/debuggers/dlopen_test.c:97
#6  0x100010e0 in main (argc=1, argv=0x3f332888)
at
/home/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64el-xlc/openmpi-gitclone/ompi/debuggers/dlopen_test.c:135

"start" is valid:
(gdb) print *(char*)0x3fff8967
$1 = 35 '#'

Frame 1:
155 opal_mem_hooks_release_hook (start, length, true);



Re: [OMPI devel] [v2.x] more "patcher" issues

2016-05-06 Thread George Bosilca
We are getting extremely frequent C++ application deadlocks with the new
patcher. We are still investigating.

  George.


On Fri, May 6, 2016 at 12:14 PM, Paul Hargrove  wrote:

> I am testing a tarball built from v2.x-dev-1410-g81e0924
> This includes pull request #1128 in which Nathan addressed multiple
> "patcher" issues.
>
> However, I see the crash below in dlopen_test on a LITTLE-ENDIAN Power8
> system.
> This is happening when built with "V13.1.2 (5725-C73, 5765-J08)", but not
> with gcc on the same system.
> So, I cannot conclusively assign blame to OpenMPI.
>
> -Paul


Re: [OMPI devel] [v2.x] more "patcher" issues

2016-05-06 Thread Paul Hargrove
BIG-endian PPC64 w/ xlc V13.1 experiences a nearly identical failure.
However, this time gdb appears to have been able to resolve frame #0 to a
PLT slot (instead of "??").

-Paul

#0  0x0fff8904ef88 in 0010.plt_call.opal_mem_hooks_release_hook+0 ()
   from
/gpfs-biou/phh1/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64-xlc-13.1/INST/lib/libopen-pal.so.20
#1  0x0fff8910b630 in intercept_munmap (start=0xfff88d2,
length=2097152)
at
/gpfs-biou/phh1/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64-xlc-13.1/openmpi-gitclone/opal/mca/memory/patcher/memory_patcher_component.c:155
#2  0x00800cc5ca80 in ._IO_setb () from /lib64/libc.so.6
#3  0x00800cc5b16c in ._IO_file_close_it () from /lib64/libc.so.6
#4  0x00800cc4a758 in .fclose () from /lib64/libc.so.6
#5  0x1f88 in do_test ()
at
/gpfs-biou/phh1/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64-xlc-13.1/openmpi-gitclone/ompi/debuggers/dlopen_test.c:97
#6  0x100010d8 in main (argc=1, argv=0x462f398)
at
/gpfs-biou/phh1/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-ppc64-xlc-13.1/openmpi-gitclone/ompi/debuggers/dlopen_test.c:135



On Fri, May 6, 2016 at 9:14 AM, Paul Hargrove  wrote:

> I am testing a tarball built from v2.x-dev-1410-g81e0924
> This includes pull request #1128 in which Nathan addressed multiple
> "patcher" issues.
>
> However, I see the crash below in dlopen_test on a LITTLE-ENDIAN Power8
> system.
> This is happening when built with "V13.1.2 (5725-C73, 5765-J08)", but not
> with gcc on the same system.
> So, I cannot conclusively assign blame to OpenMPI.
>
> -Paul


[OMPI devel] [v2.x] Harmless type conversion warnings from Clang

2016-05-06 Thread Paul Hargrove
I don't think any of the warnings below indicate errors.
However, each could probably be suppressed with an appropriate cast.

-Paul

/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang/openmpi-gitclone/opal/mca/memory/patcher/memory_patcher_component.c:370:34:
warning: passing 'const void *' to parameter of type 'void *' discards
qualifiers [-Wincompatible-pointer-types-discards-qualifiers]
opal_mem_hooks_release_hook (shmaddr, memory_patcher_get_shm_seg_size
(shmaddr), false);
 ^~~
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang/openmpi-gitclone/opal/mca/btl/openib/btl_openib_component.c:2124:21:
warning: implicit conversion from enumeration type
'btl_openib_receive_queues_source_t' to different enumeration type
'mca_base_var_source_t' [-Wenum-conversion]
BTL_OPENIB_RQ_SOURCE_DEVICE_INI;
^~~
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang/openmpi-gitclone/opal/mca/pmix/pmix112/pmix1_client.c:406:19:
warning: implicit conversion from enumeration type 'opal_pmix_scope_t' to
different enumeration type 'pmix_scope_t' [-Wenum-conversion]
rc = PMIx_Put(scope, val->key, &kv);
  ^
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang/openmpi-gitclone/ompi/mca/io/romio314/romio/adio/common/utils.c:97:3:
warning: passing 'const MPI_Aint *' (aka 'const long *') to parameter of
type 'MPI_Aint *' (aka 'long *') discards qualifiers
[-Wincompatible-pointer-types-discards-qualifiers]
array_of_displacements, oldtype, newtype);
^~



[OMPI devel] [v2.x] printf format warnings w/ -m32

2016-05-06 Thread Paul Hargrove
The 96 printf format warnings in the attachment come from a Linux/x86-64
system w/ Clang and "-m32".

Some of the warnings are "size_t" vs "unsigned long", which is harmless
since both are 32 bits.

However, there are several cases in sharedfp/sm where a 64-bit (long long)
format has a 32-bit (long) argument.
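
To illustrate why this second class is not harmless, here is a small model of what typically happens on 32-bit x86, where variadic arguments occupy consecutive 4-byte stack slots. It is a sketch, not a statement about what any given libc does (the C behaviour is formally undefined), and misread_as_long_long is a made-up helper name.

```python
import struct


def misread_as_long_long(first_arg, next_slot):
    """Model two adjacent 32-bit little-endian slots read as one 64-bit value."""
    slots = struct.pack("<ii", first_arg, next_slot)  # two 4-byte arguments
    (value,) = struct.unpack("<q", slots)             # what a %lld would consume
    return value


if __name__ == "__main__":
    # A 32-bit argument of 1 followed by an unrelated slot holding 2.
    print(misread_as_long_long(1, 2))
```

In this model the value read is correct only when the neighbouring slot happens to be zero; otherwise the unrelated bytes end up in the upper half of the printed number, and any later conversion reads from the wrong slot.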

-Paul


/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/datatype/opal_datatype_unpack.c:487:49:
 warning: format specifies type 'long' but the argument has type 'size_t' (aka 
'unsigned int') [-Wformat]
   iov_ptr, iov_len_local,
^
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/datatype/opal_datatype_unpack.c:488:62:
 warning: format specifies type 'long' but the argument has type 'int' 
[-Wformat]
   pConvertor->pBaseBuf, conv_ptr + 
pElem->elem.disp - pConvertor->pBaseBuf,
 
^~
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/datatype/opal_datatype_unpack.c:489:52:
 warning: format specifies type 'long' but the argument has type 'ptrdiff_t' 
(aka 'int') [-Wformat]
   count_desc, 
description[pos_desc].elem.extent,
   
^
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/datatype/opal_convertor_raw.c:73:52:
 warning: format specifies type 'unsigned long' but the argument has type 
'size_t' (aka 'unsigned int') [-Wformat]
   (void*)iov, *iov_count, *length ); );
~~~^~~
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/datatype/opal_datatype_unpack.c:487:49:
 warning: format specifies type 'long' but the argument has type 'size_t' (aka 
'unsigned int') [-Wformat]
   iov_ptr, iov_len_local,
^
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/datatype/opal_datatype_unpack.c:488:62:
 warning: format specifies type 'long' but the argument has type 'int' 
[-Wformat]
   pConvertor->pBaseBuf, conv_ptr + 
pElem->elem.disp - pConvertor->pBaseBuf,
 
^~
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/datatype/opal_datatype_unpack.c:489:52:
 warning: format specifies type 'long' but the argument has type 'ptrdiff_t' 
(aka 'int') [-Wformat]
   count_desc, 
description[pos_desc].elem.extent,
   
^
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/mca/base/mca_base_var.c:1955:58:
 warning: format specifies type 'unsigned long' but the argument has type 
'size_t' (aka 'unsigned int') [-Wformat]
ret = asprintf (value_string, "%" PRIsize_t, value->sizetval);
   ~~~   ^~~
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/mca/hwloc/hwloc1112/hwloc/src/topology-synthetic.c:84:69:
 warning: format specifies type 'unsigned long' but the argument has type 
'size_t' (aka 'unsigned int') [-Wformat]
  fprintf(stderr, "Failed to read synthetic index #%lu at '%s'\n", i, 
attr);
   ~~~ ^
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/mca/hwloc/hwloc1112/hwloc/src/topology-synthetic.c:92:76:
 warning: format specifies type 'unsigned long' but the argument has type 
'size_t' (aka 'unsigned int') [-Wformat]
fprintf(stderr, "Missing comma after synthetic index #%lu at 
'%s'\n", i, attr);
  ~~~   
  ^
--
/scratch/phargrov/OMPI/openmpi-v2.x-dev-1410-g81e0924-linux-x86_64-clang-m32/openmpi-gitclone/opal/mca/btl/openib/btl_openib_endpoint.c:115:40:
 warning: format specifies type 'unsigned long' but the argument has type 
'size_t' (aka 'unsigned int') [-Wformat]
   ib

Re: [OMPI devel] [PATCH] Fix for xlc-13.1.0 ICE (hwloc)

2016-05-06 Thread Brice Goglin
Thanks
I think I would be fine with that fix. Unfortunately I won't have good
internet access until Sunday night, so I won't be able to test anything
properly before then :/



On 06/05/2016 00:29, Paul Hargrove wrote:
> I have some good news:  I have a fix!!
>
> FWIW: I too can build w/ xlc 12.1 (also BG/Q).
> It is just the 13.1.0 on Power7 that crashes building hwloc.
> Meanwhile, 13.1.2 on Power8 little-endian does not crash (but is a
> different front-end than big-endian if I understand correctly).
>
> I started "bisecting" the file topology-xml-nolibxml.c and found that
> xlc is crashing on "__hwloc_attribute_may_alias".
> Simply disabling use of that attribute resolves the problem.
>
> So, here is the fix, which simply changes the check for this attribute
> to match the way in which hwloc uses it.
> It disqualifies the buggy compiler version(s) based on behavior,
> rather than us trying to list affected versions.
>
> --- config/hwloc_check_attributes.m4~   2016-05-05 17:18:10.380479303 -0500
> +++ config/hwloc_check_attributes.m4    2016-05-05 17:21:30.399799031 -0500
> @@ -322,9 +322,10 @@
>  # Attribute may_alias: No suitable cross-check available, that works for non-supporting compilers
>  # Ignored by intel-9.1.045 -- turn off with -wd1292
>  # Ignored by PGI-6.2.5; ignore not detected due to missing cross-check
> +# The test case is chosen to match hwloc's usage, and reproduces an xlc-13.1.0 bug.
>  #
>  _HWLOC_CHECK_SPECIFIC_ATTRIBUTE([may_alias],
> -[int * p_value __attribute__ ((__may_alias__));],
> +[struct { int i; } __attribute__ ((__may_alias__)) * p_value;],
>  [],
>  [])
>
>
> -Paul [proving that I am good for more than just *breaking* other
> people's software - I can fix things too]
>
> On Thu, May 5, 2016 at 2:28 PM, Jeff Squyres (jsquyres)
> <jsquy...@cisco.com> wrote:
>
> On May 5, 2016, at 5:27 PM, Josh Hursey wrote:
> >
> > Since this also happens with hwloc 1.11.3 standalone maybe hwloc
> folks can take point on further investigation?
>
> I think Brice would love your assistance in figuring this out,
> since I'm guessing he doesn't have access to these platforms,
> either.  :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com 
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org 
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2016/05/18917.php
>
>
>
>
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department   Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
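
As a rough illustration of the construct that the revised configure test exercises, the sketch below applies the attribute to a struct type and then accesses memory through a pointer to that type, which is the same shape as the new test case in the patch. The names aliased_word and float_bits are hypothetical, and the example assumes a GCC-compatible compiler, a 32-bit unsigned int, and an IEEE-754 float.

```c
#include <assert.h>

/* Same shape as the new configure test: the attribute sits on the struct
 * type, and objects are accessed through a pointer to that type. */
struct aliased_word { unsigned int u; } __attribute__ ((__may_alias__));

/* Hypothetical helper: reads the bit pattern of a float through the
 * may_alias struct pointer. */
unsigned int float_bits(const float *f)
{
    const struct aliased_word *p_value = (const struct aliased_word *) f;

    /* The reinterpretation only makes sense if the sizes agree. */
    assert(sizeof(float) == sizeof(unsigned int));
    return p_value->u;
}
```

A compiler that rejects or crashes on this form of the attribute would fail the check and have the attribute disabled, which is the behavior-based disqualification described in the quoted message.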



[OMPI devel] [2.0.0rc2] opal/mca/timer/aix?

2016-05-06 Thread Paul Hargrove
I see opal/mca/timer/aix is still around in 2.0.0rc2.
Does that mean Open MPI is expected to run on AIX, or is this directory an
orphan?

I have access to AIX-7.1.3 on Power7 h/w, and found that 2.0.0rc does NOT
build there out of the box.
Does anybody care?

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


[OMPI devel] [2.0.0rc] memory:patcher fragility

2016-05-06 Thread Paul Hargrove
I noticed that opal/mca/memory/patcher/memory_patcher_component.c includes
<sys/syscall.h> without ever checking (not even in the .m4 fragment) that
this header exists.

At the moment, AIX is the only O/S I've encountered that doesn't have a
sys/syscall.h.
However, I think the possibility of others needs to be considered.
My recommendation is that the .m4 disqualify the component if sys/syscall.h
does not exist.

I was actually surprised that on AIX memory:patcher was compiled despite
all of the "no" results in the following:

--- MCA component memory:patcher (m4 configuration macro, priority 41)
checking for MCA component memory:patcher compile mode... static
checking for __curbrk symbol... no
checking whether __mmap prototype exists... yes
checking whether __mmap symbol exists... no
checking whether __syscall prototype exists... no
checking whether __syscall symbol exists... no
checking linux/mman.h usability... no
checking linux/mman.h presence... no
checking for linux/mman.h... no
checking if MCA component memory:patcher can compile... yes


-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900