Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Marco Atzeri

On 12/14/2014 12:06 AM, Ralph Castain wrote:

> Hi folks
>
> I’ve rolled up the bug fixes so far, including the thread-multiple
> performance fix. So please give this one a whirl
>
> http://www.open-mpi.org/software/ompi/v1.8/
>
> Ralph



No regressions on 64-bit Cygwin.

The only FAIL is the usual one: atomic_cmpset_noinline.exe

I also tested the OSU benchmarks 4.4.1.
The only failing tests (as seen before) are
  mpi/pt2pt/osu_latency_mt.exe
  mpi/pt2pt/osu_multi_lat.exe

and I am not sure that I am running them correctly.
All the other tests pass:

./mpi/collective/osu_allgather.exe
./mpi/collective/osu_allgatherv.exe
./mpi/collective/osu_allreduce.exe
./mpi/collective/osu_alltoall.exe
./mpi/collective/osu_alltoallv.exe
./mpi/collective/osu_barrier.exe
./mpi/collective/osu_bcast.exe
./mpi/collective/osu_gather.exe
./mpi/collective/osu_gatherv.exe
./mpi/collective/osu_reduce.exe
./mpi/collective/osu_reduce_scatter.exe
./mpi/collective/osu_scatter.exe
./mpi/collective/osu_scatterv.exe
./mpi/one-sided/osu_acc_latency.exe
./mpi/one-sided/osu_cas_latency.exe
./mpi/one-sided/osu_fop_latency.exe
./mpi/one-sided/osu_get_acc_latency.exe
./mpi/one-sided/osu_get_bw.exe
./mpi/one-sided/osu_get_latency.exe
./mpi/one-sided/osu_put_bibw.exe
./mpi/one-sided/osu_put_bw.exe
./mpi/one-sided/osu_put_latency.exe
./mpi/pt2pt/osu_bibw.exe
./mpi/pt2pt/osu_bw.exe
./mpi/pt2pt/osu_latency.exe
./mpi/pt2pt/osu_mbw_mr.exe




Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Adrian Reber
1.8.4rc4 works without errors on my PSM based systems.

Adrian

On Sat, Dec 13, 2014 at 03:06:07PM -0800, Ralph Castain wrote:
> Hi folks
> 
> I’ve rolled up the bug fixes so far, including the thread-multiple 
> performance fix. So please give this one a whirl
> 
> http://www.open-mpi.org/software/ompi/v1.8/ 
> 
> 
> Ralph
> 

> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/12/16586.php


Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Paul Hargrove
On Sun, Dec 14, 2014 at 10:52 PM, Paul Hargrove wrote:
>
> Solaris-10/SPARC and "--enable-static --disable-shared" appears broken for
> C++ apps (but OK for C).
> I will report in more detail when I have more information.
>

First the good news:

The problem I was experiencing (with the Solaris Studio compilers) turned
out to be "pilot error".
I had added "-library=stlport4" to CXXFLAGS but neglected to add the same
in --with-wrapper-cxxflags.
Adding to both has always sort of bothered me, and this time it bit me.
Oddly, the problem didn't appear until I forced static libs.

Now the bad news:

By trying more variants on my Solaris platforms I was able to get TWO new
failure modes.
However, I have a fix for one.

1)
Still Solaris-10/SPARC with "--enable-static --disable-shared", but this
time with gcc-3.4.6.
With this configuration I get Bus Errors from "make check" that do not
occur without these configure options:

bash: line 5:  3141 Bus Error   (core dumped) ${dir}$tst
FAIL: position
bash: line 5:  3221 Bus Error   (core dumped) ${dir}$tst
FAIL: position_noncontig


Examining the core from the second failure:

t@1 (l@1) program terminated by signal BUS (invalid address alignment)
Current function is main
  208   opal_pack_debug = 0;
(dbx) print &opal_pack_debug
&opal_pack_debug = 0x10092e169


The problem seems to be that the tests declare this (and others) as an int,
but the opal headers say bool:

$ gegrep  -r '^extern .* opal_(pack|unpack|position)_debug' .
./test/datatype/position.c:extern int opal_unpack_debug;
./test/datatype/position.c:extern int opal_pack_debug;
./test/datatype/position.c:extern int opal_position_debug ;
./test/datatype/position_noncontig.c:extern int opal_unpack_debug;
./test/datatype/position_noncontig.c:extern int opal_pack_debug;
./test/datatype/position_noncontig.c:extern int opal_position_debug ;
./opal/datatype/opal_convertor_internal.h:extern bool opal_pack_debug;
./opal/datatype/opal_datatype_position.c:extern bool opal_position_debug;

The definition of opal_unpack_debug is well hidden, but it is also "bool".

Correcting "int" to "bool" for those 3 vars in the two tests resolved this
problem for me.



2)
Now on my Solaris-11/x86-64 system with both GigE and IPoIB interfaces.
I am seeing the following when using the Solaris Studio compilers (GNU
compilers were fine):

$ mpirun -mca btl sm,self,openib -np 2 -host pcp-j-19,pcp-j-20 examples/ring_c
[pcp-j-20:16239] mca_oob_tcp_accept: accept() failed: Error 0 (0).

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    pcp-j-20
  Remote host:   172.18.0.120
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.



Notice the "Error 0 (0)" which means errno=0 and suggests that we've not
properly linked the thread-safe C libraries (recall that there is one
thread per interface and these hosts have two).
I see "-D_REENTRANT" in the output of "make".
However, the man pages suggest that one also needs "-mt=yes" in *both* the
compile and link steps (it defines _REENTRANT and links the proper libs).

I hoped that I could resolve this failure by adding LDFLAGS=-mt=yes to the
configure command.
However, that didn't work.


-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Computer Languages & Systems Software (CLaSS) Group
Computer Science Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.8.4rc4 now out for testing

2014-12-15 Thread Paul Hargrove
My testing on 1.8.4rc4 is not quite done, but is getting close.
With two exceptions, so far all looks good to me on almost 60 different
platforms.

I've retested on my Solaris systems and saw none of the issues I had with
rc3.
The x86-64/Linux system with mtl:psm is no longer giving a SEGV at exit.

My QEMU-based Linux/ARM and Linux/MIPS testers were OK with rc3, but I've
not yet completed testing rc4 (too slow).

The "two exceptions":

#1
I *am* still manually passing --without-xpmem on the SGI UV.
If I don't do so then the build fails as described in
http://www.open-mpi.org/community/lists/devel/2014/12/16520.php

#2
Solaris-10/SPARC and "--enable-static --disable-shared" appears broken for
C++ apps (but OK for C).
I will report in more detail when I have more information.

-Paul

On Sat, Dec 13, 2014 at 3:06 PM, Ralph Castain wrote:
>
> Hi folks
>
> I've rolled up the bug fixes so far, including the thread-multiple
> performance fix. So please give this one a whirl
>
> http://www.open-mpi.org/software/ompi/v1.8/
>
> Ralph




[OMPI devel] 1.8.4rc4 now out for testing

2014-12-13 Thread Ralph Castain
Hi folks

I’ve rolled up the bug fixes so far, including the thread-multiple performance 
fix. So please give this one a whirl

http://www.open-mpi.org/software/ompi/v1.8/ 


Ralph