[OMPI devel] Paul's testing summary

2014-01-10 Thread Paul Hargrove
This is an attempt to summarize the status of the trunk and 1.7.4rc with
respect to my testing.

There are 6 issues that to the best of my knowledge have not yet been
conclusively closed.
There might still be others buried in my sea of test results.

1. opal/util/path.c
See thread beginning with
http://www.open-mpi.org/community/lists/devel/2014/01/13597.php
Jeff and I have fixed this in trunk, and Jeff CMRed it for 1.7.4.
CMR was committed to v1.7 (changeset 30256) and thus just made the v1.7
tarball tonight.
Closed for trunk.
Closure for v1.7 just depends on me to test.

2. oshmem_info reports oshmem:bindings:fort:yes unconditionally
See thread beginning with
http://www.open-mpi.org/community/lists/devel/2014/01/13616.php
and restarted in
http://www.open-mpi.org/community/lists/devel/2014/01/13677.php
Mike Dubman indicated he will fix this for trunk.
This does NOT apply to v1.7 (no oshmem).

3. configure refuses btl:verbs on Solaris
See thread beginning with
http://www.open-mpi.org/community/lists/devel/2014/01/13598.php
Jeff has indicated he will look into this one on trunk.
This does NOT apply to v1.7.

4. oob:tcp not using loopback interface for single-node runs
See thread beginning with
http://www.open-mpi.org/community/lists/devel/2014/01/13655.php
Ralph and I determined that the reported issue was due to the firewall on
my hosts blocking app-daemon connections.
Can work around via "-mca oob_tcp_if_include lo" (example command below).
Ralph *may* see about a way to use loopback by default, but probably not
prior to 1.7.5.
This issue is present in both trunk and v1.7.
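For reference, the workaround applied to a single-node ring_c run (the same
example program used elsewhere in this testing) looks like:

  mpirun -mca oob_tcp_if_include lo -np 2 examples/ring_c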

5. pgi-8 and pgi-9 fail building mpi_f08
See thread beginning with
http://www.open-mpi.org/community/lists/devel/2014/01/13651.php
Jeff is actively working to improve configure tests to disqualify these
compilers.
This issue as initially reported is present in v1.7.
In trunk the same issue is present for pgi-9, but is worse (configure
"Cannot continue") for pgi-8.

6. netbsd-amd64 "make install" failure
See thread beginning with
http://www.open-mpi.org/community/lists/devel/2013/12/13515.php
The issue does NOT appear on netbsd-i386 (reason unknown).
My attempts to autogen with the netbsd-supplied libtool turned up another
(now resolved) issue, but didn't fix this one.
Nobody has even commented on this issue.
This issue is present in both trunk and v1.7.

As far as I am concerned only #1 *must* be resolved for 1.7.4, and I am
going to do my part ASAP.
Items #2 and #3 are trunk-only.
Resolving #4 would be nice, but it has a simple workaround and is an issue
only on a "broken" host.
Resolving #5 would be great, but IMHO documenting these compilers as
unsupported for mpi_f08 would be sufficient.
Resolving #6 seems unlikely given the level of interest so far.


-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] trunk build failure on {Free,Net,Open}BSD

2014-01-10 Thread Paul Hargrove
Jeff and I iterated a bit off-list and opal/util/path.c in tonight's trunk
tarball (1.9a1r30255) works for all of my systems.
With the help of Jeff's recently-enhanced test/util/opal_path_nfs.c I was
able to verify that NFS mounts are now correctly identified on the *BSD
systems (and still correct on Linux, Mac OSX, and Solaris).

Marco,
  Can you please verify on Cygwin?

-Paul



On Fri, Jan 10, 2014 at 6:34 AM, Jeff Squyres (jsquyres)  wrote:

> On Jan 10, 2014, at 9:18 AM, "Jeff Squyres (jsquyres)" 
> wrote:
>
> >> It seems to indicate that even if one does find a statfs() function,
> there are multiple os-dependent versions and it should therefore be
> avoided.  Since statvfs() is defined by POSIX, it should be preferred.
> >
> > Sounds good; I'll do that.
>
> Gah.  The situation gets murkier.  I see in OS X Mountain Lion and
> Mavericks man pages for statvfs() where they describe the fields in struct
> statvfs:
>
>f_fsid Not meaningful in this implementation.
>
> This is the field I need out of struct statvfs to know what the file
> system magic number is.  Arrgh!
>
> I'll keep looking into what would be a good solution here...
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
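For readers following the statfs()/statvfs() discussion quoted above, here is
a minimal sketch -- not the actual opal_path_nfs() code -- of one way to
identify an NFS mount given that statvfs()'s f_fsid is "not meaningful" on
OS X. It assumes either Linux (statfs() with a numeric f_type) or an
OS X/FreeBSD-style statfs() with an f_fstypename string; other systems may
differ (NetBSD, for instance, reports the file system name through statvfs()
instead), which is exactly the murkiness described above.

#include <stdio.h>
#include <string.h>
#if defined(__linux__)
#  include <sys/vfs.h>              /* Linux statfs(): numeric f_type */
#  ifndef NFS_SUPER_MAGIC
#    define NFS_SUPER_MAGIC 0x6969  /* value from <linux/magic.h> */
#  endif
#else
#  include <sys/param.h>
#  include <sys/mount.h>            /* OS X / FreeBSD statfs(): f_fstypename */
#endif

/* Return non-zero if 'path' appears to live on an NFS mount. */
static int path_is_nfs(const char *path)
{
    struct statfs buf;

    if (statfs(path, &buf) != 0) {
        return 0;                   /* on error, just claim "not NFS" */
    }
#if defined(__linux__)
    return buf.f_type == NFS_SUPER_MAGIC;          /* compare magic number */
#else
    return 0 == strcmp(buf.f_fstypename, "nfs");   /* compare fs name string */
#endif
}

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : ".";
    printf("%s: %s\n", path, path_is_nfs(path) ? "nfs" : "not nfs");
    return 0;
}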



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] trunk - ibverbs configure error on Solaris-11

2014-01-10 Thread Paul Hargrove
FYI: still present in tonight's trunk tarball (1.9a1r30255).
Don't know if it was expected to be fixed or not.

-Paul


On Thu, Jan 9, 2014 at 2:24 PM, Paul Hargrove  wrote:

> Jeff,
>
> The requested config.log was attached
> as openmpi-trunk-solaris11-x64-ib-gcc452-config.log.bz2 in my recent
> response to the usnic-on-solaris thread:
> http://www.open-mpi.org/community/lists/devel/2014/01/13637.php
>
> It looks like there were 2 successful probes for ibv_open_device() before
> the failing one.
> The failing one says:
> configure:284324: checking for ibv_open_device in -libverbs
> configure:284349: gcc -std=gnu99 -o conftest -O3 -DNDEBUG -m64
> -finline-functions -fno-strict-aliasing -pthread -I$(top_srcdir)
> -I$(top_builddir) -I$(top_srcdir)/opal/include -I$(top_srcdir)/orte/include
> -I$(top_srcdir)/ompi/include -I$(top_srcdir)/oshmem/include
>  
> -I/shared/OMPI/openmpi-trunk-solaris11-x64-ib-gcc452/openmpi-1.9a1r30146/opal/mca/hwloc/hwloc172/hwloc/include
> -I/shared/OMPI/openmpi-trunk-solaris11-x64-ib-gcc452/BLD/opal/mca/hwloc/hwloc172/hwloc/include
> -I/shared/OMPI/openmpi-trunk-solaris11-x64-ib-gcc452/openmpi-1.9a1r30146/opal/mca/event/libevent2021/libevent
> -I/shared/OMPI/openmpi-trunk-solaris11-x64-ib-gcc452/openmpi-1.9a1r30146/opal/mca/event/libevent2021/libevent/include
> -I/shared/OMPI/openmpi-trunk-solaris11-x64-ib-gcc452/BLD/opal/mca/event/libevent2021/libevent/include
> -export-dynamicconftest.c -libverbs   -lsocket -lnsl  -lm  -lsocket
> -lnsl  -lm  >&5
> ld: fatal: entry point symbol 'xport-dynamic' is undefined
> collect2: ld returned 1 exit status
> configure:284349: $? = 1
>
> So, it looks like a bogus "-export-dynamic" argument to gcc is at fault
> here.
>
> -Paul
>
>
> On Thu, Jan 9, 2014 at 2:05 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> Paul --
>>
>> Can you send the config.log file from this?  It has more details in it
>> that will be useful (e.g., *why* ibv_open_device wasn't found in -libverbs).
>>
>> I wonder if the issue has to do something with our handling of the legacy
>> --with-openib switch...? (it's been deprecated on the trunk in favor of
>> --with-verbs)
>>
>>
>> On Jan 8, 2014, at 8:38 PM, Paul Hargrove  wrote:
>>
>> > When trying to configure the OMPI trunk on a Solaris-11/x86-64 with
>> --enable-openib, I see the following error not seen with the 1.7 branch:
>> >
>> > *** Compiler flags
>> > checking which of CFLAGS are ok for debugger modules...  -DNDEBUG -m64
>> -mt
>> > checking for debugger extra CFLAGS... -g
>> > checking for the C compiler vendor... (cached) sun
>> > checking if want to add padding to the openib BTL control header... no
>> > checking for fcntl.h... (cached) yes
>> > checking for sys/poll.h... (cached) yes
>> > checking infiniband/verbs.h usability... yes
>> > checking infiniband/verbs.h presence... yes
>> > checking for infiniband/verbs.h... yes
>> > looking for library without search path
>> > checking for ibv_open_device in -libverbs... no
>> > checking if ConnectX XRC support is enabled... no
>> > checking if dynamic SL is enabled... no
>> > configure: WARNING: Verbs support requested (via --with-verbs) but not
>> found.
>> > configure: WARNING: If you are using libibverbs v1.0 (i.e., OFED v1.0
>> or v1.1), you *MUST* have both the libsysfs headers and libraries
>> installed.  Later versions of libibverbs do not require libsysfs.
>> > configure: error: Aborting.
>> >
>> > This is despite an earlier:
>> > checking if MCA component btl:openib can compile... yes
>> >
>> > The error above is with the Solaris Studio compilers, but the same
>> happens with the GNU compilers.
>> > The (compressed) full configure output for the GNU case is attached.
>> >
>> > This is a regression relative to the current 1.7.4rc, in which the
>> openib btl works fine on Solaris-11/x86-64, by which I mean I can configure
>> with --enable-openib and the following command line works:
>> > mpirun -host pcp-j-19,pcp-j-20 -mca btl openib,self -np 2
>> examples/ring_c
>> >
>> > My best guess is that either the libsysfs requirement itself OR its
>> associated test is Linux-specific.
>> >
>> > -Paul
>> >
>> > --
>> > Paul H. Hargrove  phhargr...@lbl.gov
>> > Future Technologies Group
>> > Computer and Data Sciences Department Tel: +1-510-495-2352
>> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> > ___
>> > devel mailing list
>> > de...@open-mpi.org
>> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and 

Re: [OMPI devel] trunk - build failure on OpenBSD

2014-01-10 Thread Paul Hargrove
Appears to be fixed in tonight's trunk tarball (1.9a1r30255).

Thanks,
-Paul


On Fri, Jan 10, 2014 at 7:03 AM, Jeff Squyres (jsquyres)  wrote:

> This looks like how we handled this issue elsewhere in the OMPI code base,
> too.
>
> Mellanox: in the interest of getting another good tarball today, since
> it's the weekend for you, I'll apply this patch.
>
> (thanks Paul!)
>
>
> On Jan 10, 2014, at 2:20 AM, Paul Hargrove  wrote:
>
> > Based on how MAP_ANONYMOUS vs MAP_ANON is dealt with in
> opal/mca/memory/linux/malloc.c,  I believe the patch below is an
> appropriate solution for this issue.  Additionally, it handles the
> possibility that MAP_FAILED is not defined (not sure where that comes up,
> but opal/mca/memory/linux/malloc.c allows for it).
> >
> > -Paul
> >
> > Index: oshmem/mca/memheap/base/memheap_base_alloc.c
> > ===
> > --- oshmem/mca/memheap/base/memheap_base_alloc.c(revision 30223)
> > +++ oshmem/mca/memheap/base/memheap_base_alloc.c(working copy)
> > @@ -18,6 +18,12 @@
> >  #ifdef HAVE_SYS_MMAN_H
> >  #include 
> >  #endif
> > +#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
> > +# define MAP_ANONYMOUS MAP_ANON
> > +#endif
> > +#if !defined(MAP_FAILED)
> > +# define MAP_FAILED ((char*)-1)
> > +#endif
> >
> >  #include 
> >  #include 
> > @@ -278,10 +284,8 @@
> >  size,
> >  PROT_READ | PROT_WRITE,
> >  MAP_SHARED |
> > -#if defined (__APPLE__)
> > -MAP_ANON |
> > -#elif defined (__GNUC__)
> > -MAP_ANONYMOUS |
> > +#ifdef MAP_ANONYMOUS
> > +MAP_ANONYMOUS |
> >  #endif
> >  MAP_FIXED,
> >  0,
> >
> >
> >
> >
> > On Thu, Jan 9, 2014 at 8:35 PM, Paul Hargrove 
> wrote:
> > Same issue for NetBSD, too.
> >
> > -Paul
> >
> >
> > On Thu, Jan 9, 2014 at 7:09 PM, Paul Hargrove 
> wrote:
> > With the new opal/util/path.c I get farther building the trunk on
> OpenBSD but hit a new failure:
> >
> > Making all in mca/memheap
> >   CC   base/memheap_base_frame.lo
> >   CC   base/memheap_base_select.lo
> >   CC   base/memheap_base_alloc.lo
> >
> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:
> In function '_mmap_attach':
> >
> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
> error: 'MAP_ANONYMOUS' undeclared (first use in this function)
> >
> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
> error: (Each undeclared identifier is reported only once
> >
> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
> error: for each function it appears in.)
> > *** Error 1 in oshmem/mca/memheap (Makefile:1631
> 'base/memheap_base_alloc.lo': @echo "  CC  "
> base/memheap_base_alloc.lo;depbase=`echo b...)
> > *** Error 1 in oshmem (Makefile:1962 'all-recursive')
> > *** Error 1 in /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/BLD
> (Makefile:1685 'all-recursive')
> >
> > On OpenBSD one must use MAP_ANON rather than MAP_ANONYMOUS.
> >
> > -Paul
> >
> >
> > --
> > Paul H. Hargrove  phhargr...@lbl.gov
> > Future Technologies Group
> > Computer and Data Sciences Department Tel: +1-510-495-2352
> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> >
> >
> >
> > --
> > Paul H. Hargrove  phhargr...@lbl.gov
> > Future Technologies Group
> > Computer and Data Sciences Department Tel: +1-510-495-2352
> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> >
> >
> >
> > --
> > Paul H. Hargrove  phhargr...@lbl.gov
> > Future Technologies Group
> > Computer and Data Sciences Department Tel: +1-510-495-2352
> > Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
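For anyone who wants to exercise the MAP_ANONYMOUS/MAP_ANON fallback from the
patch above outside of the oshmem tree, here is a small standalone sketch of
the same idiom (the mapping size is just a placeholder; this is not the
oshmem code, which additionally uses MAP_FIXED with a precomputed base
address):

/* Standalone sketch of the MAP_ANON/MAP_ANONYMOUS fallback used in the
 * patch above; the mapping size is an arbitrary placeholder.
 */
#include <stdio.h>
#include <sys/mman.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#  define MAP_ANONYMOUS MAP_ANON    /* the BSDs spell it MAP_ANON */
#endif
#if !defined(MAP_FAILED)
#  define MAP_FAILED ((void *) -1)
#endif

int main(void)
{
    size_t size = 1 << 20;          /* 1 MiB, placeholder size */
    void *p = mmap(NULL, size,
                   PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS,  /* anonymous: no backing fd */
                   -1, 0);

    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    printf("mapped %zu bytes at %p\n", size, p);
    munmap(p, size);
    return 0;
}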


[hwloc-devel] Create success (hwloc git dev-34-g5198d4c)

2014-01-10 Thread MPI Team
Creating nightly hwloc snapshot git tarball was a success.

Snapshot:   hwloc dev-34-g5198d4c
Start time: Fri Jan 10 21:01:01 EST 2014
End time:   Fri Jan 10 21:03:36 EST 2014

Your friendly daemon,
Cyrador


Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Jeff Squyres (jsquyres)
Yes, I'm pretty sure we've seen that before, and it was ID'ed as either a local 
configuration issue or a PGI bug.


On Jan 10, 2014, at 7:51 PM, Paul Hargrove  wrote:

> 
> 
> 
> On Fri, Jan 10, 2014 at 4:46 PM, Paul Hargrove  wrote:
> 
> On Fri, Jan 10, 2014 at 4:43 PM, Jeff Squyres (jsquyres)  
> wrote:
> Don't worry about PGI 11.  I'm happy enough knowing that PGI 12 works.
> 
> Test is already running to satisfy my own curiosity.
> But I'll only post the result if something fails.
> 
> With pgi-11.1 something DID fail:
> 
>   CCLD libopen-pal.la
> /usr/bin/ld: 
> /global/common/carver/usg/pgi/11.1/linux86-64/11.1/lib/libpgbind.a(bindsa.o): 
> relocation R_X86_64_PC32 against `syscall@@GLIBC_2.2.5' can not be used when 
> making a shared object; recompile with -fPIC
> /usr/bin/ld: final link failed: Bad value
> make[2]: *** [libopen-pal.la] Error 2
> 
> This looks like a PGI bug.
> So, I'll try again for a pgi-11.x with x > 1.
> 
> -Paul 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Paul Hargrove
On Fri, Jan 10, 2014 at 4:46 PM, Paul Hargrove  wrote:

>
> On Fri, Jan 10, 2014 at 4:43 PM, Jeff Squyres (jsquyres) <
> jsquy...@cisco.com> wrote:
>
>> Don't worry about PGI 11.  I'm happy enough knowing that PGI 12 works.
>
>
> Test is already running to satisfy my own curiosity.
> But I'll only post the result if something fails.
>

With pgi-11.1 something DID fail:

  CCLD libopen-pal.la
/usr/bin/ld:
/global/common/carver/usg/pgi/11.1/linux86-64/11.1/lib/libpgbind.a(bindsa.o):
relocation R_X86_64_PC32 against `syscall@@GLIBC_2.2.5' can not be used
when making a shared object; recompile with -fPIC
/usr/bin/ld: final link failed: Bad value
make[2]: *** [libopen-pal.la] Error 2

This looks like a PGI bug.
So, I'll try again for a pgi-11.x with x > 1.

-Paul

-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Paul Hargrove
On Fri, Jan 10, 2014 at 4:43 PM, Jeff Squyres (jsquyres)  wrote:

> Don't worry about PGI 11.  I'm happy enough knowing that PGI 12 works.


Test is already running to satisfy my own curiosity.
But I'll only post the result if something fails.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Jeff Squyres (jsquyres)
Don't worry about PGI 11.  I'm happy enough knowing that PGI 12 works.

On Jan 10, 2014, at 6:59 PM, Paul Hargrove  wrote:

> Jeff,
> 
> I said earlier that PGI *12* has built mpi_f08 correctly in response to Larry 
> Baker asking about 11 and 12.
> I don't have a PGI 11 config on my list at the moment, but would be surprised 
> if I can't find one.
> I will look for a PGI 11, but am focused on the opal_path_nfs() stuff at the 
> moment.
> 
> -Paul
> 
> 
> On Fri, Jan 10, 2014 at 3:56 PM, Jeff Squyres (jsquyres)  
> wrote:
> On Jan 10, 2014, at 6:45 PM, Paul Hargrove  wrote:
> 
> > Keep in mind that I have no specific reason to think pgi-10 should be 
> > accepted for building mpi_f08.
> > My only observation was that it seemed to be rejected w/ less configure 
> > testing than was applied to accept 8.0 and 9.0.
> 
> Got it.
> 
> I see the reason, and it's weird.
> 
> PGI 8 and 9 support Fortran IGNORE TKR syntax, and it looks like PGI 10 does 
> not.  Truly odd.
> 
> Do you have the PGI 11 compiler?  I thought someone said earlier that mpi_f08 
> worked with their PGI 11 compiler (which means that ignore TKR came back in 
> PGI 11 -- maybe it was just a bug in your rev of PGI 10?).
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Paul Hargrove
Jeff,

I said earlier that PGI *12* has built mpi_f08 correctly in response to
Larry Baker asking about 11 and 12.
I don't have a PGI 11 config on my list at the moment, but would be
surprised if I can't find one.
I will look for a PGI 11, but am focused on the opal_path_nfs() stuff at
the moment.

-Paul


On Fri, Jan 10, 2014 at 3:56 PM, Jeff Squyres (jsquyres)  wrote:

> On Jan 10, 2014, at 6:45 PM, Paul Hargrove  wrote:
>
> > Keep in mind that I have no specific reason to think pgi-10 should be
> accepted for building mpi_f08.
> > My only observation was that it seemed to be rejected w/ less configure
> testing than was applied to accept 8.0 and 9.0.
>
> Got it.
>
> I see the reason, and it's weird.
>
> PGI 8 and 9 support Fortran IGNORE TKR syntax, and it looks like PGI 10
> does not.  Truly odd.
>
> Do you have the PGI 11 compiler?  I thought someone said earlier that
> mpi_f08 worked with their PGI 11 compiler (which means that ignore TKR came
> back in PGI 11 -- maybe it was just a bug in your rev of PGI 10?).
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 10, 2014, at 6:45 PM, Paul Hargrove  wrote:

> Keep in mind that I have no specific reason to think pgi-10 should be 
> accepted for building mpi_f08.
> My only observation was that it seemed to be rejected w/ less configure 
> testing than was applied to accept 8.0 and 9.0.

Got it.

I see the reason, and it's weird.

PGI 8 and 9 support Fortran IGNORE TKR syntax, and it looks like PGI 10 does 
not.  Truly odd.

Do you have the PGI 11 compiler?  I thought someone said earlier that mpi_f08 
worked with their PGI 11 compiler (which means that ignore TKR came back in PGI 
11 -- maybe it was just a bug in your rev of PGI 10?).

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Paul Hargrove
On Fri, Jan 10, 2014 at 3:33 PM, Jeff Squyres (jsquyres)  wrote:

> Can you send the output from pgi-10?  We don't reject based on compiler
> name/version -- it should be all behavior-based checks...


Attached.

Keep in mind that I have no specific reason to think pgi-10 should be
accepted for building mpi_f08.
My only observation was that it seemed to be rejected w/ less configure
testing than was applied to accept 8.0 and 9.0.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


openmpi-1.7-latest-linux-x86_64-pgi-10.0-config.log.bz2
Description: BZip2 compressed data


Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 10, 2014, at 1:26 PM, Paul Hargrove  wrote:

> OMPI's configure says pgi-8.0 and pgi-9.0 are "good".
> But pgi-10.0 is rejected without even subjecting it to the tests.
> This situation (8.0 and 9.0 "better" than 10.0) sounds fishy to me.

That's true.

Can you send the output from pgi-10?  We don't reject based on compiler 
name/version -- it should be all behavior-based checks...

> You didn't miss anything because I was focused on the idea that mpi_f08 
> shouldn't even have been attempted on these compilers.  See below for the 
> pgi-9.0 error messages.  8.0 was similar but output has been lost (scratch 
> f/s expiry).

This was enough for me to figure out what I think the issue is.

I was doing one BIND(C) configure test -- it looks like I need to do some 
additional variations of the BIND(C) test.  With these additional tests, I'll 
bet that we'll rule that we won't build the mpi_f08 module with pgi 8/9.

I should have something checked into the trunk soon (for tonight's tarball).  
Let's see how that does before we bring it over to v1.7 -- we might need to 
iterate once or twice before getting it right.

Thank you!

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Ralph Castain
I *believe* oob can now support virtual interfaces, but can't swear to it - 
only very lightly tested on my box.

I'll mark this in for resolving in 1.7.5


On Jan 10, 2014, at 1:55 PM, Paul Hargrove  wrote:

> Ralph,
> 
> Since this turned out to be a matter of an unsupported system configuration, 
> it is my opinion that this doesn't need to be addressed for 1.7.4 if it would 
> cause any further delay.
> 
> Also, I noticed this system has lo and lo:0.
> I know the TCP BTL doesn't support virtual interfaces (trac ticket 3339).
> So, I mention it here in case oob:tcp has similar issues.
> 
> -Paul
> 
> 
> On Fri, Jan 10, 2014 at 1:02 PM, Ralph Castain  wrote:
> 
> On Jan 10, 2014, at 12:59 PM, Paul Hargrove  wrote:
> 
>> Ralph,
>> 
>> This is the front end of a production cluster at NERSC.
>> So, I would not be surprised if there is a fairly restrictive firewall 
>> configuration in place.
>> However, I couldn't find a way to query the configuration.
>> 
> 
> Aha - indeed, that is the problem.
> 
>> The verbose output with (only) "-mca oob_base_verbose 10" is attached.
>> 
>> On a hunch, I tried adding "-mca oob_tcp_if_include lo" and IT WORKS!
>> Is there some reason why the loopback interface is not being used 
>> automatically for the single-host case?
>> That would seem to be a straightforward solution to this issue.
> 
> Yeah, we should do a better job of that - I'll take a look and see what can 
> be done in the near term.
> 
> Thanks!
> Ralph
> 
>> 
>> -Paul
>> 
>> 
>> On Fri, Jan 10, 2014 at 12:43 PM, Ralph Castain  wrote:
>> Bingo - the proc can't send a message to the daemon to tell it "i'm alive 
>> and need my nidmap data". I suspect we'll find that your headnode isn't 
>> allowing us to open a socket for communication between two processes on it, 
>> and we don't have (yet) a pipe-like mechanism to replace it.
>> 
>> Can verify that by putting "-mca oob_base_verbose 10" on the cmd line - 
>> should see the oob indicate that it fails to make the connection back to the 
>> daemon
>> 
>> 
>> On Jan 10, 2014, at 12:33 PM, Paul Hargrove  wrote:
>> 
>>> Ralph,
>>> 
>>> Configuring using a proper --with-tm=... I find that I *can* run a 
>>> singleton in an allocation ("qsub -I -l nodes=1 ").
>>> The case of a singleton on the front end is still failing.
>>> 
>>> The verbose output using "-mca state_base_verbose 5 -mca plm_base_verbose 5 
>>> -mca odls_base_verbose 5" is attached.
>>> 
>>> -Paul
>>> 
>>> 
>>> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain  wrote:
>>> 
>>> On Jan 10, 2014, at 11:04 AM, Paul Hargrove  wrote:
>>> 
 On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove  wrote:
 
 On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain  wrote:
 ??? that was it? Was this built with --enable-debug?
 
 Nope, I missed --enable-debug.  Will try again.
 
 
 OK, Take-2 below.
 There is an obvious "recipient list is empty!" in the output.
>>> 
>>> That one is correct and expected - all it means is that you are running on 
>>> only one node, so mpirun doesn't need to relay messages to another daemon
>>> 
 
 -Paul
 
 $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca 
 orte_nidmap_verbose 10 examples/ring_c'
 [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
 [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set 
 priority to 10
 [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
 [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
 [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
 [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 
 1
 [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
 [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is 
 empty!
 [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
 [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 
 bytes
 [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
 [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] 
 CONTRIBUTE 2
 [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
 [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] 
 CONTRIBUTE 2
 [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
 [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] 
 CONTRIBUTE 2
 [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
 [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set 
 priority to 10
 [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
 [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
 [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data 

[OMPI devel] NUMA bug in openib BTL device selection

2014-01-10 Thread Rolf vandeVaart
I believe I found a bug in openib BTL and just want to see if folks agree with 
this.  When we are running on a NUMA node and we are bound to a CPU, we only 
want to use the IB device that is closest to us.  However, I observed that we 
always used both devices regardless.  I believe there is a bug in computing the 
distances and the below change fixes it.   This was introduced with r26391 when 
we switched to using hwloc to determine distances.  It is a simple error where 
we are supposed to be accessing the array with i+j*size.

With this change, we will only use the IB devices that are close to us.

Any comments?  Otherwise, I will commit.

Rolf

Index: ompi/mca/btl/openib/btl_openib_component.c
===
--- ompi/mca/btl/openib/btl_openib_component.c  (revision 30175)
+++ ompi/mca/btl/openib/btl_openib_component.c  (working copy)
@@ -2202,10 +2202,10 @@
 if (NULL != my_obj) {
 /* Distance may be asymetrical, so calculate both of them
and take the max */
-a = hwloc_distances->latency[my_obj->logical_index *
+a = hwloc_distances->latency[my_obj->logical_index +
  (ibv_obj->logical_index * 
   hwloc_distances->nbobjs)];
-b = hwloc_distances->latency[ibv_obj->logical_index *
+b = hwloc_distances->latency[ibv_obj->logical_index +
  (my_obj->logical_index * 
   hwloc_distances->nbobjs)];
 distance = (a > b) ? a : b;
@@ -2224,10 +2224,10 @@
 ibv_obj->cpuset, 
 HWLOC_OBJ_NODE, 
++i)) {
 
-a = hwloc_distances->latency[node_obj->logical_index *
+a = hwloc_distances->latency[node_obj->logical_index +
  (ibv_obj->logical_index * 
   hwloc_distances->nbobjs)];
-b = hwloc_distances->latency[ibv_obj->logical_index *
+b = hwloc_distances->latency[ibv_obj->logical_index +
  (node_obj->logical_index * 
   hwloc_distances->nbobjs)];
 a = (a > b) ? a : b;
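A standalone toy (not the openib code) showing why the indexing matters:
hwloc's distance matrix is a flat array of nbobjs*nbobjs entries, so an
element is reached by adding one index to the other index scaled by nbobjs;
multiplying the two indices, as the pre-fix expression effectively did, lands
on an unrelated entry (or past the end of the array):

/* Toy illustration: an N x N matrix stored flat, as hwloc's distance
 * matrix is.  Element (i,j) is flat[i*N + j]; the transposed element is
 * flat[i + j*N].  The broken expression multiplied the indices instead.
 */
#include <stdio.h>

int main(void)
{
    enum { N = 3 };
    double latency[N * N];

    /* Fill with recognizable values: entry (i,j) holds 10*i + j. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            latency[i * N + j] = 10.0 * i + j;

    int i = 1, j = 2;
    printf("correct i*N+j  : %.0f\n", latency[i * N + j]);    /* 12 */
    printf("correct i+j*N  : %.0f\n", latency[i + j * N]);    /* 21 (transposed) */
    printf("broken  i*(j*N): %.0f\n", latency[i * (j * N)]);  /* 20 -- wrong entry */
    return 0;
}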
---
This email message is for the sole use of the intended recipient(s) and may 
contain
confidential information.  Any unauthorized review, use, disclosure or 
distribution
is prohibited.  If you are not the intended recipient, please contact the 
sender by
reply email and destroy all copies of the original message.
---


Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
Ralph,

Since this turned out to be a matter of an unsupported system
configuration, it is my opinion that this doesn't need to be addressed for
1.7.4 if it would cause any further delay.

Also, I noticed this system has lo and lo:0.
I know the TCP BTL doesn't support virtual interfaces (trac ticket 3339).
So, I mention it here in case oob:tcp has similar issues.

-Paul


On Fri, Jan 10, 2014 at 1:02 PM, Ralph Castain  wrote:

>
> On Jan 10, 2014, at 12:59 PM, Paul Hargrove  wrote:
>
> Ralph,
>
> This is the front end of a production cluster at NERSC.
> So, I would not be surprised if there is a fairly restrictive firewall
> configuration in place.
> However, I couldn't find a way to query the configuration.
>
>
> Aha - indeed, that is the problem.
>
>
> The verbose output with (only) "-mca oob_base_verbose 10" is attached.
>
> On a hunch, I tried adding "-mca oob_tcp_if_include lo" and IT WORKS!
> Is there some reason why the loopback interface is not being used
> automatically for the single-host case?
> That would seem to be a straightforward solution to this issue.
>
>
> Yeah, we should do a better job of that - I'll take a look and see what
> can be done in the near term.
>
> Thanks!
> Ralph
>
>
> -Paul
>
>
> On Fri, Jan 10, 2014 at 12:43 PM, Ralph Castain  wrote:
>
>> Bingo - the proc can't send a message to the daemon to tell it "i'm alive
>> and need my nidmap data". I suspect we'll find that your headnode isn't
>> allowing us to open a socket for communication between two processes on it,
>> and we don't have (yet) a pipe-like mechanism to replace it.
>>
>> Can verify that by putting "-mca oob_base_verbose 10" on the cmd line -
>> should see the oob indicate that it fails to make the connection back to
>> the daemon
>>
>>
>> On Jan 10, 2014, at 12:33 PM, Paul Hargrove  wrote:
>>
>> Ralph,
>>
>> Configuring using a proper --with-tm=... I find that I *can* run a
>> singleton in an allocation ("qsub -I -l nodes=1 ").
>> The case of a singleton on the front end is still failing.
>>
>> The verbose output using "-mca state_base_verbose 5 -mca
>> plm_base_verbose 5 -mca odls_base_verbose 5" is attached.
>>
>> -Paul
>>
>>
>> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain  wrote:
>>
>>>
>>> On Jan 10, 2014, at 11:04 AM, Paul Hargrove  wrote:
>>>
>>> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove wrote:
>>>

 On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain wrote:

> ??? that was it? Was this built with --enable-debug?


 Nope, I missed --enable-debug.  Will try again.


>>> OK, Take-2 below.
>>> There is an obvious "recipient list is empty!" in the output.
>>>
>>>
>>> That one is correct and expected - all it means is that you are running
>>> on only one node, so mpirun doesn't need to relay messages to another daemon
>>>
>>>
>>> -Paul
>>>
>>> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca
>>> orte_nidmap_verbose 10 examples/ring_c'
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set
>>> priority to 10
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0]
>>> tag 1
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
>>> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list
>>> is empty!
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55
>>> bytes
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
>>> CONTRIBUTE 2
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
>>> CONTRIBUTE 2
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
>>> CONTRIBUTE 2
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set
>>> priority to 10
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
>>> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key
>>> not found in file
>>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>>> at line 503
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set
>>> priority to 10
>>> 

Re: [OMPI devel] [EXTERNAL] Re: 1.7.4rc2r30168 - configure failure on Mac OSX 10.5

2014-01-10 Thread Barrett, Brian W
Agreed, let's drop 10.5. I don't want to fix that bug given its likely 
customer base...

Brian



Sent with Good (www.good.com)


-Original Message-
From: Ralph Castain [r...@open-mpi.org]
Sent: Friday, January 10, 2014 08:14 AM Mountain Standard Time
To: Open MPI Developers
Subject: [EXTERNAL] Re: [OMPI devel] 1.7.4rc2r30168 - configure failure on Mac 
OSX 10.5

And we do appreciate your breakage! :-)

I think we'll just drop 10.5 from the list as that's very old and likely not 
worth fixing


On Jan 9, 2014, at 4:50 PM, Paul Hargrove 
> wrote:

Ralph,

I can build fine on 10.7 (the system I am typing on now), and on 10.6 too.

I have no strong opinion on fix-vs-document, but as Jeff knows quite well if 
you say you support it I am going to try to make it break :).

-Paul


On Thu, Jan 9, 2014 at 4:46 PM, Ralph Castain 
> wrote:
I dunno if we really go back that far, Paul - I doubt anyone has tested on 
anything less than 10.8, frankly. Might be better if we update to not make 
claims that far back.

Were you able to build/run on 10.7?

On Jan 9, 2014, at 3:25 PM, Paul Hargrove 
> wrote:

As I noted in another email, 1.7.4's README claims support for Mac OSX versions 
10.5 through 10.7.  So, I just now tried (but failed) to build on 10.5 
(Leopard):

*** Assembler
checking dependency style of gcc -std=gnu99... gcc3
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -p
checking the name lister (/usr/bin/nm -p) interface... BSD nm
checking for fgrep... /usr/bin/grep -F
checking if need to remove -g from CCASFLAGS... OS X Leopard - yes ( -O3 
-DNDEBUG -finline-functions -fno-strict-aliasing)
checking whether to enable smp locks... yes
checking if .proc/endp is needed... no
checking directive for setting text section... .text
checking directive for exporting symbols... .globl
checking for objdump... no
checking if .note.GNU-stack is needed... no
checking suffix for labels... :
checking prefix for global symbol labels... none
configure: error: Could not determine global symbol label prefix

The same failure is seen on a PPC system running OSX Leopard, too.  However, I 
figure it best to focus on getting x86 working first before worrying any about 
PPC.

The only configure option used was --prefix.
The bzip2-compressed config.log is attached.

-Paul

--
Paul H. Hargrove  
phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: 
+1-510-495-2352
Lawrence Berkeley National Laboratory Fax: 
+1-510-486-6900
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Paul H. Hargrove  
phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] [EXTERNAL] Re: MX and PSM in 1.7.4

2014-01-10 Thread Barrett, Brian W
I'm not actually sure about MX. I was testing, but since the last release our 
machine has been retired. So it's possible we're missing coverage there.

Brian



Sent with Good (www.good.com)


-Original Message-
From: Ralph Castain [r...@open-mpi.org]
Sent: Thursday, January 09, 2014 09:56 PM Mountain Standard Time
To: Open MPI Developers
Subject: [EXTERNAL] Re: [OMPI devel] MX and PSM in 1.7.4

So far as I know, yes - still being tested and used. Glad to hear you could 
validate the QLogic stuff - I don't know about Myrinet, but imagine someone 
will shout if it has an issue



On Jan 9, 2014, at 5:52 PM, Paul Hargrove 
> wrote:

Is anybody still testing MX and PSM?
They are both still present in ompi/mca/mtl/

I have access to a system w/ QLogic HCAs w/ PSM and have verified that I can:
$ mpirun -mca btl sm,self -np 2 -host n15,n16 -mca mtl psm examples/ring_c

I have an x86 (32-bit) system w/ MX headers and libs that I have successfully 
configured and built on.
However, I no longer have Myrinet h/w (well, there is some in a box in the 
machine room but my dedication to Open MPI rc testing doesn't extend far enough 
to install the h/w).

-Paul

--
Paul H. Hargrove  
phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: 
+1-510-495-2352
Lawrence Berkeley National Laboratory Fax: 
+1-510-486-6900
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Ralph Castain

On Jan 10, 2014, at 12:59 PM, Paul Hargrove  wrote:

> Ralph,
> 
> This is the front end of a production cluster at NERSC.
> So, I would not be surprised if there is a fairly restrictive firewall 
> configuration in place.
> However, I couldn't find a way to query the configuration.
> 

Aha - indeed, that is the problem.

> The verbose output with (only) "-mca oob_base_verbose 10" is attached.
> 
> On a hunch, I tried adding "-mca oob_tcp_if_include lo" and IT WORKS!
> Is there some reason why the loopback interface is not being used 
> automatically for the single-host case?
> That would seem to be a straightforward solution to this issue.

Yeah, we should do a better job of that - I'll take a look and see what can be 
done in the near term.

Thanks!
Ralph

> 
> -Paul
> 
> 
> On Fri, Jan 10, 2014 at 12:43 PM, Ralph Castain  wrote:
> Bingo - the proc can't send a message to the daemon to tell it "i'm alive and 
> need my nidmap data". I suspect we'll find that your headnode isn't allowing 
> us to open a socket for communication between two processes on it, and we 
> don't have (yet) a pipe-like mechanism to replace it.
> 
> Can verify that by putting "-mca oob_base_verbose 10" on the cmd line - 
> should see the oob indicate that it fails to make the connection back to the 
> daemon
> 
> 
> On Jan 10, 2014, at 12:33 PM, Paul Hargrove  wrote:
> 
>> Ralph,
>> 
>> Configuring using a proper --with-tm=... I find that I *can* run a singleton 
>> in an allocation ("qsub -I -l nodes=1 ").
>> The case of a singleton on the front end is still failing.
>> 
>> The verbose output using "-mca state_base_verbose 5 -mca plm_base_verbose 5 
>> -mca odls_base_verbose 5" is attached.
>> 
>> -Paul
>> 
>> 
>> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain  wrote:
>> 
>> On Jan 10, 2014, at 11:04 AM, Paul Hargrove  wrote:
>> 
>>> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove  wrote:
>>> 
>>> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain  wrote:
>>> ??? that was it? Was this built with --enable-debug?
>>> 
>>> Nope, I missed --enable-debug.  Will try again.
>>> 
>>> 
>>> OK, Take-2 below.
>>> There is an obvious "recipient list is empty!" in the output.
>> 
>> That one is correct and expected - all it means is that you are running on 
>> only one node, so mpirun doesn't need to relay messages to another daemon
>> 
>>> 
>>> -Paul
>>> 
>>> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca 
>>> orte_nidmap_verbose 10 examples/ring_c'
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set 
>>> priority to 10
>>> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
>>> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
>>> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is 
>>> empty!
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>>> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 
>>> 2
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 
>>> 2
>>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
>>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 
>>> 2
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set 
>>> priority to 10
>>> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
>>> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not 
>>> found in file 
>>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>>>  at line 503
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set 
>>> priority to 10
>>> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
>>> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
>>> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not 
>>> found in file 
>>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>>>  at line 503
>> 
>> 
>> This is very weird - it appears that your procs are looking for hostname 
>> data prior to receiving 

Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
Ralph,

This is the front end of a production cluster at NERSC.
So, I would not be surprised if there is a fairly restrictive firewall
configuration in place.
However, I couldn't find a way to query the configuration.

The verbose output with (only) "-mca oob_base_verbose 10" is attached.

On a hunch, I tried adding "-mca oob_tcp_if_include lo" and IT WORKS!
Is there some reason why the loopback interface is not being used
automatically for the single-host case?
That would seem to be a straightforward solution to this issue.

-Paul


On Fri, Jan 10, 2014 at 12:43 PM, Ralph Castain  wrote:

> Bingo - the proc can't send a message to the daemon to tell it "i'm alive
> and need my nidmap data". I suspect we'll find that your headnode isn't
> allowing us to open a socket for communication between two processes on it,
> and we don't have (yet) a pipe-like mechanism to replace it.
>
> Can verify that by putting "-mca oob_base_verbose 10" on the cmd line -
> should see the oob indicate that it fails to make the connection back to
> the daemon
>
>
> On Jan 10, 2014, at 12:33 PM, Paul Hargrove  wrote:
>
> Ralph,
>
> Configuring using a proper --with-tm=... I find that I *can* run a
> singleton in an allocation ("qsub -I -l nodes=1 ").
> The case of a singleton on the front end is still failing.
>
> The verbose output using "-mca state_base_verbose 5 -mca plm_base_verbose
> 5 -mca odls_base_verbose 5" is attached.
>
> -Paul
>
>
> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain  wrote:
>
>>
>> On Jan 10, 2014, at 11:04 AM, Paul Hargrove  wrote:
>>
>> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove wrote:
>>
>>>
>>> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain wrote:
>>>
 ??? that was it? Was this built with --enable-debug?
>>>
>>>
>>> Nope, I missed --enable-debug.  Will try again.
>>>
>>>
>> OK, Take-2 below.
>> There is an obvious "recipient list is empty!" in the output.
>>
>>
>> That one is correct and expected - all it means is that you are running
>> on only one node, so mpirun doesn't need to relay messages to another daemon
>>
>>
>> -Paul
>>
>> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca
>> orte_nidmap_verbose 10 examples/ring_c'
>> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
>> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set
>> priority to 10
>> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
>> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0]
>> tag 1
>> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
>> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is
>> empty!
>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55
>> bytes
>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
>> CONTRIBUTE 2
>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
>> CONTRIBUTE 2
>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
>> CONTRIBUTE 2
>> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
>> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set
>> priority to 10
>> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
>> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
>> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not
>> found in file
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>> at line 503
>> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
>> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set
>> priority to 10
>> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
>> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
>> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not
>> found in file
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>> at line 503
>>
>>
>>
>> This is very weird - it appears that your procs are looking for hostname
>> data prior to receiving the necessary data. Let's try jacking up the debug,
>> I guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca
>> odls_base_verbose 5"
>>
>> Sorry that will be rather wordy, but I don't understand the ordering you
>> show above. It's like your procs are skipping a bunch of steps in the
>> startup procedure.
>>
>> Out of curiosity, if 

Re: [OMPI devel] callback debugging

2014-01-10 Thread Ralph Castain

On Jan 10, 2014, at 12:45 PM, Adrian Reber  wrote:

> On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
>> 
>> On Jan 10, 2014, at 8:02 AM, Adrian Reber  wrote:
>> 
>>> I am currently trying to understand how callbacks are working. Right now
>>> I am looking at orte/mca/rml/base/rml_base_receive.c
>>> orte_rml_base_comm_start() which does 
>>> 
>>>   orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
>>>   ORTE_RML_TAG_RML_INFO_UPDATE,
>>>   ORTE_RML_PERSISTENT,
>>>   orte_rml_base_recv,
>>>   NULL);
>>> 
>>> As far as I understand it orte_rml_base_recv() is the callback function.
>>> At which point should this function run? When the data is actually
>>> received?
>> 
>> Not precisely. When data is received by the OOB, it pushes the data into an 
>> event. When that event gets serviced, it calls the orte_rml_base_receive 
>> function which processes the data to find the matching tag, and then uses 
>> that to execute the callback to the user code.
>> 
>>> 
>>> The same for send_buffer_nb() functions. I do not see the callback
>>> functions actually running. How can I verify that the callback functions
>>> are running. Especially for the send case it sounds pretty obvious how
>>> it should work but I never see the callback function running. At least
>>> in my setup.
>> 
>> The data is not immediately sent. It gets pushed into an event. When that 
>> event gets serviced, it calls the orte_oob_base_send function which then 
>> passes the data to each active OOB component until one of them says it can 
>> send it. The data is then pushed into another event to get it into the event 
>> base for that component's active module - when that event gets serviced, the 
>> data is sent. Once the data is sent, an event is created that, when 
>> serviced, executes the callback to the user code.
>> 
>> If you aren't seeing callbacks, the most likely cause is that the orte 
>> progress thread isn't running. Without it, none of this will work.
> 
> Thanks. Running configure without '--with-ft=cr' I can run a program and
> use orte-top. In orterun I can see that the callback is running and
> orte-top displays the retrieved information. I can also see in orte-top
> that the callbacks are working.

Actually, I'm rather impressed - I hadn't tested orte-top and didn't honestly 
know if it would work any more! Glad to hear it does :-)

> Doing the same with '--with-ft=cr'
> enabled orte-top crashes as well as orte-checkpoint and both (-top and
> -checkpoint) seem to no longer have working callbacks and that is why
> they are probably crashing. So some code which is enabled by '--with-ft=cr'
> seems to break callbacks in orte-top as well as in orte-checkpoint.
> orterun handles callbacks no matter if configured with or without
> '--with-ft=cr'.

I can take a look this weekend - probably something silly

> 
>   Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] callback debugging

2014-01-10 Thread Adrian Reber
On Fri, Jan 10, 2014 at 09:48:14AM -0800, Ralph Castain wrote:
> 
> On Jan 10, 2014, at 8:02 AM, Adrian Reber  wrote:
> 
> > I am currently trying to understand how callbacks are working. Right now
> > I am looking at orte/mca/rml/base/rml_base_receive.c
> > orte_rml_base_comm_start() which does 
> > 
> >orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
> >ORTE_RML_TAG_RML_INFO_UPDATE,
> >ORTE_RML_PERSISTENT,
> >orte_rml_base_recv,
> >NULL);
> > 
> > As far as I understand it orte_rml_base_recv() is the callback function.
> > At which point should this function run? When the data is actually
> > received?
> 
> Not precisely. When data is received by the OOB, it pushes the data into an 
> event. When that event gets serviced, it calls the orte_rml_base_receive 
> function which processes the data to find the matching tag, and then uses 
> that to execute the callback to the user code.
> 
> > 
> > The same for send_buffer_nb() functions. I do not see the callback
> > functions actually running. How can I verify that the callback functions
> > are running. Especially for the send case it sounds pretty obvious how
> > it should work but I never see the callback function running. At least
> > in my setup.
> 
> The data is not immediately sent. It gets pushed into an event. When that 
> event gets serviced, it calls the orte_oob_base_send function which then 
> passes the data to each active OOB component until one of them says it can 
> send it. The data is then pushed into another event to get it into the event 
> base for that component's active module - when that event gets serviced, the 
> data is sent. Once the data is sent, an event is created that, when serviced, 
> executes the callback to the user code.
> 
> If you aren't seeing callbacks, the most likely cause is that the orte 
> progress thread isn't running. Without it, none of this will work.

Thanks. Configuring without '--with-ft=cr', I can run a program and
use orte-top: in orterun I can see that the callback is running, and
orte-top displays the retrieved information. I can also see in orte-top
that the callbacks are working. Doing the same with '--with-ft=cr'
enabled, orte-top crashes, as does orte-checkpoint, and both (-top and
-checkpoint) seem to no longer have working callbacks, which is probably
why they are crashing. So some code enabled by '--with-ft=cr' seems to
break callbacks in orte-top as well as in orte-checkpoint. orterun
handles callbacks fine whether configured with or without '--with-ft=cr'.

Adrian
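A toy, non-ORTE sketch of the pattern Ralph describes in the quoted text
above: a non-blocking send only queues an event, and the user callback fires
when the event loop services that event, so if the progress loop never runs,
no callback ever fires. All of the names below are made up for illustration;
this is not the orte_rml/oob API.

/* Toy illustration of deferred callbacks: "sending" only queues an event,
 * and the user's callback runs when the event loop services that event.
 */
#include <stdio.h>

typedef void (*send_cbfunc_t)(const char *buffer, void *cbdata);

struct event {
    char buffer[64];
    send_cbfunc_t cbfunc;
    void *cbdata;
};

static struct event queue[8];
static int nqueued = 0;

/* Non-blocking send: just push an event; nothing is "sent" yet. */
static void send_buffer_nb(const char *buffer, send_cbfunc_t cbfunc, void *cbdata)
{
    struct event *ev = &queue[nqueued++];
    snprintf(ev->buffer, sizeof(ev->buffer), "%s", buffer);
    ev->cbfunc = cbfunc;
    ev->cbdata = cbdata;
    printf("queued: %s\n", buffer);
}

/* Progress loop: servicing an event does the work, then fires the callback.
 * If this never runs (e.g., the progress thread isn't running), no callback
 * ever fires -- which is the symptom being debugged in this thread.
 */
static void progress(void)
{
    for (int i = 0; i < nqueued; i++) {
        printf("sending: %s\n", queue[i].buffer);
        queue[i].cbfunc(queue[i].buffer, queue[i].cbdata);
    }
    nqueued = 0;
}

static void my_callback(const char *buffer, void *cbdata)
{
    printf("callback fired for: %s (cbdata=%s)\n", buffer, (const char *) cbdata);
}

int main(void)
{
    send_buffer_nb("hello", my_callback, (void *) "user-data");
    /* No callback has run yet; it only runs once the event is serviced: */
    progress();
    return 0;
}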


Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Ralph Castain
Bingo - the proc can't send a message to the daemon to tell it "i'm alive and 
need my nidmap data". I suspect we'll find that your headnode isn't allowing us 
to open a socket for communication between two processes on it, and we don't 
have (yet) a pipe-like mechanism to replace it.

Can verify that by putting "-mca oob_base_verbose 10" on the cmd line - should 
see the oob indicate that it fails to make the connection back to the daemon


On Jan 10, 2014, at 12:33 PM, Paul Hargrove  wrote:

> Ralph,
> 
> Configuring using a proper --with-tm=... I find that I *can* run a singleton 
> in an allocation ("qsub -I -l nodes=1 ").
> The case of a singleton on the front end is still failing.
> 
> The verbose output using "-mca state_base_verbose 5 -mca plm_base_verbose 5 
> -mca odls_base_verbose 5" is attached.
> 
> -Paul
> 
> 
> On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain  wrote:
> 
> On Jan 10, 2014, at 11:04 AM, Paul Hargrove  wrote:
> 
>> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove  wrote:
>> 
>> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain  wrote:
>> ??? that was it? Was this built with --enable-debug?
>> 
>> Nope, I missed --enable-debug.  Will try again.
>> 
>> 
>> OK, Take-2 below.
>> There is an obvious "recipient list is empty!" in the output.
> 
> That one is correct and expected - all it means is that you are running on 
> only one node, so mpirun doesn't need to relay messages to another daemon
> 
>> 
>> -Paul
>> 
>> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca 
>> orte_nidmap_verbose 10 examples/ring_c'
>> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
>> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set 
>> priority to 10
>> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
>> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
>> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
>> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is 
>> empty!
>> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
>> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
>> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
>> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
>> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set 
>> priority to 10
>> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
>> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
>> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not 
>> found in file 
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>>  at line 503
>> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
>> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set 
>> priority to 10
>> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
>> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
>> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not 
>> found in file 
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>>  at line 503
> 
> 
> This is very weird - it appears that your procs are looking for hostname data 
> prior to receiving the necessary data. Let's try jacking up the debug, I 
> guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca 
> odls_base_verbose 5"
> 
> Sorry that will be rather wordy, but I don't understand the ordering you show 
> above. It's like your procs are skipping a bunch of steps in the startup 
> procedure.
> 
> Out of curiosity, if you do have an allocation and run on it, does it work?
> 
>> 
>>  
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove   

Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
Ralph,

Configuring using a proper --with-tm=... I find that I *can* run a
singleton in an allocation ("qsub -I -l nodes=1 ").
The case of a singleton on the front end is still failing.

The verbose output using "-mca state_base_verbose 5 -mca plm_base_verbose 5
-mca odls_base_verbose 5" is attached.

-Paul


On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain  wrote:

>
> On Jan 10, 2014, at 11:04 AM, Paul Hargrove  wrote:
>
> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove wrote:
>
>>
>> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain  wrote:
>>
>>> ??? that was it? Was this built with --enable-debug?
>>
>>
>> Nope, I missed --enable-debug.  Will try again.
>>
>>
> OK, Take-2 below.
> There is an obvious "recipient list is empty!" in the output.
>
>
> That one is correct and expected - all it means is that you are running on
> only one node, so mpirun doesn't need to relay messages to another daemon
>
>
> -Paul
>
> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca
> orte_nidmap_verbose 10 examples/ring_c'
> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set
> priority to 10
> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag
> 1
> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is
> empty!
> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55
> bytes
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
> CONTRIBUTE 2
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
> CONTRIBUTE 2
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1]
> CONTRIBUTE 2
> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set
> priority to 10
> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not
> found in file
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
> at line 503
> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set
> priority to 10
> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not
> found in file
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
> at line 503
>
>
>
> This is very weird - it appears that your procs are looking for hostname
> data prior to receiving the necessary data. Let's try jacking up the debug,
> I guess - add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca
> odls_base_verbose 5"
>
> Sorry that will be rather wordy, but I don't understand the ordering you
> show above. It's like your procs are skipping a bunch of steps in the
> startup procedure.
>
> Out of curiosity, if you do have an allocation and run on it, does it work?
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>  ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


log-fe.bz2
Description: BZip2 compressed data


Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
On Fri, Jan 10, 2014 at 12:12 PM, Ralph Castain  wrote:

> Out of curiosity, if you do have an allocation and run on it, does it work?
>

This is a TORQUE-managed cluster and configure didn't find TM headers/libs.
So, I didn't even consider trying inside an allocation.
I will build with the necessary --with-tm and see if running inside an
allocation works.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Ralph Castain

On Jan 10, 2014, at 11:04 AM, Paul Hargrove  wrote:

> On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove  wrote:
> 
> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain  wrote:
> ??? that was it? Was this built with --enable-debug?
> 
> Nope, I missed --enable-debug.  Will try again.
> 
> 
> OK, Take-2 below.
> There is an obvious "recipient list is empty!" in the output.

That one is correct and expected - all it means is that you are running on only 
one node, so mpirun doesn't need to relay messages to another daemon

> 
> -Paul
> 
> $ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca 
> orte_nidmap_verbose 10 examples/ring_c'
> [cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set 
> priority to 10
> [cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
> [cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
> [cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
> [cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is 
> empty!
> [cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
> [cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
> [cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE 2
> [cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set 
> priority to 10
> [cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
> [cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not 
> found in file 
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>  at line 503
> [cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set 
> priority to 10
> [cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
> [cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not 
> found in file 
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>  at line 503


This is very weird - it appears that your procs are looking for hostname data 
prior to receiving the necessary data. Let's try jacking up the debug, I guess 
- add "-mca state_base_verbose 5 -mca plm_base_verbose 5 -mca odls_base_verbose 
5"

Sorry that will be rather wordy, but I don't understand the ordering you show 
above. It's like your procs are skipping a bunch of steps in the startup 
procedure.

Out of curiosity, if you do have an allocation and run on it, does it work?

> 
>  
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Paul H build on Solaris

2014-01-10 Thread Paul Hargrove
On Thu, Jan 9, 2014 at 12:35 PM, Jeff Squyres (jsquyres)  wrote:

> Thanks.  We're just going to change the test in the usnic BTL to be
> explicit about only building on 64 bit Linux.
>

Last night's trunk did NOT try to build btl:usnic on Solaris.
So, this issue looks to be resolved in trunk.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
On Fri, Jan 10, 2014 at 10:41 AM, Paul Hargrove  wrote:

>
> On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain  wrote:
>
>> ??? that was it? Was this built with --enable-debug?
>
>
> Nope, I missed --enable-debug.  Will try again.
>
>
OK, Take-2 below.
There is an obvious "recipient list is empty!" in the output.

-Paul

$ mpirun -mca btl sm,self -np 2 -mca grpcomm_base_verbose 5 -mca
orte_nidmap_verbose 10 examples/ring_c'
[cvrsvc01:21200] mca:base:select:(grpcomm) Querying component [bad]
[cvrsvc01:21200] mca:base:select:(grpcomm) Query of component [bad] set
priority to 10
[cvrsvc01:21200] mca:base:select:(grpcomm) Selected component [bad]
[cvrsvc01:21200] [[45961,0],0] grpcomm:base:receive start comm
[cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
[cvrsvc01:21200] [[45961,0],0] grpcomm:bad:xcast sent to job [45961,0] tag 1
[cvrsvc01:21200] [[45961,0],0] grpcomm:xcast:recv: with 1135 bytes
[cvrsvc01:21200] [[45961,0],0] orte:daemon:send_relay - recipient list is
empty!
[cvrsvc01:21200] [[45961,0],0] orte:util:encode_nidmap
[cvrsvc01:21200] [[45961,0],0] orte:util:build:daemon:nidmap packed 55 bytes
[cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 0
[cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE
2
[cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 1
[cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE
2
[cvrsvc01:21200] [[45961,0],0] PROGRESSING COLL id 2
[cvrsvc01:21200] [[45961,0],0] ALL LOCAL PROCS FOR JOB [45961,1] CONTRIBUTE
2
[cvrsvc01:21202] mca:base:select:(grpcomm) Querying component [bad]
[cvrsvc01:21202] mca:base:select:(grpcomm) Query of component [bad] set
priority to 10
[cvrsvc01:21202] mca:base:select:(grpcomm) Selected component [bad]
[cvrsvc01:21202] [[45961,1],0] grpcomm:base:receive start comm
[cvrsvc01:21202] [[45961,1],0] ORTE_ERROR_LOG: Data for specified key not
found in file
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
at line 503
[cvrsvc01:21203] mca:base:select:(grpcomm) Querying component [bad]
[cvrsvc01:21203] mca:base:select:(grpcomm) Query of component [bad] set
priority to 10
[cvrsvc01:21203] mca:base:select:(grpcomm) Selected component [bad]
[cvrsvc01:21203] [[45961,1],1] grpcomm:base:receive start comm
[cvrsvc01:21203] [[45961,1],1] ORTE_ERROR_LOG: Data for specified key not
found in file
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64-gcc/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
at line 503


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
On Fri, Jan 10, 2014 at 10:08 AM, Ralph Castain  wrote:

> ??? that was it? Was this built with --enable-debug?


Nope, I missed --enable-debug.  Will try again.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Paul Hargrove
On Fri, Jan 10, 2014 at 7:49 AM, Jeff Squyres (jsquyres)  wrote:

> Paul --
>
> The output from configure looks ok to me.  We're testing for the various
> capabilities of the fortran compiler that we need, most of which have been
> around for quite a while.  One Big New Thing that isn't around yet is the
> type(*), dimension(..) notation, which no fortran compiler supports yet.
>  But *most* of the other new MPI-3 Fortran behavior has been around since
> F2003 (maybe earlier? I'm no expert).  (I glossed over a few details here,
> but you get the point)
>
> Hence, it's not entirely surprising to me that we're determining that an
> "old" compiler is ok to build the mpi_f08 module.
>

OMPI's configure says pgi-8.0 and pgi-9.0 are "good".
But pgi-10.0 is rejected without even subjecting it to the tests.
This situation (8.0 and 9.0 "better" than 10.0) sounds fishy to me.


>
> Can you send the output from what happens when you try to build?  (or did
> I miss that in another post?)
>
>
You didn't miss anything because I was focused on the idea that mpi_f08
shouldn't even have been attempted on these compilers.  See below for the
pgi-9.0 error messages.  8.0 was similar but output has been lost (scratch
f/s expiry).

-Paul

Making all in mpi/fortran/base/
make[2]: Entering directory
`/global/u1/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/BLD/ompi/mpi/fortran/base'
  CC   libmpi_fortran_base_la-attr_fn_f.lo
  CC   libmpi_fortran_base_la-conversion_fn_null_f.lo
  CC   libmpi_fortran_base_la-f90_accessors.lo
  CC   libmpi_fortran_base_la-strings.lo
  CC   libmpi_fortran_base_la-test_constants_f.lo
  CCLD libmpi_fortran_base.la
  PPFC mpi-f08-types.lo
PGF90-S-0034-Syntax error at or near =
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
96)
PGF90-W-0119-Redundant specification for protected
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
97)
PGF90-S-0034-Syntax error at or near =
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
97)
PGF90-S-0037-Contradictory data type specified for protected
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
99)
PGF90-S-0034-Syntax error at or near =
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
99)
PGF90-S-0037-Contradictory data type specified for protected
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
101)
PGF90-S-0034-Syntax error at or near =
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
101)
PGF90-S-0037-Contradictory data type specified for protected
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
102)
PGF90-S-0034-Syntax error at or near =
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
102)
PGF90-S-0037-Contradictory data type specified for protected
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
104)
PGF90-S-0034-Syntax error at or near =
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
104)
PGF90-S-0037-Contradictory data type specified for protected
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
106)
PGF90-S-0034-Syntax error at or near =
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
106)
PGF90-S-0037-Contradictory data type specified for protected
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
108)
PGF90-S-0037-Contradictory data type specified for protected
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
109)
PGF90-S-0034-Syntax error at or near =
(/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-9.0/openmpi-1.7.4rc2r30168/ompi/mpi/fortran/base/mpi-f08-types.F90:
109)
PGF90-S-0037-Contradictory data type specified for protected

Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Ralph Castain
??? that was it? Was this built with --enable-debug?


On Jan 10, 2014, at 10:03 AM, Paul Hargrove  wrote:

> 
> 
> 
> On Fri, Jan 10, 2014 at 7:12 AM, Ralph Castain  wrote:
> Very strange. Try adding "-mca grpcomm_base_verbose 5 -mca 
> orte_nidmap_verbose 10" to your cmd line with the trunk version and let's see 
> what may be happening
> 
> Most of my systems don't have new enough autotools to work from svn.
> If it is critical I could set up to rsync from one of my systems that *can* 
> autogen.
> 
> So, this is from last night's trunk tarball (1.9a1r30215):
> 
> $ mpirun -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 -np 1 
> examples/ring_c 2>&1 | tee log
> [cvrsvc01:29185] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:29185] mca:base:select:(grpcomm) Query of component [bad] set 
> priority to 10
> [cvrsvc01:29185] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:29188] mca:base:select:(grpcomm) Querying component [bad]
> [cvrsvc01:29188] mca:base:select:(grpcomm) Query of component [bad] set 
> priority to 10
> [cvrsvc01:29188] mca:base:select:(grpcomm) Selected component [bad]
> [cvrsvc01:29188] [[37720,1],0] ORTE_ERROR_LOG: Data for specified key not 
> found in file 
> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
>  at line 503
> 
> 
> 
> Any chance of library confusion here?
> 
> I just verified using /proc//maps on the hung orterun and ring_c 
> processes that the only shared libs mapped in are the systems ones in /lib64 
> and the ones from the fresh install of Open MPI.  No stale libs from old OMPI 
> builds.
> 
> -Paul
> 
>  
> 
> On Jan 9, 2014, at 9:57 PM, Paul Hargrove  wrote:
> 
>> The problem is seen with both the trunk and the 1.7.4rc tarball.
>> 
>> -Paul
>> 
>> 
>> On Thu, Jan 9, 2014 at 9:23 PM, Paul Hargrove  wrote:
>> 
>> On Thu, Jan 9, 2014 at 8:56 PM, Paul Hargrove  wrote:
>> I'll try a gcc-based build on one of the systems ASAP.
>> 
>> Sorry, Ralph:  the failure remains when built w/ gcc.
>> Let me know what to try next and I'll give it a shot.
>> 
>> -Paul
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> 
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
On Fri, Jan 10, 2014 at 7:12 AM, Ralph Castain  wrote:

> Very strange. Try adding "-mca grpcomm_base_verbose 5 -mca
> orte_nidmap_verbose 10" to your cmd line with the trunk version and let's
> see what may be happening
>

Most of my systems don't have new enough autotools to work from svn.
If it is critical I could set up to rsync from one of my systems that *can*
autogen.

So, this is from last night's trunk tarball (1.9a1r30215):

$ mpirun -mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 10 -np 1
examples/ring_c 2>&1 | tee log
[cvrsvc01:29185] mca:base:select:(grpcomm) Querying component [bad]
[cvrsvc01:29185] mca:base:select:(grpcomm) Query of component [bad] set
priority to 10
[cvrsvc01:29185] mca:base:select:(grpcomm) Selected component [bad]
[cvrsvc01:29188] mca:base:select:(grpcomm) Querying component [bad]
[cvrsvc01:29188] mca:base:select:(grpcomm) Query of component [bad] set
priority to 10
[cvrsvc01:29188] mca:base:select:(grpcomm) Selected component [bad]
[cvrsvc01:29188] [[37720,1],0] ORTE_ERROR_LOG: Data for specified key not
found in file
/global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-trunk-linux-x86_64/openmpi-1.9a1r30215/orte/runtime/orte_globals.c
at line 503



> Any chance of library confusion here?
>

I just verified using /proc//maps on the hung orterun and ring_c
processes that the only shared libs mapped in are the systems ones in
/lib64 and the ones from the fresh install of Open MPI.  No stale libs from
old OMPI builds.

-Paul



>
> On Jan 9, 2014, at 9:57 PM, Paul Hargrove  wrote:
>
> The problem is seen with both the trunk and the 1.7.4rc tarball.
>
> -Paul
>
>
> On Thu, Jan 9, 2014 at 9:23 PM, Paul Hargrove  wrote:
>
>>
>> On Thu, Jan 9, 2014 at 8:56 PM, Paul Hargrove  wrote:
>>
>>> I'll try a gcc-based build on one of the systems ASAP.
>>
>>
>> Sorry, Ralph:  the failure remains when built w/ gcc.
>> Let me know what to try next and I'll give it a shot.
>>
>> -Paul
>>
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>  ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] callback debugging

2014-01-10 Thread Ralph Castain

On Jan 10, 2014, at 8:02 AM, Adrian Reber  wrote:

> I am currently trying to understand how callbacks are working. Right now
> I am looking at orte/mca/rml/base/rml_base_receive.c
> orte_rml_base_comm_start() which does 
> 
>orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD,
>ORTE_RML_TAG_RML_INFO_UPDATE,
>ORTE_RML_PERSISTENT,
>orte_rml_base_recv,
>NULL);
> 
> As far as I understand it orte_rml_base_recv() is the callback function.
> At which point should this function run? When the data is actually
> received?

Not precisely. When data is received by the OOB, it pushes the data into an 
event. When that event gets serviced, it calls the orte_rml_base_receive 
function which processes the data to find the matching tag, and then uses that 
to execute the callback to the user code.

> 
> The same for send_buffer_nb() functions. I do not see the callback
> functions actually running. How can I verify that the callback functions
> are running. Especially for the send case it sounds pretty obvious how
> it should work but I never see the callback function running. At least
> in my setup.

The data is not immediately sent. It gets pushed into an event. When that event 
gets serviced, it calls the orte_oob_base_send function which then passes the 
data to each active OOB component until one of them says it can send it. The 
data is then pushed into another event to get it into the event base for that 
component's active module - when that event gets serviced, the data is sent. 
Once the data is sent, an event is created that, when serviced, executes the 
callback to the user code.

If you aren't seeing callbacks, the most likely cause is that the orte progress 
thread isn't running. Without it, none of this will work.
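
For reference, a minimal sketch of that pattern looks like the following
(the prototypes are approximations of the ORTE RML interface trimmed for
illustration, and MY_TAG is a hypothetical tag value - check
orte/mca/rml/rml.h for the real signatures):

/* Sketch only: both callbacks run from the ORTE progress thread when the
 * corresponding event is serviced, never inline from the posting call. */
static void my_recv_cb(int status, orte_process_name_t *peer,
                       opal_buffer_t *buffer, orte_rml_tag_t tag,
                       void *cbdata)
{
    /* fires only after the OOB-received data has been matched to MY_TAG */
}

static void my_send_cb(int status, orte_process_name_t *peer,
                       opal_buffer_t *buffer, orte_rml_tag_t tag,
                       void *cbdata)
{
    /* fires only after the OOB module has actually pushed the data out */
}

static void example_post(orte_process_name_t *peer, opal_buffer_t *buffer)
{
    orte_rml.recv_buffer_nb(ORTE_NAME_WILDCARD, MY_TAG,
                            ORTE_RML_PERSISTENT, my_recv_cb, NULL);
    orte_rml.send_buffer_nb(peer, buffer, MY_TAG, my_send_cb, NULL);
}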

> 
>   Adrian
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [hwloc-devel] Use of <malloc.h>

2014-01-10 Thread Jeff Squyres (jsquyres)
Sweet; thanks.

On Jan 10, 2014, at 12:25 PM, Brice Goglin  wrote:

> Looks like we're good.
> Brice
> 
> 
> 
> Le 10/01/2014 18:05, Jeff Squyres (jsquyres) a écrit :
>> K, will do.
>> 
>> On Jan 10, 2014, at 12:00 PM, Brice Goglin 
>> wrote:
>> 
>>> Push it to master, we'll see what regression testing at 
>>> https://ci.inria.fr/hwloc/job/master-1-check/ thinks about it
>>> Brice
>>> 
>>> 
>>> 
>>> "Jeff Squyres (jsquyres)"  a écrit :
>>> Brice / Samuel --
>>> 
>>> In http://www.open-mpi.org/community/lists/devel/2014/01/13619.php, Paul 
>>> Hargrove found this compiler warning:
>>> 
>>> -
>>> On OpenBSD the header malloc.h exists, but is NOT intended to be used:
>>> -bash-4.2$ grep -B2 'is obsolete' make.log 
>>> CC   bind.lo
>>> In file included from 
>>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/opal/mca/hwloc/hwloc172/hwloc/src/bind.c:17:
>>> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
>>> -
>>> 
>>> What do you think of this patch (or something like it)?
>>> 
>>> diff --git a/src/bind.c b/src/bind.c
>>> index 046b7cf..37921bc 100644
>>> --- a/src/bind.c
>>> +++ b/src/bind.c
>>> @@ -13,8 +13,9 @@
>>> #ifdef HAVE_SYS_MMAN_H
>>> #  include <sys/mman.h>
>>> #endif
>>> -#ifdef HAVE_MALLOC_H
>>> -#  include <malloc.h>
>>> +/* <malloc.h> is only needed if we don't have posix_memalign() */
>>> +#if defined(hwloc_getpagesize) && !defined(HAVE_POSIX_MEMALIGN) && defined(HAVE_MEMALIGN) && defined(HAVE_MALLOC_H)
>>> +#include <malloc.h>
>>> #endif
>>> #ifdef HAVE_UNISTD_H
>>> #include <unistd.h>
>>> 
>>> ___
>>> hwloc-devel mailing list
>>> hwloc-de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>> 
> 
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-devel] Use of <malloc.h>

2014-01-10 Thread Brice Goglin
Looks like we're good.
Brice



Le 10/01/2014 18:05, Jeff Squyres (jsquyres) a écrit :
> K, will do.
>
> On Jan 10, 2014, at 12:00 PM, Brice Goglin 
>  wrote:
>
>> Push it to master, we'll see what regression testing at 
>> https://ci.inria.fr/hwloc/job/master-1-check/ thinks about it
>> Brice
>>
>>
>>
>> "Jeff Squyres (jsquyres)"  a écrit :
>> Brice / Samuel --
>>
>> In http://www.open-mpi.org/community/lists/devel/2014/01/13619.php, Paul 
>> Hargrove found this compiler warning:
>>
>> -
>> On OpenBSD the header malloc.h exists, but is NOT intended to be used:
>> -bash-4.2$ grep -B2 'is obsolete' make.log 
>> CC   bind.lo
>> In file included from 
>> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/opal/mca/hwloc/hwloc172/hwloc/src/bind.c:17:
>> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
>> -
>>
>> What do you think of this patch (or something like it)?
>>
>> diff --git a/src/bind.c b/src/bind.c
>> index 046b7cf..37921bc 100644
>> --- a/src/bind.c
>> +++ b/src/bind.c
>> @@ -13,8 +13,9 @@
>> #ifdef HAVE_SYS_MMAN_H
>> #  include <sys/mman.h>
>> #endif
>> -#ifdef HAVE_MALLOC_H
>> -#  include <malloc.h>
>> +/* <malloc.h> is only needed if we don't have posix_memalign() */
>> +#if defined(hwloc_getpagesize) && !defined(HAVE_POSIX_MEMALIGN) && defined(HAVE_MEMALIGN) && defined(HAVE_MALLOC_H)
>> +#include <malloc.h>
>> #endif
>> #ifdef HAVE_UNISTD_H
>> #include <unistd.h>
>>
>> ___
>> hwloc-devel mailing list
>> hwloc-de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel
>



Re: [OMPI devel] 1.7.4rc2r30168 - misc harmless *BSD warnings

2014-01-10 Thread Jeff Squyres (jsquyres)
Fixed all of these except:

- pushed hwloc fix upstream and waiting for equivalent of hwloc MTT testing to 
see how it fares
- we try not to edit ROMIO since it comes from upstream (i.e., we tolerate 
warnings in there)
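
FWIW, the CACHE_LINE_SIZE clash Paul reports below is the usual
guard-before-define problem; the generic shape of that kind of fix is
shown here (a sketch only - not necessarily the exact change that was
committed, and the 128 is a hypothetical fallback value):

/* <sys/param.h> on the BSDs already defines CACHE_LINE_SIZE, so only
 * supply our own value when the system headers have not provided one. */
#ifndef CACHE_LINE_SIZE
#define CACHE_LINE_SIZE 128
#endif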


On Jan 9, 2014, at 3:48 AM, Paul Hargrove  wrote:

> Some minor misc warnings from the current 1.7.4rc tarball:
> 
> On both FreeBSD and NetBSD, the symbol CACHE_LINE_SIZE used in 
> ompi/mca/bcol/basesmuma/ appears to clash with a system-defined symbol.
> FreeBSD-9/x86-64:
>   CC   bcol_basesmuma_component.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7.4rc2r30168/ompi/mca/bcol/basesmuma/bcol_basesmuma_component.c:25:
> /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7.4rc2r30168/ompi/mca/bcol/basesmuma/bcol_basesmuma.h:51:1:
>  warning: "CACHE_LINE_SIZE" redefined
> In file included from /usr/include/sys/param.h:131,
>  from 
> /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7.4rc2r30168/opal/include/opal_config_bottom.h:366,
>  from ../../../../opal/include/opal_config.h:2518,
>  from ../../../../ompi/include/ompi_config.h:28,
>  from 
> /home/phargrov/OMPI/openmpi-1.7-latest-freebsd9-amd64/openmpi-1.7.4rc2r30168/ompi/mca/bcol/basesmuma/bcol_basesmuma_component.c:16:
> /usr/include/machine/param.h:89:1: warning: this is the location of the 
> previous definition
> NetBSD-5/x86:
>   CC   bcol_basesmuma_component.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-netbsd6-i386/openmpi-1.7.4rc2r30168/ompi/mca/bcol/basesmuma/bcol_basesmuma_component.c:25:0:
> /home/phargrov/OMPI/openmpi-1.7-latest-netbsd6-i386/openmpi-1.7.4rc2r30168/ompi/mca/bcol/basesmuma/bcol_basesmuma.h:51:0:
>  warning: "CACHE_LINE_SIZE" redefined
> /usr/include/sys/param.h:184:0: note: this is the location of the previous 
> definition
> 
> 
> On OpenBSD the header malloc.h exists, but is NOT intended to be used:
> -bash-4.2$ grep -B2 'is obsolete' make.log 
>   CC   bind.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/opal/mca/hwloc/hwloc172/hwloc/src/bind.c:17:
> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
> --
>   CC   base/mpool_base_frame.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/ompi/mca/mpool/base/mpool_base_frame.c:28:
> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
> --
>   CC   base/mpool_base_lookup.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/ompi/mca/mpool/base/mpool_base_lookup.c:29:
> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
> --
>   CC   adio/common/malloc.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/ompi/mca/io/romio/romio/adio/common/malloc.c:24:
> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
> --
>   CC   mpool_grdma_module.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/ompi/mca/mpool/grdma/mpool_grdma_module.c:34:
> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
> --
>   CC   mpool_grdma_component.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/ompi/mca/mpool/grdma/mpool_grdma_component.c:34:
> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [hwloc-devel] Use of <malloc.h>

2014-01-10 Thread Jeff Squyres (jsquyres)
K, will do.

On Jan 10, 2014, at 12:00 PM, Brice Goglin 
 wrote:

> Push it to master, we'll see what regression testing at 
> https://ci.inria.fr/hwloc/job/master-1-check/ thinks about it
> Brice
> 
> 
> 
> "Jeff Squyres (jsquyres)"  a écrit :
> Brice / Samuel --
> 
> In http://www.open-mpi.org/community/lists/devel/2014/01/13619.php, Paul 
> Hargrove found this compiler warning:
> 
> -
> On OpenBSD the header malloc.h exists, but is NOT intended to be used:
> -bash-4.2$ grep -B2 'is obsolete' make.log 
> CC   bind.lo
> In file included from 
> /home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/opal/mca/hwloc/hwloc172/hwloc/src/bind.c:17:
> /usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
> -
> 
> What do you think of this patch (or something like it)?
> 
> diff --git a/src/bind.c b/src/bind.c
> index 046b7cf..37921bc 100644
> --- a/src/bind.c
> +++ b/src/bind.c
> @@ -13,8 +13,9 @@
> #ifdef HAVE_SYS_MMAN_H
> #  include <sys/mman.h>
> #endif
> -#ifdef HAVE_MALLOC_H
> -#  include <malloc.h>
> +/* <malloc.h> is only needed if we don't have posix_memalign() */
> +#if defined(hwloc_getpagesize) && !defined(HAVE_POSIX_MEMALIGN) && defined(HAVE_MEMALIGN) && defined(HAVE_MALLOC_H)
> +#include <malloc.h>
> #endif
> #ifdef HAVE_UNISTD_H
> #include <unistd.h>
> 
> ___
> hwloc-devel mailing list
> hwloc-de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RFC: OB1 optimizations

2014-01-10 Thread Nathan Hjelm

Looks like it slowed down by about 20ns from the original patch. That is
to be expected when latencies are this low. Results for the following
are attached:

 - Trunk r30215 sm and vader results for osu_latency.
 - Trunk r30215 + patch take3 for both sm and vader.
 - Trunk r30215 + patch + forced 16 byte match header for vader.

The last one is not completely surprising. The current match header is
14 bytes which means the memcpy for the data is not aligned for a 64-bit
architecture. Might be worth looking at bumping the match header size up
as another optimization.
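
To make that concrete, a padded header would look roughly like this
(hypothetical layout and field names - not the actual OB1 match header
definition):

#include <stdint.h>

/* Padding a 14-byte match header out to 16 bytes keeps the payload that
 * follows it 8-byte aligned, so the data memcpy can use aligned accesses
 * on 64-bit architectures. */
struct example_match_hdr {
    uint8_t  hdr_type;      /* fragment type                    */
    uint8_t  hdr_flags;
    uint16_t hdr_ctx;       /* communicator context id          */
    int32_t  hdr_src;       /* source rank                      */
    int32_t  hdr_tag;       /* MPI tag                          */
    uint16_t hdr_seq;       /* per-peer sequence number         */
    uint8_t  hdr_pad[2];    /* explicit padding: 14 -> 16 bytes */
};
/* user data copied right after the header now starts on a 16-byte
 * boundary instead of at offset 14 */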

-Nathan

On Fri, Jan 10, 2014 at 02:24:19PM +0100, George Bosilca wrote:
> Nathan,
> 
> When you get access to the machine it might be interesting to show not only 
> the after-patch performance but also what the trunk is getting on the same 
> architecture.
> 
>   George.
> 
> On Jan 8, 2014, at 18:09 , Nathan Hjelm  wrote:
> 
> > Yeah. Its hard to say what the results will look like on Haswell. I
> > expect they should show some improvement from George's change but we
> > won't know until I can get to a Haswell node. Hopefully one becomes
> > available today.
> > 
> > -Nathan
> > 
> > On Wed, Jan 08, 2014 at 08:59:34AM -0800, Paul Hargrove wrote:
> >>   Nevermind, since Nathan just clarified that the results are not
> >>   comparable.
> >> 
> >>   -Paul [Sent from my phone]
> >> 
> >>   On Jan 8, 2014 8:58 AM, "Paul Hargrove"  wrote:
> >> 
> >> Interestingly enough the 4MB latency actually improved significantly
> >> relative to the initial numbers.
> >> 
> >> -Paul [Sent from my phone]
> >> 
> >> On Jan 8, 2014 8:50 AM, "George Bosilca"  wrote:
> >> 
> >>   These results are way worse than the ones you sent in your previous
> >>   email? What is the reason?
> >> 
> >> George.
> >> 
> >>   On Jan 8, 2014, at 17:33 , Nathan Hjelm  wrote:
> >> 
> >>> Ah, good catch. A new version is attached that should eliminate the race
> >>> window for the multi-threaded case. Performance numbers are still
> >>> looking really good. We beat mvapich2 in the small-message ping-pong by
> >>> a good margin. See the results below. The latency difference for large
> >>> messages is probably due to a difference in the max send size for vader
> >>> vs mvapich.
> >>> 
> >>> To answer Pasha's question: I don't see a noticeable difference in
> >>> performance for btl's with no sendi function (this includes
> >>> ugni). OpenIB should get a boost. I will test that once I get an
> >>> allocation.
> >>> 
> >>> CPU: Xeon E5-2670 @ 2.60 GHz
> >>> 
> >>> Open MPI (-mca btl vader,self):
> >>> # OSU MPI Latency Test v4.1
> >>> # Size  Latency (us)
> >>> 0   0.17
> >>> 1   0.19
> >>> 2   0.19
> >>> 4   0.19
> >>> 8   0.19
> >>> 16  0.19
> >>> 32  0.19
> >>> 64  0.40
> >>> 128 0.40
> >>> 256 0.43
> >>> 512 0.52
> >>> 1024    0.67
> >>> 2048    0.94
> >>> 4096    1.44
> >>> 8192    2.04
> >>> 16384   3.47
> >>> 32768   6.10
> >>> 65536   9.38
> >>> 131072 16.47
> >>> 262144 29.63
> >>> 524288 54.81
> >>> 1048576   106.63
> >>> 2097152   206.84
> >>> 4194304   421.26
> >>> 
> >>> 
> >>> mvapich2 1.9:
> >>> # OSU MPI Latency Test
> >>> # SizeLatency (us)
> >>> 0 0.23
> >>> 1 0.23
> >>> 2 0.23
> >>> 4 0.23
> >>> 8 0.23
> >>> 160.28
> >>> 320.28
> >>> 640.39
> >>> 128   0.40
> >>> 256   0.40
> >>> 512   0.42
> >>> 1024  0.51
> >>> 2048  0.71
> >>> 4096  1.02
> >>> 8192  1.60
> >>> 16384 3.47
> >>> 32768 5.05
> >>> 65536 8.06
> >>> 131072   14.82
> >>> 262144   28.15
> >>> 524288   53.69
> >>> 1048576 127.47
> >>> 2097152 235.58
> >>> 4194304 683.90
> >>> 
> >>> 
> >>> -Nathan
> >>> 
> >>> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
>   The local request is not correctly released, leading to assert in debug
>   mode. This is because you avoid calling MCA_PML_BASE_RECV_REQUEST_FINI,
>   fact that leaves the request in an ACTIVE state, condition carefully
> 

Re: [OMPI devel] hcoll destruction via MPI attribute

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 10, 2014, at 10:57 AM, George Bosilca  wrote:

> This is not the same example as before. This example is correct, it does not 
> rely on the send being eagerly completed.

I know.  :-)

Just to tie up this thread for the web archives:

>> My point (which I guess I didn't make well) is that COMM_FREE is collective, 
>> which we all know does not necessarily mean synchronizing.  If hcoll 
>> teardown is going to add synchronization, there could be situations that 
>> might be dangerous (if OMPI doesn't already synchronize during COMM_FREE, 
>> which is why I asked the question).


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[hwloc-devel] Use of <malloc.h>

2014-01-10 Thread Jeff Squyres (jsquyres)
Brice / Samuel --

In http://www.open-mpi.org/community/lists/devel/2014/01/13619.php, Paul 
Hargrove found this compiler warning:

-
On OpenBSD the header malloc.h exists, but is NOT intended to be used:
-bash-4.2$ grep -B2 'is obsolete' make.log 
  CC   bind.lo
In file included from 
/home/phargrov/OMPI/openmpi-1.7-latest-openbsd5-amd64/openmpi-1.7.4rc2r30168/opal/mca/hwloc/hwloc172/hwloc/src/bind.c:17:
/usr/include/malloc.h:4:2: warning: #warning "<malloc.h> is obsolete, use <stdlib.h>"
-

What do you think of this patch (or something like it)?

diff --git a/src/bind.c b/src/bind.c
index 046b7cf..37921bc 100644
--- a/src/bind.c
+++ b/src/bind.c
@@ -13,8 +13,9 @@
 #ifdef HAVE_SYS_MMAN_H
 #  include <sys/mman.h>
 #endif
-#ifdef HAVE_MALLOC_H
-#  include <malloc.h>
+/* <malloc.h> is only needed if we don't have posix_memalign() */
+#if defined(hwloc_getpagesize) && !defined(HAVE_POSIX_MEMALIGN) && defined(HAVE_MEMALIGN) && defined(HAVE_MALLOC_H)
+#include <malloc.h>
 #endif
 #ifdef HAVE_UNISTD_H
 #include <unistd.h>


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc2r30168 - PGI F08 failure

2014-01-10 Thread Jeff Squyres (jsquyres)
Paul --

The output from configure looks ok to me.  We're testing for the various 
capabilities of the fortran compiler that we need, most of which have been 
around for quite a while.  One Big New Thing that isn't around yet is the 
type(*), dimension(..) notation, which no fortran compiler supports yet.  But 
*most* of the other new MPI-3 Fortran behavior has been around since F2003 
(maybe earlier? I'm no expert).  (I glossed over a few details here, but you 
get the point)  

Hence, it's not entirely surprising to me that we're determining that an "old" 
compiler is ok to build the mpi_f08 module.

Can you send the output from what happens when you try to build?  (or did I 
miss that in another post?)


On Jan 9, 2014, at 7:35 PM, Paul Hargrove  wrote:

> My attempts to build the current 1.7.4rc tarball with versions 8.0 and 9.0 of 
> the Portland Group compilers fails miserably on compilation of 
> mpi-f08-types.F90.
> 
> I am sort of surprised by the attempt to build Fortran 2008 support w/ such 
> old compilers.
> I think that fact itself is the real bug here, right? 
> 
> With pgi-10.0 I see configure say:
> checking if building Fortran 'use mpi' bindings... yes
> checking if building Fortran 'use mpi_f08' bindings... no
> 
> But pgi-8.0 and 9.0 both get identified as "good" compilers.
> 
> pgi-9.0:
> checking if building Fortran 'use mpi' bindings... yes
> checking if Fortran compiler supports BIND(C)... yes
> checking if Fortran compiler supports BIND(C) with LOGICAL params... yes
> checking if Fortran compiler supports optional arguments... yes
> checking if Fortran compiler supports private... no
> checking if Fortran compiler supports abstract... yes
> checking if Fortran compiler supports asynchronous... no
> checking if Fortran compiler supports procedure... no
> checking size of Fortran type(test_mpi_handle)... 4
> checking Fortran compiler F08 assumed rank syntax... not cached; checking
> checking for Fortran compiler support of TYPE(*), DIMENSION(..)... no
> checking Fortran compiler F08 assumed rank syntax... no
> checking which mpi_f08 implementation to build... "good" compiler, no array 
> subsections
> configure: WARNING: Temporary development override: forcing the use of F08 
> wrappers
> checking if building Fortran 'use mpi_f08' bindings... yes
> 
> pgi-8.0 (almost, but not quite, the same):
> checking if building Fortran 'use mpi' bindings... yes
> checking if Fortran compiler supports BIND(C)... yes
> checking if Fortran compiler supports BIND(C) with LOGICAL params... yes
> checking if Fortran compiler supports optional arguments... yes
> checking if Fortran compiler supports private... no
> checking if Fortran compiler supports abstract... no
> checking if Fortran compiler supports asynchronous... no
> checking if Fortran compiler supports procedure... no
> checking size of Fortran type(test_mpi_handle)... 4
> checking Fortran compiler F08 assumed rank syntax... not cached; checking
> checking for Fortran compiler support of TYPE(*), DIMENSION(..)... no
> checking Fortran compiler F08 assumed rank syntax... no
> checking which mpi_f08 implementation to build... "good" compiler, no array 
> subsections
> configure: WARNING: Temporary development override: forcing the use of F08 
> wrappers
> checking if building Fortran 'use mpi_f08' bindings... yes
> 
> The bzip2-compressed config.log files for pgi-8.0 and 9.0 are attached.
> 
> -Paul 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc2r30168 - usnic warning w/ icc

2014-01-10 Thread Jeff Squyres (jsquyres)
Fixed; thanks.

On Jan 9, 2014, at 8:27 PM, Paul Hargrove  wrote:

> 
> I believe the following means that the compiler has determined that the two 
> named variables DO NOT actually get initialized to NULL as written.  However, 
> it looks like their initialization is not required, as each is set before it 
> is read.
> 
>   CC   btl_usnic_component.lo
> /scratch/scratchdirs/hargrove/OMPI/openmpi-1.7-latest-linux-x86_64-icc-13/openmpi-1.7.4rc2r30168/ompi/mca/btl/usnic/btl_usnic_component.c(1391):
>  warning #589: transfer of control bypasses initialization of:
> variable "ssfrag" (declared at line 1392)
> variable "lsfrag" (declared at line 1393)
>   switch (frag->uf_type) {
>   ^
> 
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] 1.7.4

2014-01-10 Thread Ralph Castain
Hi folks

If you've been following all the email on this list, you know that we are still 
working on resolving portability issues with 1.7.4. We obviously will not meet 
our milestone of releasing it today :-(

I'm hoping the delay will only last a week, and thus won't impact 1.7.5 too 
much. The oshmem code is the only major code change in that cycle, and we are 
wringing some of those problems out now as people test the trunk as well as 
1.7.4, so hopefully that will be able to flow a little faster.

Anyway, just an update. Appreciate everyone's help on shaking down 1.7.4
Ralph



Re: [OMPI devel] shared lib version on trunk

2014-01-10 Thread marco atzeri

Il 1/10/2014 3:50 PM, Jeff Squyres (jsquyres) ha scritto:

On Jan 10, 2014, at 9:48 AM, marco atzeri <> wrote:


building openmpi-1.9a1r30128-1, I notice
-
# Version information for libmpi.
current=0
age=0
revision=0
--

while on 1.7.3 is
--
# Version information for libmpi.
current=3
age=2
revision=0
--

Is this intentional ?


Yes.  We keep it 0/0/0 on the trunk (since the trunk is for developers only, we 
don't really need to care about ABI issues there), and only update the versions 
on the release branch more-or-less immediately before individual releases.



nice to know, I was wondering if something else went wrong.






Re: [OMPI devel] hcoll destruction via MPI attribute

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 10, 2014, at 10:04 AM, George Bosilca  wrote:

>> MPI Comm comm;
>> // comm is setup as an hcoll-enabled communicator
>> if (rank == x) {
>>   MPI_Send(..., y, tag, MPI_COMM_WORLD);
>>   MPI_Comm_free(comm);
>> } else if (rank == y) {
>>   MPI_Comm_free(comm);
>>   MPI_Recv(..., x, tag, MPI_COMM_WORLD);
>> }
> 
> Based on today’s MPI standard this code is incorrect as the MPI_Comm_free is 
> collective, and you can’t have matching blocking communications crossing a 
> collective line.


I don't know exactly what you mean by "crossing a collective line", but 
communicating in different communicators while a different collective is 
occurring is certainly valid.  I.e., this is valid (and won't deadlock):

-
MPI Comm comm;
// comm is setup as an hcoll-enabled communicator
MPI_Barrier(comm);
if (rank == x) {
  MPI_Send(..., y, tag, MPI_COMM_WORLD);
  MPI_Comm_free(comm);
} else if (rank == y) {
  MPI_Recv(..., x, tag, MPI_COMM_WORLD);
  MPI_Comm_free(comm);
} else {
  MPI_Comm_free(comm);
}
-

My point (which I guess I didn't make well) is that COMM_FREE is collective, 
which we all know does not necessarily mean synchronizing.  If hcoll teardown 
is going to add synchronization, there could be situations that might be 
dangerous (if OMPI doesn't already synchronize during COMM_FREE, which is why I 
asked the question).

Sorry if I just muddled the conversation...  :-\
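
For the archives, the "destruction via MPI attribute" mechanism being
discussed boils down to a keyval delete callback on the communicator,
along these lines (a simplified sketch; hcoll_context_destroy is a
placeholder name, not the real hcoll hook):

#include <mpi.h>

/* The delete callback runs inside MPI_Comm_free on each process that
 * frees the communicator - which is exactly why any synchronization it
 * performs raises the ordering questions above. */
static int hcoll_delete_cb(MPI_Comm comm, int keyval,
                           void *attr_val, void *extra_state)
{
    /* placeholder: tear down the hcoll context attached to this comm */
    /* hcoll_context_destroy(attr_val); */
    return MPI_SUCCESS;
}

static void attach_hcoll_teardown(MPI_Comm comm, void *hcoll_ctx)
{
    int keyval;
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, hcoll_delete_cb,
                           &keyval, NULL);
    MPI_Comm_set_attr(comm, keyval, hcoll_ctx);
    /* a later MPI_Comm_free(&comm) invokes hcoll_delete_cb locally */
}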

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] 1.7.4rc2r30168 - configure failure on Mac OSX 10.5

2014-01-10 Thread Ralph Castain
And we do appreciate your breakage! :-)

I think we'll just drop 10.5 from the list as that's very old and likely not 
worth fixing


On Jan 9, 2014, at 4:50 PM, Paul Hargrove  wrote:

> Ralph,
> 
> I can build fine on 10.7 (the system I am typing on now), and on 10.6 too.
> 
> I have no strong opinion on fix-vs-document, but as Jeff knows quite well if 
> you say you support it I am going to try to make it break :).
> 
> -Paul
> 
> 
> On Thu, Jan 9, 2014 at 4:46 PM, Ralph Castain  wrote:
> I dunno if we really go back that far, Paul - I doubt anyone has tested on 
> anything less than 10.8, frankly. Might be better if we update to not make 
> claims that far back.
> 
> Were you able to build/run on 10.7?
> 
> On Jan 9, 2014, at 3:25 PM, Paul Hargrove  wrote:
> 
>> As I noted in another email, 1.7.4's README claims support for Mac OSX 
>> versions 10.5 through 10.7.  So, I just now tried (but failed) to build on 
>> 10.5 (Leopard):
>> 
>> *** Assembler
>> checking dependency style of gcc -std=gnu99... gcc3
>> checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -p
>> checking the name lister (/usr/bin/nm -p) interface... BSD nm
>> checking for fgrep... /usr/bin/grep -F
>> checking if need to remove -g from CCASFLAGS... OS X Leopard - yes ( -O3 
>> -DNDEBUG -finline-functions -fno-strict-aliasing)
>> checking whether to enable smp locks... yes
>> checking if .proc/endp is needed... no
>> checking directive for setting text section... .text
>> checking directive for exporting symbols... .globl
>> checking for objdump... no
>> checking if .note.GNU-stack is needed... no
>> checking suffix for labels... :
>> checking prefix for global symbol labels... none
>> configure: error: Could not determine global symbol label prefix
>> 
>> The same failure is seen on a PPC system running OSX Leopard, too.  However, 
>> I figure it best to focus on getting x86 working first before worrying any 
>> about PPC.
>> 
>> The only configure option used was --prefix.
>> The bzip2-compressed config.log is attached.
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Ralph Castain
Very strange. Try adding "-mca grpcomm_base_verbose 5 -mca orte_nidmap_verbose 
10" to your cmd line with the trunk version and let's see what may be happening

Any chance of library confusion here?

On Jan 9, 2014, at 9:57 PM, Paul Hargrove  wrote:

> The problem is seen with both the trunk and the 1.7.4rc tarball.
> 
> -Paul
> 
> 
> On Thu, Jan 9, 2014 at 9:23 PM, Paul Hargrove  wrote:
> 
> On Thu, Jan 9, 2014 at 8:56 PM, Paul Hargrove  wrote:
> I'll try a gcc-based build on one of the systems ASAP.
> 
> Sorry, Ralph:  the failure remains when built w/ gcc.
> Let me know what to try next and I'll give it a shot.
> 
> -Paul
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] hcoll destruction via MPI attribute

2014-01-10 Thread George Bosilca

On Jan 10, 2014, at 15:55 , Jeff Squyres (jsquyres)  wrote:

> On Jan 10, 2014, at 9:49 AM, George Bosilca  wrote:
> 
>> As I said, this is the case today. There are ongoing discussions in the MPI 
>> Forum to relax the wording of the MPI_Comm_free as most of the MPI 
>> implementations do not rely on the strict “collective” behavior of the 
>> MPI_Comm_free (in the sense that it has to be called by all processes but 
>> not necessarily at the same time).
> 
> That will be an interesting discussion.  I look forward to your proposal.  :-)

? We already had this discussion in the context of another proposal. Anyway 
that’s an MPI Forum issue.

>>> I still agree with this point, though — even though COMM_FREE is 
>>> collective, you could still get into ordering / deadlock issues if you're 
>>> (effectively) doing communication inside it.
>> 
>> As long as the call is collective and the same attributes exist on all 
>> communicators, I don’t see how the deadlock is possible. My wording was more 
>> a precaution for the future than a restriction for today.
> Here's an example:
> 
> -
> MPI_Comm comm;
> // comm is setup as an hcoll-enabled communicator
> if (rank == x) {
>    MPI_Send(..., y, tag, MPI_COMM_WORLD);
>    MPI_Comm_free(&comm);
> } else if (rank == y) {
>    MPI_Comm_free(&comm);
>    MPI_Recv(..., x, tag, MPI_COMM_WORLD);
> }
> --
> 
> If the hcoll teardown in the COMM_FREE blocks waiting for all of its peer 
> COMM_FREEs in other processes in the communicator (e.g., due to blocking 
> communication), rank x may block in MPI_SEND waiting for rank y’s MPI_RECV, 
> and therefore never invoke its COMM_FREE.

Based on today’s MPI standard this code is incorrect as the MPI_Comm_free is 
collective, and you can’t have matching blocking communications crossing a 
collective line.
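
A minimal sketch of a variant that respects that ordering, assuming ranks 0 and 
1 stand in for x and y and a duplicated communicator stands in for the 
hcoll-enabled one (an illustration only, not code from this thread):

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm comm;
    int rank, size, buf = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);   /* stand-in for the hcoll-enabled comm */

    if (size >= 2) {
        if (rank == 0) {
            MPI_Send(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    /* Every rank reaches the collective with no blocking point-to-point
       left to match across it. */
    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}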

  George.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] trunk - build failure on OpenBSD

2014-01-10 Thread Jeff Squyres (jsquyres)
This looks like how we handled this issue elsewhere in the OMPI code base, too.

Mellanox: in the interest of getting another good tarball today, since it's the 
weekend for you, I'll apply this patch.

(thanks Paul!)


On Jan 10, 2014, at 2:20 AM, Paul Hargrove  wrote:

> Based on how MAP_ANONYMOUS vs MAP_ANON is dealt with in 
> opal/mca/memory/linux/malloc.c,  I believe the patch below is an appropriate 
> solution for this issue.  Additionally, it handles the possibility that 
> MAP_FAILED is not defined (not sure where that comes up, but 
> opal/mca/memory/linux/malloc.c allows for it).
> 
> -Paul
> 
> Index: oshmem/mca/memheap/base/memheap_base_alloc.c
> ===
> --- oshmem/mca/memheap/base/memheap_base_alloc.c(revision 30223)
> +++ oshmem/mca/memheap/base/memheap_base_alloc.c(working copy)
> @@ -18,6 +18,12 @@
>  #ifdef HAVE_SYS_MMAN_H
>  #include <sys/mman.h>
>  #endif
> +#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
> +# define MAP_ANONYMOUS MAP_ANON
> +#endif
> +#if !defined(MAP_FAILED)
> +# define MAP_FAILED ((char*)-1)
> +#endif
>  
>  #include 
>  #include 
> @@ -278,10 +284,8 @@
>  size,
>  PROT_READ | PROT_WRITE,
>  MAP_SHARED |
> -#if defined (__APPLE__)
> -MAP_ANON |
> -#elif defined (__GNUC__)
> -MAP_ANONYMOUS |
> +#ifdef MAP_ANONYMOUS
> +MAP_ANONYMOUS |
>  #endif
>  MAP_FIXED,
>  0,
> 
> 
> 
> 
> On Thu, Jan 9, 2014 at 8:35 PM, Paul Hargrove  wrote:
> Same issue for NetBSD, too.
> 
> -Paul
> 
> 
> On Thu, Jan 9, 2014 at 7:09 PM, Paul Hargrove  wrote:
> With the new opal/util/path.c I get farther building the trunk on OpenBSD but 
> hit a new failure:
> 
> Making all in mca/memheap
>   CC   base/memheap_base_frame.lo
>   CC   base/memheap_base_select.lo
>   CC   base/memheap_base_alloc.lo
> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:
>  In function '_mmap_attach':
> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
>  error: 'MAP_ANONYMOUS' undeclared (first use in this function)
> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
>  error: (Each undeclared identifier is reported only once
> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
>  error: for each function it appears in.)
> *** Error 1 in oshmem/mca/memheap (Makefile:1631 
> 'base/memheap_base_alloc.lo': @echo "  CC  " 
> base/memheap_base_alloc.lo;depbase=`echo b...)
> *** Error 1 in oshmem (Makefile:1962 'all-recursive')
> *** Error 1 in /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/BLD 
> (Makefile:1685 'all-recursive')
> 
> On OpenBSD one must use MAP_ANON rather than MAP_ANONYMOUS.
> 
> -Paul
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] hcoll destruction via MPI attribute

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 10, 2014, at 9:49 AM, George Bosilca  wrote:

> As I said, this is the case today. There are ongoing discussions in the MPI 
> Forum to relax the wording of the MPI_Comm_free as most of the MPI 
> implementations do not rely on the strict “collective” behavior of the 
> MPI_Comm_free (in the sense that it has to be called by all processes but not 
> necessarily at the same time).

That will be an interesting discussion.  I look forward to your proposal.  :-)

>> I still agree with this point, though — even though COMM_FREE is collective, 
>> you could still get into ordering / deadlock issues if you're (effectively) 
>> doing communication inside it.
> 
> As long as the call is collective and the same attributes exist on all 
> communicators, I don’t see how the deadlock is possible. My wording was more a 
> precaution for the future than a restriction for today.


Here's an example:

-
MPI_Comm comm;
// comm is setup as an hcoll-enabled communicator
if (rank == x) {
    MPI_Send(..., y, tag, MPI_COMM_WORLD);
    MPI_Comm_free(&comm);
} else if (rank == y) {
    MPI_Comm_free(&comm);
    MPI_Recv(..., x, tag, MPI_COMM_WORLD);
}
--

If the hcoll teardown in the COMM_FREE blocks waiting for all of its peer 
COMM_FREEs in other processes in the communicator (e.g., due to blocking 
communication), rank x may block in MPI_SEND waiting for rank y's MPI_RECV, and 
therefore never invoke its COMM_FREE.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] shared lib version on trunk

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 10, 2014, at 9:48 AM, marco atzeri  wrote:

> building openmpi-1.9a1r30128-1, I notice
> -
> # Version information for libmpi.
> current=0
> age=0
> revision=0
> --
> 
> while on 1.7.3 is
> --
> # Version information for libmpi.
> current=3
> age=2
> revision=0
> --
> 
> Is this intentional ?

Yes.  We keep it 0/0/0 on the trunk (since the trunk is for developers only, we 
don't really need to care about ABI issues there), and only update the versions 
on the release branch more-or-less immediately before individual releases.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] hcoll destruction via MPI attribute

2014-01-10 Thread George Bosilca

On Jan 10, 2014, at 15:31 , Jeff Squyres (jsquyres)  wrote:

> On Jan 10, 2014, at 9:19 AM, George Bosilca  wrote:
> 
>> However, one should keep in mind that MPI_Comm_free does not have to be a 
>> collective function, thus making any type of collective 
>> assumption/communications inside the attribute destructor might lead to 
>> deadlocks in future versions.
> 
> Actually, MPI-3 defines MPI_COMM_FREE as collective (p248:23).

As I said, this is the case today. There are ongoing discussions in the MPI 
Forum to relax the wording of the MPI_Comm_free as most of the MPI 
implementations do not rely on the strict “collective” behavior of the 
MPI_Comm_free (in the sense that it has to be called by all processes but not 
necessarily at the same time).

>> In other words, if the only thing you do in the attribute destructor is 
>> tearing down locally posted requests, then you are safe. If you send data 
>> using the communicator then you’re definitely playing dangerously with the 
>> safety line.
> 
> I still agree with this point, though — even though COMM_FREE is collective, 
> you could still get into ordering / deadlock issues if you're (effectively) 
> doing communication inside it.

As long as the call is collective and the same attributes exist on all 
communicators, I don’t see how the deadlock is possible. My wording was more a 
precaution for the future than a restriction for today.

  George.

> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



[OMPI devel] shared lib version on trunk

2014-01-10 Thread marco atzeri

building openmpi-1.9a1r30128-1, I notice

-
# Version information for libmpi.
current=0
age=0
revision=0
--

while on 1.7.3 is
--
# Version information for libmpi.
current=3
age=2
revision=0
--


Is this intentional ?

Regards
Marco




Re: [OMPI devel] trunk build failure on {Free,Net,Open}BSD

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 10, 2014, at 9:18 AM, "Jeff Squyres (jsquyres)"  
wrote:

>> It seems to indicate that even if one does find a statfs() function, there 
>> are multiple os-dependent versions and it should therefore be avoided.  
>> Since statvfs() is defined by POSIX, it should be preferred.
> 
> Sounds good; I'll do that.

Gah.  The situation gets murkier.  I see in OS X Mountain Lion and Mavericks 
man pages for statvfs() where they describe the fields in struct statvfs:

   f_fsid Not meaningful in this implementation.

This is the field I need out of struct statvfs to know what the file system 
magic number is.  Arrgh!

I'll keep looking into what would be a good solution here...
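
For reference, a minimal standalone illustration of the POSIX statvfs() 
interface being discussed (a sketch only, not the opal/util/path.c code). It 
prints the portable fields, while f_fsid is exactly the member the OS X man 
page flags as not meaningful:

#include <stdio.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    struct statvfs vfs;
    const char *path = (argc > 1) ? argv[1] : "/";

    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }
    /* f_flag (ST_RDONLY, ST_NOSUID, ...) and the size fields are portable;
       f_fsid is the member that is "not meaningful" on OS X. */
    printf("%s: bsize=%lu frsize=%lu flags=0x%lx\n",
           path, (unsigned long) vfs.f_bsize,
           (unsigned long) vfs.f_frsize, (unsigned long) vfs.f_flag);
    return 0;
}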

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] hcoll destruction via MPI attribute

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 10, 2014, at 9:19 AM, George Bosilca  wrote:

> However, one should keep in mind that MPI_Comm_free does not have to be a 
> collective function, thus making any type of collective 
> assumption/communications inside the attribute destructor might lead to 
> deadlocks in future versions.

Actually, MPI-3 defines MPI_COMM_FREE as collective (p248:23).

> In other words, if the only thing you do in the attribute destructor is 
> tearing down locally posted requests, then you are safe. If you send data 
> using the communicator then you’re definitely playing dangerously with the 
> safety line.

I still agree with this point, though -- even though COMM_FREE is collective, 
you could still get into ordering / deadlock issues if you're (effectively) 
doing communication inside it.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] hcoll destruction via MPI attribute

2014-01-10 Thread George Bosilca

On Jan 10, 2014, at 14:50 , Jeff Squyres (jsquyres)  wrote:

> On Jan 9, 2014, at 12:05 PM, Joshua Ladd  wrote:
> 
>> [Josh] We have a recursive doubling algorithm in progress implemented with 
>> PML send/recvs, more accurately , with "RTE_isend/RTE_irecv" functions, 
>> which, in the case of OMPI are PML calls.
> 
> Does that mean that you’ll be blocking (effectively) in the communicator 
> destruction function?

I’m not sure I understand what you call the “communicator destruction 
function”. I can see two options here: user perspective (MPI_Comm_free) or ompi 
perspective (the communicator destructor). As I explained in my previous email, 
if they post requests on the communicator then the communicator destructor will 
never be called before they cancel their pending requests. Thus, it is critical 
that they clean up their internal stuff as early as possible in the 
MPI_Comm_free teardown sequence, and here the attribute is a perfect approach.
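
A minimal sketch of such an attribute-based teardown hook, where my_module_t, 
my_module_teardown() and my_attach() are hypothetical stand-ins for the 
hcoll-side state and only the keyval/attribute calls are standard MPI API:

#include <mpi.h>

typedef struct { int dummy; } my_module_t;   /* hypothetical per-communicator state */

static void my_module_teardown(my_module_t *m)
{
    (void) m;   /* cancel/complete locally posted requests, free resources, ... */
}

/* Runs early in the MPI_Comm_free teardown sequence, before the
   communicator itself is destroyed. */
static int my_delete_fn(MPI_Comm comm, int keyval, void *attr_val, void *extra)
{
    (void) comm; (void) keyval; (void) extra;
    my_module_teardown((my_module_t *) attr_val);
    return MPI_SUCCESS;
}

int my_attach(MPI_Comm comm, my_module_t *module)
{
    int keyval;
    MPI_Comm_create_keyval(MPI_COMM_NULL_COPY_FN, my_delete_fn, &keyval, NULL);
    return MPI_Comm_set_attr(comm, keyval, module);
}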

> I *think* that's ok, but I'm not 100% sure... Brian / George / Nathan: can 
> you confirm?
> 
> I ask because the standard does not specify what is allowed in attribute 
> callback functions — which, by omission, means that *everything* is allowed, 
> but I don't know how well tested code paths are that invoke arbitrary MPI 
> (PML) functionality inside communicator teardown.

From the perspective of the MPI 3.0 standard and the current code of Open MPI, 
this approach is perfectly legal and should work.

However, one should keep in mind that MPI_Comm_free does not have to be a 
collective function, thus making any type of collective 
assumption/communications inside the attribute destructor might lead to 
deadlocks in future versions. In other words, if the only thing you do in the 
attribute destructor is tearing down locally posted requests, then you are 
safe. If you send data using the communicator then you’re definitely playing 
dangerously with the safety line.

  George.


> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: 
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] trunk build failure on {Free,Net,Open}BSD

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 9, 2014, at 11:00 PM, Paul Hargrove  wrote:

> The following might be helpful:
>   
> http://stackoverflow.com/questions/1653163/difference-between-statvfs-and-statfs-system-calls
> 
> It seems to indicate that even if one does find a statfs() function, there 
> are multiple os-dependent versions and it should therefore be avoided.  Since 
> statvfs() is defined by POSIX, it should be preferred.

Sounds good; I'll do that.

> If I am not mistaken, reordering the #if logic in path.c to use *only* 
> statvfs() when it is available (and *not* trying both as is done now) would 
> resolve the problems I am seeing with NetBSD and Solaris WITHOUT any need to 
> change the configure logic.  However, if one does want to keep the current 
> logic (or at least something similar) it looks like configure should not 
> assume statfs() is available without *also* confirming that "struct statfs" 
> is available.
> 
> -Paul
> 
> 
> 
> 
> On Thu, Jan 9, 2014 at 7:18 PM, Paul Hargrove  wrote:
> 
> On Thu, Jan 9, 2014 at 7:15 PM, Paul Hargrove  wrote:
> My Solaris-11 build stopped again on the failure to find ibv_open_device().
> I am re-running w/o --enable-openib now.
> 
> It finished while I was typing the previous message.
> The Solaris-11 build failed in the same way as Solaris-10.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] hcoll destruction via MPI attribute

2014-01-10 Thread Jeff Squyres (jsquyres)
On Jan 9, 2014, at 12:05 PM, Joshua Ladd  wrote:

> [Josh] We have a recursive doubling algorithm in progress implemented with 
> PML send/recvs, more accurately , with "RTE_isend/RTE_irecv" functions, 
> which, in the case of OMPI are PML calls.

Does that mean that you'll be blocking (effectively) in the communicator 
destruction function?

I *think* that's ok, but I'm not 100% sure... Brian / George / Nathan: can you 
confirm?

I ask because the standard does not specify what is allowed in attribute 
callback functions -- which, by omission, means that *everything* is allowed, 
but I don't know how well tested code paths are that invoke arbitrary MPI (PML) 
functionality inside communicator teardown.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



[OMPI devel] README / OS X versions (was: 1.7.4rc2r30168 - configure failure on Mac OSX 10.5)

2014-01-10 Thread Jeff Squyres (jsquyres)
For my own sanity:

10.9: Mavericks, last release Dec 2013
10.8: Mountain Lion, last release Oct 2013 (maybe not dead)
10.7: Lion, last release Oct 2012 (dead)
10.6: Snow Leopard, last release 2011 (dead)
10.5: Leopard, last release 2009 (dead)

I don't think we should expend any effort for 10.5; it's too old.  I don't 
think I care about 10.6, either, but if it still works, I guess there's no real 
reason to remove it.

So this is just a +1 on removing 10.5 from the README.


On Jan 9, 2014, at 7:50 PM, Paul Hargrove  wrote:

> Ralph,
> 
> I can build fine on 10.7 (the system I am typing on now), and on 10.6 too.
> 
> I have no strong opinion on fix-vs-document, but as Jeff knows quite well if 
> you say you support it I am going to try to make it break :).
> 
> -Paul
> 
> 
> On Thu, Jan 9, 2014 at 4:46 PM, Ralph Castain  wrote:
> I dunno if we really go back that far, Paul - I doubt anyone has tested on 
> anything less than 10.8, frankly. Might be better if we update to not make 
> claims that far back.
> 
> Were you able to build/run on 10.7?
> 
> On Jan 9, 2014, at 3:25 PM, Paul Hargrove  wrote:
> 
>> As I noted in another email, 1.7.4's README claims support for Mac OSX 
>> versions 10.5 through 10.7.  So, I just now tried (but failed) to build on 
>> 10.5 (Leopard):
>> 
>> *** Assembler
>> checking dependency style of gcc -std=gnu99... gcc3
>> checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -p
>> checking the name lister (/usr/bin/nm -p) interface... BSD nm
>> checking for fgrep... /usr/bin/grep -F
>> checking if need to remove -g from CCASFLAGS... OS X Leopard - yes ( -O3 
>> -DNDEBUG -finline-functions -fno-strict-aliasing)
>> checking whether to enable smp locks... yes
>> checking if .proc/endp is needed... no
>> checking directive for setting text section... .text
>> checking directive for exporting symbols... .globl
>> checking for objdump... no
>> checking if .note.GNU-stack is needed... no
>> checking suffix for labels... :
>> checking prefix for global symbol labels... none
>> configure: error: Could not determine global symbol label prefix
>> 
>> The same failure is seen on a PPC system running OSX Leopard, too.  However, 
>> I figure it best to focus on getting x86 working first before worrying any 
>> about PPC.
>> 
>> The only configure option used was --prefix.
>> The bzip2-compressed config.log is attached.
>> 
>> -Paul
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [OMPI devel] RFC: OB1 optimizations

2014-01-10 Thread George Bosilca
Nathan,

When you get access to the machine it might be interesting to show not only the 
after-patch performance but also what the trunk is getting on the same 
architecture.

  George.

On Jan 8, 2014, at 18:09 , Nathan Hjelm  wrote:

> Yeah. Its hard to say what the results will look like on Haswell. I
> expect they should show some improvement from George's change but we
> won't know until I can get to a Haswell node. Hopefully one becomes
> available today.
> 
> -Nathan
> 
> On Wed, Jan 08, 2014 at 08:59:34AM -0800, Paul Hargrove wrote:
>>   Nevermind, since Nathan just clarified that the results are not
>>   comparable.
>> 
>>   -Paul [Sent from my phone]
>> 
>>   On Jan 8, 2014 8:58 AM, "Paul Hargrove"  wrote:
>> 
>> Interestingly enough the 4MB latency actually improved significantly
>> relative to the initial numbers.
>> 
>> -Paul [Sent from my phone]
>> 
>> On Jan 8, 2014 8:50 AM, "George Bosilca"  wrote:
>> 
>>   These results are way worst that the one you send on your previous
>>   email? What is the reason?
>> 
>> George.
>> 
>>   On Jan 8, 2014, at 17:33 , Nathan Hjelm  wrote:
>> 
>>> Ah, good catch. A new version is attached that should eliminate the race
>>> window for the multi-threaded case. Performance numbers are still
>>> looking really good. We beat mvapich2 in the small message ping-pong by
>>> a good margin. See the results below. The large message latency
>>> difference for large messages is probably due to a difference in the max
>>> send size for vader vs mvapich.
>>> 
>>> To answer Pasha's question. I don't see a noticiable difference in
>>> performance for btl's with no sendi function (this includes
>>> ugni). OpenIB should get a boost. I will test that once I get an
>>> allocation.
>>> 
>>> CPU: Xeon E5-2670 @ 2.60 GHz
>>> 
>>> Open MPI (-mca btl vader,self):
>>> # OSU MPI Latency Test v4.1
>>> # Size  Latency (us)
>>> 0   0.17
>>> 1   0.19
>>> 2   0.19
>>> 4   0.19
>>> 8   0.19
>>> 16  0.19
>>> 32  0.19
>>> 64  0.40
>>> 128 0.40
>>> 256 0.43
>>> 512 0.52
>>> 1024    0.67
>>> 2048    0.94
>>> 4096    1.44
>>> 8192    2.04
>>> 16384   3.47
>>> 32768   6.10
>>> 65536   9.38
>>> 131072 16.47
>>> 262144 29.63
>>> 524288 54.81
>>> 1048576   106.63
>>> 2097152   206.84
>>> 4194304   421.26
>>> 
>>> 
>>> mvapich2 1.9:
>>> # OSU MPI Latency Test
>>> # SizeLatency (us)
>>> 0 0.23
>>> 1 0.23
>>> 2 0.23
>>> 4 0.23
>>> 8 0.23
>>> 160.28
>>> 320.28
>>> 640.39
>>> 128   0.40
>>> 256   0.40
>>> 512   0.42
>>> 1024  0.51
>>> 2048  0.71
>>> 4096  1.02
>>> 8192  1.60
>>> 16384 3.47
>>> 32768 5.05
>>> 65536 8.06
>>> 131072   14.82
>>> 262144   28.15
>>> 524288   53.69
>>> 1048576 127.47
>>> 2097152 235.58
>>> 4194304 683.90
>>> 
>>> 
>>> -Nathan
>>> 
>>> On Tue, Jan 07, 2014 at 06:23:13PM -0700, George Bosilca wrote:
   The local request is not correctly released, leading to assert in debug
   mode. This is because you avoid calling MCA_PML_BASE_RECV_REQUEST_FINI,
   fact that leaves the request in an ACTIVE state, condition carefully
   checked during the call to destructor.

   I attached a second patch that fixes the issue above, and implement a
   similar optimization for the blocking send.

   Unfortunately, this is not enough. The mca_pml_ob1_send_inline
   optimization is horribly wrong in a multithreaded case as it alter the
   send_sequence without storing it. If you create a gap in the send_sequence
   a deadlock will __definitively__ occur. I strongly suggest you turn off
   the mca_pml_ob1_send_inline optimization on the multithreaded case. All
   the others optimizations should be safe in all cases.

     George.

   On Jan 8, 2014, at 01:15 , Shamis, Pavel  wrote:

> Overall it looks good. It would be helpful to validate performance
> numbers for other interconnects 

Re: [OMPI devel] 1.7.4rc2r30148 run failure NetBSD6-x86

2014-01-10 Thread Mike Dubman
Hey Paul,
Thanks for report, we will commit fix shortly.
M


On Fri, Jan 10, 2014 at 7:20 AM, Paul Hargrove  wrote:

>
> On Thu, Jan 9, 2014 at 9:05 PM, Ralph Castain  wrote:
>
>> Not sure why the shmem fortran examples would try to build - will pass
>> that off to Jeff as well (sorry Jeff!)
>
>
> This is the issue I described in
> http://www.open-mpi.org/community/lists/devel/2014/01/13616.php
>
> It seems that oshmem_info always says "oshmem:bindings:fort:yes " even
> when there is no fortran compiler.
> I believe it is a configury issue, since the value comes from the value of
> an AM_CONDITIONAL.
>
>
> -Paul
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] trunk build failure on {Free,Net,Open}BSD

2014-01-10 Thread marco atzeri

Il 1/10/2014 5:00 AM, Paul Hargrove ha scritto:

The following might be helpful:
http://stackoverflow.com/questions/1653163/difference-between-statvfs-and-statfs-system-calls

It seems to indicate that even if one does find a statfs() function,
there are multiple os-dependent versions and it should therefore be
avoided.  Since statvfs() is defined by POSIX, it should be preferred.

If I am not mistaken, reordering the #if logic in path.c to use *only*
statvfs() when it is available (and *not* trying both as is done now)
would resolve the problems I am seeing with NetBSD and Solaris WITHOUT
any need to change the configure logic.  However, if one does want to
keep the current logic (or at least something similar) it looks like
configure should not assume statfs() is available without *also*
confirming that "struct statfs" is available.

-Paul



statvfs() is available on CYGWIN,
http://cygwin.com/cygwin-api/compatibility.html#std-susv4

so no issue to use it as default for me

Thanks
Marco





Re: [OMPI devel] trunk - build failure on OpenBSD

2014-01-10 Thread Paul Hargrove
Based on how MAP_ANONYMOUS vs MAP_ANON is dealt with
in opal/mca/memory/linux/malloc.c,  I believe the patch below is an
appropriate solution for this issue.  Additionally, it handles the
possibility that MAP_FAILED is not defined (not sure where that comes up,
but opal/mca/memory/linux/malloc.c allows for it).

-Paul

Index: oshmem/mca/memheap/base/memheap_base_alloc.c
===
--- oshmem/mca/memheap/base/memheap_base_alloc.c(revision 30223)
+++ oshmem/mca/memheap/base/memheap_base_alloc.c(working copy)
@@ -18,6 +18,12 @@
 #ifdef HAVE_SYS_MMAN_H
 #include <sys/mman.h>
 #endif
+#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
+# define MAP_ANONYMOUS MAP_ANON
+#endif
+#if !defined(MAP_FAILED)
+# define MAP_FAILED ((char*)-1)
+#endif

 #include 
 #include 
@@ -278,10 +284,8 @@
 size,
 PROT_READ | PROT_WRITE,
 MAP_SHARED |
-#if defined (__APPLE__)
-MAP_ANON |
-#elif defined (__GNUC__)
-MAP_ANONYMOUS |
+#ifdef MAP_ANONYMOUS
+MAP_ANONYMOUS |
 #endif
 MAP_FIXED,
 0,
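
For what it's worth, a tiny standalone program showing the same fallback in use 
(an illustration only, not the memheap code; note that a portable anonymous 
mapping passes fd = -1):

#include <stdio.h>
#include <sys/mman.h>

#if !defined(MAP_ANONYMOUS) && defined(MAP_ANON)
#define MAP_ANONYMOUS MAP_ANON      /* BSD / OS X spelling */
#endif

int main(void)
{
    size_t size = 1 << 20;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    if (MAP_FAILED == p) {
        perror("mmap");
        return 1;
    }
    munmap(p, size);
    return 0;
}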




On Thu, Jan 9, 2014 at 8:35 PM, Paul Hargrove  wrote:

> Same issue for NetBSD, too.
>
> -Paul
>
>
> On Thu, Jan 9, 2014 at 7:09 PM, Paul Hargrove  wrote:
>
>> With the new opal/util/path.c I get farther building the trunk on OpenBSD
>> but hit a new failure:
>>
>> Making all in mca/memheap
>>   CC   base/memheap_base_frame.lo
>>   CC   base/memheap_base_select.lo
>>   CC   base/memheap_base_alloc.lo
>> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:
>> In function '_mmap_attach':
>> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
>> error: 'MAP_ANONYMOUS' undeclared (first use in this function)
>> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
>> error: (Each undeclared identifier is reported only once
>> /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/openmpi-1.9a1r30215/oshmem/mca/memheap/base/memheap_base_alloc.c:284:
>> error: for each function it appears in.)
>> *** Error 1 in oshmem/mca/memheap (Makefile:1631
>> 'base/memheap_base_alloc.lo': @echo "  CC  "
>> base/memheap_base_alloc.lo;depbase=`echo b...)
>> *** Error 1 in oshmem (Makefile:1962 'all-recursive')
>> *** Error 1 in /home/phargrov/OMPI/openmpi-trunk-openbsd5-i386/BLD
>> (Makefile:1685 'all-recursive')
>>
>> On OpenBSD one must use MAP_ANON rather than MAP_ANONYMOUS.
>>
>> -Paul
>>
>>
>> --
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>>
>
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
The problem is seen with both the trunk and the 1.7.4rc tarball.

-Paul


On Thu, Jan 9, 2014 at 9:23 PM, Paul Hargrove  wrote:

>
> On Thu, Jan 9, 2014 at 8:56 PM, Paul Hargrove  wrote:
>
>> I'll try a gcc-based build on one of the systems ASAP.
>
>
> Sorry, Ralph:  the failure remains when built w/ gcc.
> Let me know what to try next and I'll give it a shot.
>
> -Paul
>
>
> --
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>



-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Paul Hargrove
On Thu, Jan 9, 2014 at 8:56 PM, Paul Hargrove  wrote:

> I'll try a gcc-based build on one of the systems ASAP.


Sorry, Ralph:  the failure remains when built w/ gcc.
Let me know what to try next and I'll give it a shot.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] 1.7.4rc2r30148 run failure NetBSD6-x86

2014-01-10 Thread Paul Hargrove
On Thu, Jan 9, 2014 at 9:05 PM, Ralph Castain  wrote:

> Not sure why the shmem fortran examples would try to build - will pass
> that off to Jeff as well (sorry Jeff!)


This is the issue I described in
http://www.open-mpi.org/community/lists/devel/2014/01/13616.php

It seems that oshmem_info always says "oshmem:bindings:fort:yes " even when
there is no fortran compiler.
I believe it is a configury issue, since the value comes from the value of
an AM_CONDITIONAL.

-Paul


-- 
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
Computer and Data Sciences Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900


Re: [OMPI devel] uDAPL and elan in 1.7.4?

2014-01-10 Thread Ralph Castain
Corrected - thanks!

On Jan 9, 2014, at 5:40 PM, Paul Hargrove  wrote:

> The README in the current 1.7.4rc tarball still claims support for uDAPL and 
> Quadrics Elan.  Unless I am mistaken, those were both removed.
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] Portals vs Portals4 in 1.7.4

2014-01-10 Thread Ralph Castain
Corrected - thanks!

On Jan 9, 2014, at 5:45 PM, Paul Hargrove  wrote:

> The README in the current 1.7.4rc tarball lists support for "Portals" and 
> documents --with-portals{,-config,-libs} configure arguments.
> 
> However, unless I am mistaken mtl:portals is gone and mtl:portals4 has 
> different configure arguments.
> 
> -Paul
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] 1.7.4rc2r30148 run failure NetBSD6-x86

2014-01-10 Thread Ralph Castain
Really have to thank you for your persistence, Paul. Truly appreciated.

Glad to hear you can run ring_c. I'm going to let Jeff handle the path.c and 
ANON issue as those are both configury things he was working today.

Not sure why the shmem fortran examples would try to build - will pass that off 
to Jeff as well (sorry Jeff!)

On Jan 9, 2014, at 9:01 PM, Paul Hargrove  wrote:

> Ralph,
> 
> My NetBSD builds fail on the new opal/util/path.c, but by adding
>   #undef HAVE_STATFS
> near the top of path.c I can get past that.
> 
> Next I hit the MAP_ANON-vs-MAP_ANONYMOUS issue and fixed that manually.
> 
> Next I hit the attempt to build shmem fortran examples despite the lack of a 
> fortran compiler.
> 
> I *was* able to finally confirm that I can now run ring_c.
> 
> -Paul
> 
> 
> On Thu, Jan 9, 2014 at 12:07 PM, Paul Hargrove  wrote:
> Ralph,
> 
> Thanks for fielding all these issues I've been finding.
> I will plan to run tonight's trunk tarball through all of the systems where 
> I've seen any issues.
> 
> -Paul
> 
> 
> On Thu, Jan 9, 2014 at 8:40 AM, Ralph Castain  wrote:
> Should now be fixed in trunk (silently fall back to not binding if cores not 
> found) - scheduled for 1.7.4. If you could test the next trunk tarball, that 
> would help as I can't actually test it on my machines
> 
> 
> On Jan 9, 2014, at 6:25 AM, Ralph Castain  wrote:
> 
>> I see the issue - there are no "cores" on this topology, only "pu's", so 
>> "bind-to core" is going to fail even though binding is supported. Will 
>> adjust.
>> 
>> Thanks!
>> 
>> On Jan 8, 2014, at 9:06 PM, Paul Hargrove  wrote:
>> 
>>> Requested verbose output below.
>>> -Paul
>>> 
>>> -bash-4.2$ mpirun -mca ess_base_verbose 10 -np 1 examples/ring_c
>>> [pcp-j-17:02150] mca: base: components_register: registering ess components
>>> [pcp-j-17:02150] mca: base: components_register: found loaded component env
>>> [pcp-j-17:02150] mca: base: components_register: component env has no 
>>> register or open function
>>> [pcp-j-17:02150] mca: base: components_register: found loaded component hnp
>>> [pcp-j-17:02150] mca: base: components_register: component hnp has no 
>>> register or open function
>>> [pcp-j-17:02150] mca: base: components_register: found loaded component 
>>> singleton
>>> [pcp-j-17:02150] mca: base: components_register: component singleton 
>>> register function successful
>>> [pcp-j-17:02150] mca: base: components_register: found loaded component tool
>>> [pcp-j-17:02150] mca: base: components_register: component tool has no 
>>> register or open function
>>> [pcp-j-17:02150] mca: base: components_open: opening ess components
>>> [pcp-j-17:02150] mca: base: components_open: found loaded component env
>>> [pcp-j-17:02150] mca: base: components_open: component env open function 
>>> successful
>>> [pcp-j-17:02150] mca: base: components_open: found loaded component hnp
>>> [pcp-j-17:02150] mca: base: components_open: component hnp open function 
>>> successful
>>> [pcp-j-17:02150] mca: base: components_open: found loaded component 
>>> singleton
>>> [pcp-j-17:02150] mca: base: components_open: component singleton open 
>>> function successful
>>> [pcp-j-17:02150] mca: base: components_open: found loaded component tool
>>> [pcp-j-17:02150] mca: base: components_open: component tool open function 
>>> successful
>>> [pcp-j-17:02150] mca:base:select: Auto-selecting ess components
>>> [pcp-j-17:02150] mca:base:select:(  ess) Querying component [env]
>>> [pcp-j-17:02150] mca:base:select:(  ess) Skipping component [env]. Query 
>>> failed to return a module
>>> [pcp-j-17:02150] mca:base:select:(  ess) Querying component [hnp]
>>> [pcp-j-17:02150] mca:base:select:(  ess) Query of component [hnp] set 
>>> priority to 100
>>> [pcp-j-17:02150] mca:base:select:(  ess) Querying component [singleton]
>>> [pcp-j-17:02150] mca:base:select:(  ess) Skipping component [singleton]. 
>>> Query failed to return a module
>>> [pcp-j-17:02150] mca:base:select:(  ess) Querying component [tool]
>>> [pcp-j-17:02150] mca:base:select:(  ess) Skipping component [tool]. Query 
>>> failed to return a module
>>> [pcp-j-17:02150] mca:base:select:(  ess) Selected component [hnp]
>>> [pcp-j-17:02150] mca: base: close: component env closed
>>> [pcp-j-17:02150] mca: base: close: unloading component env
>>> [pcp-j-17:02150] mca: base: close: component singleton closed
>>> [pcp-j-17:02150] mca: base: close: unloading component singleton
>>> [pcp-j-17:02150] mca: base: close: component tool closed
>>> [pcp-j-17:02150] mca: base: close: unloading component tool
>>> [pcp-j-17:02150] [[INVALID],INVALID] Topology Info:
>>> [pcp-j-17:02150] Type: Machine Number of child objects: 2
>>> Name=NULL
>>> Backend=NetBSD
>>> OSName=NetBSD
>>> OSRelease=6.1
>>> OSVersion="NetBSD 6.1 (CUSTOM) #0: Fri Sep 20 13:19:58 PDT 2013 
>>> 

Re: [OMPI devel] 1.7.4rc2r30168 - odd run failure

2014-01-10 Thread Ralph Castain
It's missing the hostname from the other process - should have been included in 
the data passed into each proc at startup, which is why it's so puzzling.

On Jan 9, 2014, at 8:56 PM, Paul Hargrove  wrote:

> Ralph,
> 
> The problem has occurred with two builds (both PGI-based) on head nodes of 
> two clusters managed by TORQUE, not by SLURM.  Somehow configure on the first 
> picked up SLURM headers and libs, but not TM.  While the second picked up the 
> TM headers and libs.
> 
> I'll try a gcc-based build on one of the systems ASAP.
> Is there no way (w/o source mods) to know what datum is missing?
> 
> -Paul
> 
> 
> 
> On Thu, Jan 9, 2014 at 8:35 PM, Ralph Castain  wrote:
> From your ompi_info output, it looks like this is a slurm system - yes? 
> Wouldn't really matter anyway as we run fine on a head node without an 
> allocation, but worth clarifying.
> 
> What the message is indicating is a failure of the modex - we are missing an 
> expected piece of data. I don't see anything obvious as the source of the 
> problem - works fine for me on all my machines, including on front end of a 
> slurm cluster.
> 
> Only possibly relevant thing I see is that this was built with PGI - any 
> chance you could try a gcc based build? All my tests are done with gcc, so 
> I'm wondering if PGI is the source of the trouble here.
> 
> 
> On Jan 9, 2014, at 6:17 PM, Paul Hargrove  wrote:
> 
>> I've now seen this same failure mode on another Linux system.
>> I forgot to mention before that the job is hung after issuing the error 
>> message.
>> Singleton runs fail in the same manner.
>> 
>> Both are front-end machines and perhaps that is related to this failure; for 
>> instance expecting an allocation because of the batch system detected at 
>> configure time.  However, I would have expected a more informative error 
>> message for that case.
>> 
>> -Paul
>> 
>> 
>> On Thu, Jan 9, 2014 at 5:03 PM, Paul Hargrove  wrote:
>> Trying to run on the front-end of one of our production Linux systems I see 
>> the following:
>> 
>> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
>> [cvrsvc01:17692] [[42051,1],0] ORTE_ERROR_LOG: Data for specified key not 
>> found in file 
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c
>>  at line 505
>> [cvrsvc01:17693] [[42051,1],1] ORTE_ERROR_LOG: Data for specified key not 
>> found in file 
>> /global/homes/h/hargrove/GSCRATCH/OMPI/openmpi-1.7-latest-linux-x86_64-pgi-12.8/openmpi-1.7.4rc2r30168/orte/runtime/orte_globals.c
>>  at line 505
>> 
>> The "ompi_info --all" output is attached.
>> 
>> Please let me know what MCA param(s) to set to collect any additional info 
>> needed to track down the problem.
>> 
>> -Paul
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> 
>> 
>> 
>> -- 
>> Paul H. Hargrove  phhargr...@lbl.gov
>> Future Technologies Group
>> Computer and Data Sciences Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> 
> -- 
> Paul H. Hargrove  phhargr...@lbl.gov
> Future Technologies Group
> Computer and Data Sciences Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] 1.7.4rc2r30148 run failure NetBSD6-x86

2014-01-10 Thread Paul Hargrove
Ralph,

My NetBSD builds fail on the new opal/util/path.c, but by adding
  #undef HAVE_STATFS
near the top of path.c I can get past that.

Next I hit the MAP_ANON-vs-MAP_ANONYMOUS issue and fixed that manually.

Next I hit the attempt to build shmem fortran examples despite the lack of
a fortran compiler.

I *was* able to finally confirm that I can now run ring_c.

-Paul


On Thu, Jan 9, 2014 at 12:07 PM, Paul Hargrove  wrote:

> Ralph,
>
> Thanks for fielding all these issues I've been finding.
> I will plan to run tonight's trunk tarball through all of the systems
> where I've seen any issues.
>
> -Paul
>
>
> On Thu, Jan 9, 2014 at 8:40 AM, Ralph Castain  wrote:
>
>> Should now be fixed in trunk (silently fall back to not binding if cores
>> not found) - scheduled for 1.7.4. If you could test the next trunk tarball,
>> that would help as I can't actually test it on my machines
>>
>>
>> On Jan 9, 2014, at 6:25 AM, Ralph Castain  wrote:
>>
>> I see the issue - there are no "cores" on this topology, only "pu's", so
>> "bind-to core" is going to fail even though binding is supported. Will
>> adjust.
>>
>> Thanks!
>>
>> On Jan 8, 2014, at 9:06 PM, Paul Hargrove  wrote:
>>
>> Requested verbose output below.
>> -Paul
>>
>> -bash-4.2$ mpirun -mca ess_base_verbose 10 -np 1 examples/ring_c
>> [pcp-j-17:02150] mca: base: components_register: registering ess
>> components
>> [pcp-j-17:02150] mca: base: components_register: found loaded component
>> env
>> [pcp-j-17:02150] mca: base: components_register: component env has no
>> register or open function
>> [pcp-j-17:02150] mca: base: components_register: found loaded component
>> hnp
>> [pcp-j-17:02150] mca: base: components_register: component hnp has no
>> register or open function
>> [pcp-j-17:02150] mca: base: components_register: found loaded component
>> singleton
>> [pcp-j-17:02150] mca: base: components_register: component singleton
>> register function successful
>> [pcp-j-17:02150] mca: base: components_register: found loaded component
>> tool
>> [pcp-j-17:02150] mca: base: components_register: component tool has no
>> register or open function
>> [pcp-j-17:02150] mca: base: components_open: opening ess components
>> [pcp-j-17:02150] mca: base: components_open: found loaded component env
>> [pcp-j-17:02150] mca: base: components_open: component env open function
>> successful
>> [pcp-j-17:02150] mca: base: components_open: found loaded component hnp
>> [pcp-j-17:02150] mca: base: components_open: component hnp open function
>> successful
>> [pcp-j-17:02150] mca: base: components_open: found loaded component
>> singleton
>> [pcp-j-17:02150] mca: base: components_open: component singleton open
>> function successful
>> [pcp-j-17:02150] mca: base: components_open: found loaded component tool
>> [pcp-j-17:02150] mca: base: components_open: component tool open function
>> successful
>> [pcp-j-17:02150] mca:base:select: Auto-selecting ess components
>> [pcp-j-17:02150] mca:base:select:(  ess) Querying component [env]
>> [pcp-j-17:02150] mca:base:select:(  ess) Skipping component [env]. Query
>> failed to return a module
>> [pcp-j-17:02150] mca:base:select:(  ess) Querying component [hnp]
>> [pcp-j-17:02150] mca:base:select:(  ess) Query of component [hnp] set
>> priority to 100
>> [pcp-j-17:02150] mca:base:select:(  ess) Querying component [singleton]
>> [pcp-j-17:02150] mca:base:select:(  ess) Skipping component [singleton].
>> Query failed to return a module
>> [pcp-j-17:02150] mca:base:select:(  ess) Querying component [tool]
>> [pcp-j-17:02150] mca:base:select:(  ess) Skipping component [tool]. Query
>> failed to return a module
>> [pcp-j-17:02150] mca:base:select:(  ess) Selected component [hnp]
>> [pcp-j-17:02150] mca: base: close: component env closed
>> [pcp-j-17:02150] mca: base: close: unloading component env
>> [pcp-j-17:02150] mca: base: close: component singleton closed
>> [pcp-j-17:02150] mca: base: close: unloading component singleton
>> [pcp-j-17:02150] mca: base: close: component tool closed
>> [pcp-j-17:02150] mca: base: close: unloading component tool
>> [pcp-j-17:02150] [[INVALID],INVALID] Topology Info:
>> [pcp-j-17:02150] Type: Machine Number of child objects: 2
>> Name=NULL
>> Backend=NetBSD
>> OSName=NetBSD
>> OSRelease=6.1
>> OSVersion="NetBSD 6.1 (CUSTOM) #0: Fri Sep 20 13:19:58 PDT 2013
>> phargrov@pcp-j-17:/home/phargrov/CUSTOM"
>> Architecture=i386
>> Backend=x86
>> Cpuset:  0x0003
>> Online:  0x0003
>> Allowed: 0x0003
>> Bind CPU proc:   TRUE
>> Bind CPU thread: TRUE
>> Bind MEM proc:   FALSE
>> Bind MEM thread: FALSE
>> Type: PU Number of child objects: 0
>> Name=NULL
>> Cpuset:  0x0001
>> Online:  0x0001
>> Allowed: 0x0001
>> Type: PU Number