Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
(adding pkg-openmpi-maintain...@lists.alioth.debian.org which I should have added earlier, sorry! --Dirk)

On 14 August 2007 at 00:08, Adrian Knoth wrote:
| On Mon, Aug 13, 2007 at 04:26:31PM -0500, Dirk Eddelbuettel wrote:
|
| > > I'll now compile the 1.2.3 release tarball and see if I can reproduce
|
| The 1.2.3 release also works fine:
|
| adi@debian:~$ ./ompi123/bin/mpirun -np 2 ring
| 0: sending message (0) to 1
| 0: sent message
| 1: waiting for message
| 1: got message (1) from 0, sending to 0
| 0: got message (1) from 1

Now I'm even more confused. I thought the bug was that it segfaulted when used on a Debian-on-FreeBSD-kernel system?

| adi@debian:~$ ./ompi123/bin/ompi_info
|                Open MPI: 1.2.3
|   Open MPI SVN revision: r15136
|                Open RTE: 1.2.3
|   Open RTE SVN revision: r15136
|                    OPAL: 1.2.3
|       OPAL SVN revision: r15136
|                  Prefix: /home/adi/ompi123
| Configured architecture: x86_64-unknown-kfreebsd6.2-gnu
|
| > > the segfaults. On the other hand, I guess nobody is using OMPI on
| > > GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot
| > > would also fix the problem (think of "fixed in experimental").
|
| > Well, I generally prefer to follow upstream releases, and Jeff from the
| > upstream team echoed that. Let's wait for 1.2.4, shall we?
|
| That's fine, v1.2 is the production release.
|
| > | JFTR: It's currently not possible to compile OMPI on amd64 (out of the
| > | box). Though it compiles on i386
| > |
| > | http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-i386&stamp=1187000200&file=log&as=raw
| > |
| > | it fails on amd64:
| > |
| > | http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-amd64&stamp=1186969782&file=log&as=raw
| > |
| > | stacktrace.c: In function 'opal_show_stackframe':
| > | stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this function)
| > | stacktrace.c:145: error: (Each undeclared identifier is reported only once
| > | stacktrace.c:145: error: for each function it appears in.)
| > | stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this function)
| > | stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this function)
| > | make[4]: *** [stacktrace.lo] Error 1
| > | make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'
| > |
| > | This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h: the
| > | relevant #define's are placed in an #ifdef __i386__ condition. After
| > | extending this for __x86_64__, everything works fine.
| > |
| > | Should I file a bug report against libc0.1-dev or will you take care?
|
| > I'm confused. What is libc0.1-dev?
|
| http://packages.debian.org/unstable/libdevel/libc0.1-dev
|
| It's the "libc6-dev" for GNU/kFreeBSD, at least that's how I understand it.

I see, thanks. Well, if the bug is in the header files supplied by that package, please go ahead and file a bug report.

| > Also note that I happened to have uploaded a third Debian revision of 1.2.3
| > yesterday, and that Debian release 1.2.3-3 built fine on amd64 as per:
| >
| > http://buildd.debian.org/build.php?&pkg=openmpi&ver=1.2.3-3&arch=amd64&file=log
| >
| > So are we sure there's a bug?
|
| Yes, absolutely. I was a little bit imprecise: with amd64, I meant
| kfreebsd-amd64, not Linux-amd64.

Ack.

| If you follow my two links and read their headlines, you can see that
| these are the build logs of 1.2.3-3 on kfreebsd, working for i386 but
| failing for amd64.
|
| This is caused by "wrong" libc headers on kfreebsd; that's why I thought
| Uwe might want to have a look at it.

Ok. Back to the initial bug of Open MPI on Debian/kFreeBSD. What exactly is the status now?

Thanks, Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
On Mon, Aug 13, 2007 at 04:26:31PM -0500, Dirk Eddelbuettel wrote:

> > I'll now compile the 1.2.3 release tarball and see if I can reproduce

The 1.2.3 release also works fine:

adi@debian:~$ ./ompi123/bin/mpirun -np 2 ring
0: sending message (0) to 1
0: sent message
1: waiting for message
1: got message (1) from 0, sending to 0
0: got message (1) from 1

adi@debian:~$ ./ompi123/bin/ompi_info
               Open MPI: 1.2.3
  Open MPI SVN revision: r15136
               Open RTE: 1.2.3
  Open RTE SVN revision: r15136
                   OPAL: 1.2.3
      OPAL SVN revision: r15136
                 Prefix: /home/adi/ompi123
Configured architecture: x86_64-unknown-kfreebsd6.2-gnu

> > the segfaults. On the other hand, I guess nobody is using OMPI on
> > GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot
> > would also fix the problem (think of "fixed in experimental").

> Well, I generally prefer to follow upstream releases, and Jeff from the
> upstream team echoed that. Let's wait for 1.2.4, shall we?

That's fine, v1.2 is the production release.

> | JFTR: It's currently not possible to compile OMPI on amd64 (out of the
> | box). Though it compiles on i386
> |
> | http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-i386&stamp=1187000200&file=log&as=raw
> |
> | it fails on amd64:
> |
> | http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-amd64&stamp=1186969782&file=log&as=raw
> |
> | stacktrace.c: In function 'opal_show_stackframe':
> | stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this function)
> | stacktrace.c:145: error: (Each undeclared identifier is reported only once
> | stacktrace.c:145: error: for each function it appears in.)
> | stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this function)
> | stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this function)
> | make[4]: *** [stacktrace.lo] Error 1
> | make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'
> |
> | This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h: the
> | relevant #define's are placed in an #ifdef __i386__ condition. After
> | extending this for __x86_64__, everything works fine.
> |
> | Should I file a bug report against libc0.1-dev or will you take care?

> I'm confused. What is libc0.1-dev?

http://packages.debian.org/unstable/libdevel/libc0.1-dev

It's the "libc6-dev" for GNU/kFreeBSD, at least that's how I understand it.

> Also note that I happened to have uploaded a third Debian revision of 1.2.3
> yesterday, and that Debian release 1.2.3-3 built fine on amd64 as per:
>
> http://buildd.debian.org/build.php?&pkg=openmpi&ver=1.2.3-3&arch=amd64&file=log
>
> So are we sure there's a bug?

Yes, absolutely. I was a little bit imprecise: with amd64, I meant kfreebsd-amd64, not Linux-amd64.

If you follow my two links and read their headlines, you can see that these are the build logs of 1.2.3-3 on kfreebsd, working for i386 but failing for amd64.

This is caused by "wrong" libc headers on kfreebsd; that's why I thought Uwe might want to have a look at it.

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
Adrian,

On 13 August 2007 at 22:28, Adrian Knoth wrote:
| On Thu, Aug 02, 2007 at 10:51:13AM +0200, Adrian Knoth wrote:
|
| > > We (as in the Debian maintainer for Open MPI) got this bug report from
| > > Uwe who sees mpi apps segfault on Debian systems with the FreeBSD
| > > kernel.
| > > Any input would be greatly appreciated!
|
| > I'll follow the QEMU instructions on your website and investigate on
| > my own ;)
|
| I was able to get OMPI running on kfreebsd-amd64. I used a nightly
| snapshot from the trunk, so the problem is "more or less fixed by
| upstream" ;)
|
| adi@debian:~$ ./ompi/bin/mpirun -np 2 ring
| 0: sending message (0) to 1
| 0: sent message
| 1: waiting for message
| 1: got message (1) from 0, sending to 0
| 0: got message (1) from 1
|
| adi@debian:~$ ./ompi/bin/ompi_info
|                Open MPI: 1.3a1r15820
|   Open MPI SVN revision: r15820
|                Open RTE: 1.3a1r15820
|   Open RTE SVN revision: r15820
|                    OPAL: 1.3a1r15820
|       OPAL SVN revision: r15820
|                  Prefix: /home/adi/ompi
| Configured architecture: x86_64-unknown-kfreebsd6.2-gnu
|
| I'll now compile the 1.2.3 release tarball and see if I can reproduce

I really appreciate the help.

| the segfaults. On the other hand, I guess nobody is using OMPI on
| GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot
| would also fix the problem (think of "fixed in experimental").

Well, I generally prefer to follow upstream releases, and Jeff from the upstream team echoed that. Let's wait for 1.2.4, shall we? OTOH, if you can back out a patch for 1.2.3, I'd apply that.

| JFTR: It's currently not possible to compile OMPI on amd64 (out of the
| box). Though it compiles on i386
|
| http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-i386&stamp=1187000200&file=log&as=raw
|
| it fails on amd64:
|
| http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-amd64&stamp=1186969782&file=log&as=raw
|
| stacktrace.c: In function 'opal_show_stackframe':
| stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this function)
| stacktrace.c:145: error: (Each undeclared identifier is reported only once
| stacktrace.c:145: error: for each function it appears in.)
| stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this function)
| stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this function)
| make[4]: *** [stacktrace.lo] Error 1
| make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'
|
| This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h: the
| relevant #define's are placed in an #ifdef __i386__ condition. After
| extending this for __x86_64__, everything works fine.
|
| Should I file a bug report against libc0.1-dev or will you take care?

I'm confused. What is libc0.1-dev?

Also note that I happened to have uploaded a third Debian revision of 1.2.3 yesterday, and that Debian release 1.2.3-3 built fine on amd64 as per:

http://buildd.debian.org/build.php?&pkg=openmpi&ver=1.2.3-3&arch=amd64&file=log

So are we sure there's a bug? Maybe you were just bitten by something in SVN that is not yet deemed release quality?

| I'll keep you posted...

I appreciate that.

Cheers, Dirk

| --
| Cluster and Metacomputing Working Group
| Friedrich-Schiller-Universität Jena, Germany
|
| private: http://adi.thur.de
| ___
| devel mailing list
| de...@open-mpi.org
| http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] Collectives interface change
On Thu, 2007-08-09 at 14:49 -0600, Brian Barrett wrote:
> Hi all -
>
> There was significant discussion this week at the collectives meeting
> about improving the selection logic for collective components. While
> we'd like the automated collectives selection logic laid out in the
> Collv2 document, it was decided that as a first step, we would allow
> more than one component (plus basic) to be used for a given communicator.
>
> This mandated the change of a couple of things in the collectives
> interface, namely how collectives module data is found (passed into a
> function, rather than a static pointer on the component) and a bit of
> the initialization sequence.
>
> The revised interface and the rest of the code is available in an svn
> temp branch:
>
> https://svn.open-mpi.org/svn/ompi/tmp/bwb-coll-select
>
> Thus far, most of the components in common use have been updated.
> The notable exception is the tuned collectives routine, which Ollie
> is updating in the near future.
>
> If you have any comments on the changes, please let me know. If not,
> the changes will move to the trunk once Ollie has finished updating
> the tuned component.

Done.

Ollie
Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
On Aug 13, 2007, at 4:28 PM, Adrian Knoth wrote:

> I'll now compile the 1.2.3 release tarball and see if I can reproduce
> the segfaults. On the other hand, I guess nobody is using OMPI on
> GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot
> would also fix the problem (think of "fixed in experimental").

FWIW, the OMPI subversion trunk has diverged quite a bit from the v1.2 branch; you might want to wait until the fixes get moved over to the v1.2 branch and take a snapshot from there (i.e., what will eventually become the v1.2.4 release).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
On Thu, Aug 02, 2007 at 10:51:13AM +0200, Adrian Knoth wrote:

> > We (as in the Debian maintainer for Open MPI) got this bug report from
> > Uwe who sees mpi apps segfault on Debian systems with the FreeBSD
> > kernel.
> > Any input would be greatly appreciated!

> I'll follow the QEMU instructions on your website and investigate on
> my own ;)

I was able to get OMPI running on kfreebsd-amd64. I used a nightly snapshot from the trunk, so the problem is "more or less fixed by upstream" ;)

adi@debian:~$ ./ompi/bin/mpirun -np 2 ring
0: sending message (0) to 1
0: sent message
1: waiting for message
1: got message (1) from 0, sending to 0
0: got message (1) from 1

adi@debian:~$ ./ompi/bin/ompi_info
               Open MPI: 1.3a1r15820
  Open MPI SVN revision: r15820
               Open RTE: 1.3a1r15820
  Open RTE SVN revision: r15820
                   OPAL: 1.3a1r15820
      OPAL SVN revision: r15820
                 Prefix: /home/adi/ompi
Configured architecture: x86_64-unknown-kfreebsd6.2-gnu

I'll now compile the 1.2.3 release tarball and see if I can reproduce the segfaults. On the other hand, I guess nobody is using OMPI on GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot would also fix the problem (think of "fixed in experimental").

JFTR: It's currently not possible to compile OMPI on amd64 (out of the box). Though it compiles on i386

http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-i386&stamp=1187000200&file=log&as=raw

it fails on amd64:

http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-amd64&stamp=1186969782&file=log&as=raw

stacktrace.c: In function 'opal_show_stackframe':
stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this function)
stacktrace.c:145: error: (Each undeclared identifier is reported only once
stacktrace.c:145: error: for each function it appears in.)
stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this function)
stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this function)
make[4]: *** [stacktrace.lo] Error 1
make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'

This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h: the relevant #define's are placed in an #ifdef __i386__ condition. After extending this for __x86_64__, everything works fine.

Should I file a bug report against libc0.1-dev or will you take care?

I'll keep you posted...

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
Re: [OMPI devel] Problem in mpool rdma finalize
On Aug 13, 2007, at 4:04 PM, Gleb Natapov wrote:

> > mpool rdma finalize was an empty function. I changed it to do the
> > "finalize" job: go over all registered segments in the mpool and
> > release them one by one. The mpool uses a reference counter for each
> > memory region, which prevents us from a double-free bug. In the openib
> > btl, all memory that was registered with the mpool will also be
> > unregistered with the mpool at finalize. So maybe in gm the memory
> > (that was registered with the mpool) is released directly (not via the
> > mpool) and that causes the segfault.
>
> As far as I understand, the problem Tim sees is much more serious.
> During finalize the gm BTL is unloaded, and only after that is mpool
> finalize called. The mpool uses callbacks into the gm BTL to
> register/unregister memory, but the BTL is not there anymore.

Right. We had the same problem in the openib btl, too. See
https://svn.open-mpi.org/trac/ompi/changeset/15735.

I don't know if this is the exact same scenario Tim is running into, but the end result is the same (the openib btl was being destroyed and still leaving memory registered in the mpool).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 03:59:28PM -0400, Richard Graham wrote:
> On 8/13/07 3:52 PM, "Gleb Natapov" wrote:
> > On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > > Here are the items we have identified:
> >
> > All those things sound very promising. Is there a tmp branch where you
> > are going to work on this?
>
> tmp/latency
>
> Some changes have already gone in - mainly trying to remove as much as
> possible from the isend/send path, before moving on to the list below.
> Do you have cycles to help with this?

I am very interested, not sure about cycles though. I'll get back from my vacation next week and look over this list one more time to see where I can help.

> Rich
>
> > > 1) remove 0 byte optimization of not initializing the convertor
> > > This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> > > "if" in mca_pml_ob1_send_request_start_copy
> > > +++
> > > Measure the convertor initialization before taking any other action.
> > >
> > > 2) get rid of mca_pml_ob1_send_request_start_prepare and
> > > mca_pml_ob1_send_request_start_copy by removing the
> > > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> > > return OMPI_SUCCESS if the fragment can be marked as completed and
> > > OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> > > solves another problem: with IB, if there are a bunch of isends
> > > outstanding, we end up buffering them all in the btl, marking
> > > completion, and never getting them on the wire because the BTL runs
> > > out of credits; we never get credits back until finalize because we
> > > never call progress, since the requests are complete. There is one
> > > issue here: start_prepare calls prepare_src and start_copy calls
> > > alloc. I think we can work around this by just always using
> > > prepare_src; the OpenIB BTL will give a fragment off the free list
> > > anyway because the fragment is less than the eager limit.
> > > +++
> > > Make the BTL return different return codes for the send. If the
> > > fragment is gone, then the PML is responsible for marking the MPI
> > > request as completed and so on. Only the updated BTLs will get any
> > > benefit from this feature. Add a flag into the descriptor to allow
> > > or not the BTL to free the fragment.
> > >
> > > Add a 3-level flag:
> > > - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
> > >   the send, and then it reports back a special return to the PML
> > > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
> > >   by the BTL once the completion callback was triggered.
> > > - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the
> > >   fragment at all (the PML is responsible for this).
> > >
> > > Return codes:
> > > - done and there will be no callbacks
> > > - not done, wait for a callback later
> > > - error state
> > >
> > > 3) Change the remote callback function (and tag value based on what
> > > data we are sending), don't use mca_pml_ob1_recv_frag_callback for
> > > everything! I think we need:
> > >
> > > mca_pml_ob1_recv_frag_match
> > > mca_pml_ob1_recv_frag_rndv
> > > mca_pml_ob1_recv_frag_rget
> > >
> > > mca_pml_ob1_recv_match_ack_copy
> > > mca_pml_ob1_recv_match_ack_pipeline
> > >
> > > mca_pml_ob1_recv_copy_frag
> > > mca_pml_ob1_recv_put_request
> > > mca_pml_ob1_recv_put_fin
> > > +++
> > > Passing the callback as a parameter to the match function will save
> > > us 2 switches. Add more registrations in the BTL in order to jump
> > > directly to the correct function (the first 3 require a match while
> > > the others don't). 4 & 4 bits on the tag, so each layer will have 4
> > > bits of tags [i.e. the first 4 bits for the protocol tag and the
> > > lower 4 bits are up to the protocol] and the registration table will
> > > still be local to each component.
> > >
> > > 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> > > switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> > > I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> > > take a function pointer for what it should call on a successful
> > > match. So based on the receive callback we can pass the correct
> > > scheduling function to invoke into the generic
> > > mca_pml_ob1_recv_frag_match
Re: [OMPI devel] Problem in mpool rdma finalize
On Mon, Aug 13, 2007 at 05:00:37PM +0300, Pavel Shamis (Pasha) wrote:
> Jeff Squyres wrote:
> > FWIW: we fixed this recently in the openib BTL by ensuring that all
> > registered memory is freed during the BTL finalize (vs. the mpool
> > finalize).
> >
> > This is a new issue because the mpool finalize was just recently
> > expanded to un-register all of its memory as part of the NIC-restart
> > effort (and will likely also be needed for checkpoint/restart...?).
>
> mpool rdma finalize was an empty function. I changed it to do the
> "finalize" job: go over all registered segments in the mpool and release
> them one by one. The mpool uses a reference counter for each memory
> region, which prevents us from a double-free bug. In the openib btl,
> all memory that was registered with the mpool will also be unregistered
> with the mpool at finalize. So maybe in gm the memory (that was
> registered with the mpool) is released directly (not via the mpool) and
> that causes the segfault.

As far as I understand, the problem Tim sees is much more serious. During finalize the gm BTL is unloaded, and only after that is mpool finalize called. The mpool uses callbacks into the gm BTL to register/unregister memory, but the BTL is not there anymore.

> Pasha
>
> > On Aug 13, 2007, at 9:11 AM, Tim Prins wrote:
> >
> >> Hi folks,
> >>
> >> I have run into a problem with mca_mpool_rdma_finalize as implemented
> >> in r15557. With the t_win onesided test, running over gm, it
> >> segfaults. What appears to be happening is that some memory is
> >> registered with gm, and then gets freed by mca_mpool_rdma_finalize.
> >> But the free function that it is using is in the gm btl, and the btls
> >> are unloaded before the mpool is shut down. So the function call
> >> segfaults.
> >>
> >> If I change the code so we never unload the btls (and we don't free
> >> the gm port), it works fine.
> >>
> >> Note that the openib btl works just fine.
> >>
> >> Forgive me if this is a known problem, I am trying to catch up from my
> >> vacation...
> >>
> >> Tim
> >>
> >> ---
> >> If anyone cares, here is the callstack:
> >> (gdb) bt
> >> #0  0x404de825 in ?? () from /lib/libgcc_s.so.1
> >> #1  0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690)
> >>     at mpool_rdma_module.c:431
> >> #2  0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
> >> #3  0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
> >> #4  0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
> >> #5  0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
> >> (gdb)
> >> gdb shows that at this point the gm btl is no longer loaded.

--
Gleb.
Re: [OMPI devel] openib btl header caching
On 8/13/07 3:52 PM, "Gleb Natapov" wrote:

> On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > Here are the items we have identified:
>
> All those things sound very promising. Is there a tmp branch where you
> are going to work on this?

tmp/latency

Some changes have already gone in - mainly trying to remove as much as possible from the isend/send path, before moving on to the list below. Do you have cycles to help with this?

Rich

> > 1) remove 0 byte optimization of not initializing the convertor
> > This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> > "if" in mca_pml_ob1_send_request_start_copy
> > +++
> > Measure the convertor initialization before taking any other action.
> >
> > 2) get rid of mca_pml_ob1_send_request_start_prepare and
> > mca_pml_ob1_send_request_start_copy by removing the
> > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> > return OMPI_SUCCESS if the fragment can be marked as completed and
> > OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> > solves another problem: with IB, if there are a bunch of isends
> > outstanding, we end up buffering them all in the btl, marking
> > completion, and never getting them on the wire because the BTL runs
> > out of credits; we never get credits back until finalize because we
> > never call progress, since the requests are complete. There is one
> > issue here: start_prepare calls prepare_src and start_copy calls
> > alloc. I think we can work around this by just always using
> > prepare_src; the OpenIB BTL will give a fragment off the free list
> > anyway because the fragment is less than the eager limit.
> > +++
> > Make the BTL return different return codes for the send. If the
> > fragment is gone, then the PML is responsible for marking the MPI
> > request as completed and so on. Only the updated BTLs will get any
> > benefit from this feature. Add a flag into the descriptor to allow
> > or not the BTL to free the fragment.
> >
> > Add a 3-level flag:
> > - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
> >   the send, and then it reports back a special return to the PML
> > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
> >   by the BTL once the completion callback was triggered.
> > - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
> >   at all (the PML is responsible for this).
> >
> > Return codes:
> > - done and there will be no callbacks
> > - not done, wait for a callback later
> > - error state
> >
> > 3) Change the remote callback function (and tag value based on what
> > data we are sending), don't use mca_pml_ob1_recv_frag_callback for
> > everything! I think we need:
> >
> > mca_pml_ob1_recv_frag_match
> > mca_pml_ob1_recv_frag_rndv
> > mca_pml_ob1_recv_frag_rget
> >
> > mca_pml_ob1_recv_match_ack_copy
> > mca_pml_ob1_recv_match_ack_pipeline
> >
> > mca_pml_ob1_recv_copy_frag
> > mca_pml_ob1_recv_put_request
> > mca_pml_ob1_recv_put_fin
> > +++
> > Passing the callback as a parameter to the match function will save
> > us 2 switches. Add more registrations in the BTL in order to jump
> > directly to the correct function (the first 3 require a match while
> > the others don't). 4 & 4 bits on the tag, so each layer will have 4
> > bits of tags [i.e. the first 4 bits for the protocol tag and the
> > lower 4 bits are up to the protocol] and the registration table will
> > still be local to each component.
> >
> > 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> > switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> > I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> > take a function pointer for what it should call on a successful match.
> > So based on the receive callback we can pass the correct scheduling
> > function to invoke into the generic mca_pml_ob1_recv_frag_match
> >
> > Recv_request progress is called in a generic way from multiple places,
> > and we do a big switch inside. In the match function we might want to
> > pass a function pointer to the successful match progress function.
> > This way we will be able to specialize what happens after the match,
> > in a more optimized way. Or the recv_request_match can return the
> > match and then the caller will have to specialize its action.
>
> ---
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> Here are the items we have identified:

All those things sound very promising. Is there a tmp branch where you are going to work on this?

> 1) remove 0 byte optimization of not initializing the convertor
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> "if" in mca_pml_ob1_send_request_start_copy
> +++
> Measure the convertor initialization before taking any other action.
>
> 2) get rid of mca_pml_ob1_send_request_start_prepare and
> mca_pml_ob1_send_request_start_copy by removing the
> MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> return OMPI_SUCCESS if the fragment can be marked as completed and
> OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> solves another problem: with IB, if there are a bunch of isends
> outstanding, we end up buffering them all in the btl, marking
> completion, and never getting them on the wire because the BTL runs
> out of credits; we never get credits back until finalize because we
> never call progress, since the requests are complete. There is one
> issue here: start_prepare calls prepare_src and start_copy calls
> alloc. I think we can work around this by just always using
> prepare_src; the OpenIB BTL will give a fragment off the free list
> anyway because the fragment is less than the eager limit.
> +++
> Make the BTL return different return codes for the send. If the
> fragment is gone, then the PML is responsible for marking the MPI
> request as completed and so on. Only the updated BTLs will get any
> benefit from this feature. Add a flag into the descriptor to allow
> or not the BTL to free the fragment.
>
> Add a 3-level flag:
> - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
>   the send, and then it reports back a special return to the PML
> - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
>   by the BTL once the completion callback was triggered.
> - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
>   at all (the PML is responsible for this).
>
> Return codes:
> - done and there will be no callbacks
> - not done, wait for a callback later
> - error state
>
> 3) Change the remote callback function (and tag value based on what
> data we are sending), don't use mca_pml_ob1_recv_frag_callback for
> everything! I think we need:
>
> mca_pml_ob1_recv_frag_match
> mca_pml_ob1_recv_frag_rndv
> mca_pml_ob1_recv_frag_rget
>
> mca_pml_ob1_recv_match_ack_copy
> mca_pml_ob1_recv_match_ack_pipeline
>
> mca_pml_ob1_recv_copy_frag
> mca_pml_ob1_recv_put_request
> mca_pml_ob1_recv_put_fin
> +++
> Passing the callback as a parameter to the match function will save
> us 2 switches. Add more registrations in the BTL in order to jump
> directly to the correct function (the first 3 require a match while
> the others don't). 4 & 4 bits on the tag, so each layer will have 4
> bits of tags [i.e. the first 4 bits for the protocol tag and the
> lower 4 bits are up to the protocol] and the registration table will
> still be local to each component.
>
> 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> take a function pointer for what it should call on a successful match.
> So based on the receive callback we can pass the correct scheduling
> function to invoke into the generic mca_pml_ob1_recv_frag_match
>
> Recv_request progress is called in a generic way from multiple places,
> and we do a big switch inside. In the match function we might want to
> pass a function pointer to the successful match progress function.
> This way we will be able to specialize what happens after the match,
> in a more optimized way. Or the recv_request_match can return the
> match and then the caller will have to specialize its action.
>
> 5) Don't initialize the entire request. We can use item 2 below (if
> we get back OMPI_SUCCESS from btl_send) then we don't need to fully
Re: [OMPI devel] openib btl header caching
On 8/13/07 12:34 PM, "Galen Shipman" wrote:

>> Ok, here are the numbers on my machines:
>>
>> 0 bytes:
>>   mvapich with header caching: 1.56
>>   mvapich without header caching: 1.79
>>   ompi 1.2: 1.59
>>
>> So on zero bytes ompi is not so bad. Also we can see that header
>> caching decreases the mvapich latency by 0.23.
>>
>> 1 byte:
>>   mvapich with header caching: 1.58
>>   mvapich without header caching: 1.83
>>   ompi 1.2: 1.73
>
> Is this just convertor initialization cost?

Last night I measured the cost of the convertor initialization in ob1 on my dual-processor Mac, using ompi-tests/simple/ping/mpi-ping, and it costs 0.02 to 0.03 microseconds. To be specific, I commented out the check for 0-byte message size, and the latency went up from about 0.59 usec (this is with modified code in tmp/latency) to about 0.62 usec.

Rich

> - Galen
>
>> And here ompi makes some latency jump. In mvapich the header caching
>> decreases the header size from 56 bytes to 12 bytes. What is the
>> header size (pml + btl) in ompi?
>>> The match header size is 16 bytes, so it looks like ours is already
>>> optimized ...
>> So for 0-byte messages we are sending only 16 bytes on the wire, is
>> that correct?
>>
>> Pasha.
>>> george.

___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:12 AM, Galen Shipman wrote:

> 1) remove 0 byte optimization of not initializing the convertor
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> "if" in mca_pml_ob1_send_request_start_copy
> +++
> Measure the convertor initialization before taking any other action.

I talked with Galen and then with Pasha; Pasha will look into this. Specifically:

- Investigate ob1 and find all the places we're doing 0-byte optimizations (I don't think that there are any in the openib btl...?).
- Selectively remove each of the zero-byte optimizations and measure what the cost is, both in terms of time and cycles (using the RDTSC macro/inline function that's somewhere already in OMPI). If possible, it would be best to measure these individually rather than removing all of them and looking at the aggregate.
- Do all of this with and without heterogeneous support enabled to measure what the cost of heterogeneity is.

This will enable us to find out where the time is being spent. Clearly, there are some differences between zero- and nonzero-byte messages, so it would be a good first step to understand exactly what they are.

> 2) get rid of mca_pml_ob1_send_request_start_prepare and [...]

This is also all good stuff; let's look into the zero-byte optimizations first and then tackle the rest of these after that. Good?

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:28 AM, George Bosilca wrote:

>> Such a scheme is certainly possible, but I see even less use for it
>> than use cases for the existing microbenchmarks. Specifically, header
>> caching *can* happen in real applications (i.e., repeatedly send
>> short messages with the same MPI signature), but repeatedly sending
>> to the same peer with exactly the same signature *and* exactly the
>> same "long-enough" data (i.e., more than a small number of ints that
>> an app could use for its own message data caching) is indicative of a
>> poorly-written MPI application IMHO.
>
> If you look at the message size distribution for most of the HPC
> applications (at least the ones that get investigated in the papers)
> you will see that very small messages are only an insignificant
> percentage of messages.

This would be different from what Patrick has told us about Myricom's analysis of real-world MPI applications, and one of the strong points of QLogic's HCAs (that it's all about short-message latency / injection rate; bandwidth issues are [at least currently] secondary). :-)

> As this "optimization" only addresses these kinds of messages, I
> doubt there is any real benefit from the application's point of view
> (obviously there will be a few exceptions as usual). The header
> caching only makes sense for very small messages (MVAPICH only
> implements header caching for messages up to 155 bytes [that's less
> than 20 doubles] if I remember well), which makes it a real benchmark
> optimization.

I don't have enough data to say. But I'm sure there are at least *some* applications out there that would benefit from it. Probably somewhere between 1 and 99%. ;-)

But just to reiterate/be clear: my goal here is to reduce latency. If header caching is not the way to go, then so be it.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
Brian Barrett wrote:
> On Aug 13, 2007, at 9:33 AM, George Bosilca wrote:
>> On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:
>>> Jeff Squyres wrote:
>>>> I guess reading the graph that Pasha sent is difficult; Pasha --
>>>> can you send the actual numbers?
>>> Ok, here are the numbers on my machines:
>>> 0 bytes: mvapich with header caching: 1.56; mvapich without header
>>> caching: 1.79; ompi 1.2: 1.59
>>> So on zero bytes ompi is not so bad. Also we can see that header
>>> caching decreases the mvapich latency by 0.23.
>>> 1 byte: mvapich with header caching: 1.58; mvapich without header
>>> caching: 1.83; ompi 1.2: 1.73
>>> And here ompi makes some latency jump. In mvapich the header
>>> caching decreases the header size from 56 bytes to 12 bytes. What
>>> is the header size (pml + btl) in ompi?
>> The match header size is 16 bytes, so it looks like ours is already
>> optimized ...
>
> Pasha -- Is your build of Open MPI built with --disable-heterogeneous?
> If not, our headers all grow slightly to support heterogeneous
> operations. For the heterogeneous case, a 1 byte message includes:

I didn't build with "--disable-heterogeneous". So heterogeneous support was enabled in the build.

>   16 bytes for the match header
>    4 bytes for the Open IB header
>    1 byte for the payload
>   21 bytes total
>
> If you are using eager RDMA, there's an extra 4 bytes for the RDMA
> length in the footer. Without heterogeneous support, 2 bytes get
> knocked off the size of the match header, so the whole thing will be
> 19 bytes (+ 4 for the eager RDMA footer).

I used eager RDMA - it is faster than send. So the message size on the wire for 1 byte in my case was 25 bytes vs. 13 bytes in mvapich. And if I use --disable-heterogeneous it will decrease by 2 bytes. So it sounds like we are pretty optimized.

> There are also considerably more ifs in the code if heterogeneous is
> used, especially on x86 machines.
>
> Brian
Re: [OMPI devel] openib btl header caching
>> Ok, here are the numbers on my machines:
>>
>> 0 bytes:
>>   mvapich with header caching: 1.56
>>   mvapich without header caching: 1.79
>>   ompi 1.2: 1.59
>>
>> So on zero bytes ompi is not so bad. Also we can see that header
>> caching decreases the mvapich latency by 0.23.
>>
>> 1 byte:
>>   mvapich with header caching: 1.58
>>   mvapich without header caching: 1.83
>>   ompi 1.2: 1.73
>
> Is this just convertor initialization cost?
>
> - Galen

>> And here ompi makes some latency jump. In mvapich the header caching
>> decreases the header size from 56 bytes to 12 bytes. What is the
>> header size (pml + btl) in ompi?
>>> The match header size is 16 bytes, so it looks like ours is already
>>> optimized ...

So for 0-byte messages we are sending only 16 bytes on the wire, is that correct?

Pasha.

>>> george.
Re: [OMPI devel] openib btl header caching
George Bosilca wrote: On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote: Jeff Squyres wrote: I guess reading the graph that Pasha sent is difficult; Pasha -- can you send the actual numbers? Ok here is the numbers on my machines: 0 bytes mvapich with header caching: 1.56 mvapich without header caching: 1.79 ompi 1.2: 1.59 So on zero bytes ompi not so bad. Also we can see that header caching decrease the mvapich latency on 0.23 1 bytes mvapich with header caching: 1.58 mvapich without header caching: 1.83 ompi 1.2: 1.73 And here ompi make some latency jump. In mvapich the header caching decrease the header size from 56bytes to 12bytes. What is the header size (pml + btl) in ompi ? The match header size is 16 bytes, so it looks like ours is already optimized ... So for 0 bytes message we are sending only 16bytes on the wire , is it correct ? Pasha. george. Pasha
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 9:33 AM, George Bosilca wrote:
> On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:
>> Jeff Squyres wrote:
>>> I guess reading the graph that Pasha sent is difficult; Pasha --
>>> can you send the actual numbers?
>> Ok, here are the numbers on my machines:
>> 0 bytes: mvapich with header caching: 1.56; mvapich without header
>> caching: 1.79; ompi 1.2: 1.59
>> So on zero bytes ompi is not so bad. Also we can see that header
>> caching decreases the mvapich latency by 0.23.
>> 1 byte: mvapich with header caching: 1.58; mvapich without header
>> caching: 1.83; ompi 1.2: 1.73
>> And here ompi makes some latency jump. In mvapich the header caching
>> decreases the header size from 56 bytes to 12 bytes. What is the
>> header size (pml + btl) in ompi?
> The match header size is 16 bytes, so it looks like ours is already
> optimized ...

Pasha -- Is your build of Open MPI built with --disable-heterogeneous? If not, our headers all grow slightly to support heterogeneous operations. For the heterogeneous case, a 1-byte message includes:

  16 bytes for the match header
   4 bytes for the Open IB header
   1 byte for the payload
  21 bytes total

If you are using eager RDMA, there's an extra 4 bytes for the RDMA length in the footer. Without heterogeneous support, 2 bytes get knocked off the size of the match header, so the whole thing will be 19 bytes (+ 4 for the eager RDMA footer).

There are also considerably more ifs in the code if heterogeneous is used, especially on x86 machines.

Brian
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote: Jeff Squyres wrote: I guess reading the graph that Pasha sent is difficult; Pasha -- can you send the actual numbers? Ok here is the numbers on my machines: 0 bytes mvapich with header caching: 1.56 mvapich without header caching: 1.79 ompi 1.2: 1.59 So on zero bytes ompi not so bad. Also we can see that header caching decrease the mvapich latency on 0.23 1 bytes mvapich with header caching: 1.58 mvapich without header caching: 1.83 ompi 1.2: 1.73 And here ompi make some latency jump. In mvapich the header caching decrease the header size from 56bytes to 12bytes. What is the header size (pml + btl) in ompi ? The match header size is 16 bytes, so it looks like ours is already optimized ... george. Pasha
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:07 AM, Jeff Squyres wrote:

> Such a scheme is certainly possible, but I see even less use for it
> than use cases for the existing microbenchmarks. Specifically, header
> caching *can* happen in real applications (i.e., repeatedly send
> short messages with the same MPI signature), but repeatedly sending
> to the same peer with exactly the same signature *and* exactly the
> same "long-enough" data (i.e., more than a small number of ints that
> an app could use for its own message data caching) is indicative of a
> poorly-written MPI application IMHO.

If you look at the message size distribution for most of the HPC applications (at least the ones that get investigated in the papers) you will see that very small messages are only an insignificant percentage of messages. As this "optimization" only addresses these kinds of messages, I doubt there is any real benefit from the application's point of view (obviously there will be a few exceptions as usual). The header caching only makes sense for very small messages (MVAPICH only implements header caching for messages up to 155 bytes [that's less than 20 doubles] if I remember well), which makes it a real benchmark optimization.

>> But don't complain if your Linpack run fails.
>
> I assume you're talking about bugs in the implementation; not a
> problem with the approach, right?

Of course, there is no apparent problem with my approach :) It is called an educated guess based on repetitive human-behavior analysis.

george.
Re: [OMPI devel] openib btl header caching
Jeff Squyres wrote:
> I guess reading the graph that Pasha sent is difficult; Pasha -- can
> you send the actual numbers?

Ok, here are the numbers on my machines:

0 bytes:
  mvapich with header caching: 1.56
  mvapich without header caching: 1.79
  ompi 1.2: 1.59

So on zero bytes ompi is not so bad. Also we can see that header caching decreases the mvapich latency by 0.23.

1 byte:
  mvapich with header caching: 1.58
  mvapich without header caching: 1.83
  ompi 1.2: 1.73

And here ompi makes some latency jump. In mvapich the header caching decreases the header size from 56 bytes to 12 bytes. What is the header size (pml + btl) in ompi?

Pasha
Re: [OMPI devel] openib btl header caching
I think we need to take a step back from micro-optimizations such as header caching. Rich, George, Brian and I are currently looking into latency improvements. We came up with several areas of performance enhancement that can be done with minimal disruption. The progress issue that Christian and others have pointed out does appear to be a problem, but will take a bit more work. I would like to see progress in these areas first, as I really don't like the idea of caching more endpoint state in OMPI for micro-benchmark latency improvements until we are certain we have done the groundwork for improving latency in the general case.

Here are the items we have identified:

1) Remove the 0-byte optimization of not initializing the convertor. This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an "if" in mca_pml_ob1_send_request_start_copy.
+++ Measure the convertor initialization before taking any other action.

2) Get rid of mca_pml_ob1_send_request_start_prepare and mca_pml_ob1_send_request_start_copy by removing the MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send return OMPI_SUCCESS if the fragment can be marked as completed and OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This solves another problem: with IB, if there are a bunch of isends outstanding we end up buffering them all in the btl, marking completion, and never getting them on the wire because the BTL runs out of credits; we never get credits back until finalize because we never call progress since the requests are complete. There is one issue here: start_prepare calls prepare_src and start_copy calls alloc. I think we can work around this by just always using prepare_src; the OpenIB BTL will give a fragment off the free list anyway because the fragment is less than the eager limit.
+++ Make the BTL return different return codes for the send. If the fragment is gone, then the PML is responsible for marking the MPI request as completed and so on. Only the updated BTLs will get any benefit from this feature. Add a flag into the descriptor to allow (or not) the BTL to free the fragment.

Add a 3-level flag:
- BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL after the send, and then it reports back a special return code to the PML
- BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released by the BTL once the completion callback has been triggered
- PML_HAVE_OWNERSHIP: the BTL is not allowed to release the fragment at all (the PML is responsible for this)

Return codes:
- done, and there will be no callbacks
- not done, wait for a callback later
- error state

3) Change the remote callback function (and tag value based on what data we are sending); don't use mca_pml_ob1_recv_frag_callback for everything! I think we need:

mca_pml_ob1_recv_frag_match
mca_pml_ob1_recv_frag_rndv
mca_pml_ob1_recv_frag_rget

mca_pml_ob1_recv_match_ack_copy
mca_pml_ob1_recv_match_ack_pipeline

mca_pml_ob1_recv_copy_frag
mca_pml_ob1_recv_put_request
mca_pml_ob1_recv_put_fin

+++ Passing the callback as a parameter to the match function will save us 2 switches. Add more registrations in the BTL in order to jump directly into the correct function (the first 3 require a match while the others don't). Split the tag 4 & 4 bits so each layer has 4 bits of tags [i.e. the first 4 bits are the protocol tag and the lower 4 bits are up to the protocol], and the registration table will still be local to each component.

4) Get rid of mca_pml_ob1_recv_request_progress; this does the same switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback! I think what we can do here is modify mca_pml_ob1_recv_frag_match to take a function pointer for what it should call on a successful match. So based on the receive callback we can pass the correct scheduling function to invoke into the generic mca_pml_ob1_recv_frag_match.

recv_request_progress is called in a generic way from multiple places, and we do a big switch inside. In the match function we might want to pass a function pointer to the successful-match progress function. This way we will be able to specialize what happens after the match, in a more optimized way. Or recv_request_match can return the match and then the caller will have to specialize its action.

---
Re: [OMPI devel] openib btl header caching
We're working on it. Give us a few weeks to finish implementing all the planned optimizations/cleanups in the PML and then we can talk about tricks. We're expecting/hoping to slim down the PML layer by more than 0.5, so this header caching optimization might not make any sense at that point.

Thanks,
george.

On Aug 13, 2007, at 10:38 AM, Jeff Squyres wrote:
> On Aug 13, 2007, at 10:34 AM, Jeff Squyres wrote:
>> All this being said -- is there another reason to lower our latency?
>> My main goal here is to lower the latency. If header caching is
>> unattractive, then another method would be fine.
> Oops: s/reason/way/. That makes my sentence make much more sense. :-)
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 10:49 AM, George Bosilca wrote:

> You want a dirtier trick for benchmarks ... Here it is ... Implement
> a compression-like algorithm based on checksums. The data-type engine
> can compute a checksum for each fragment, and if the checksum matches
> one in the peer's [limited] history (so we can claim our
> communication protocol is adaptive), then we replace the actual
> message content with the matched id in the common history. Checksums
> are fairly cheap, and lookup in a balanced tree is cheap too, so we
> will end up with a lot of improvement (as instead of sending a full
> fragment we will end up sending one int). Based on the way most of
> the benchmarks initialize the user data (when they don't, everything
> is mostly 0), this trick might work in all cases for the benchmarks ...

Are you sure you didn't want to publish a paper about this before you sent it across a public list? Now someone else is likely to "invent" this scheme and get credit for it. ;-)

Such a scheme is certainly possible, but I see even less use for it than use cases for the existing microbenchmarks. Specifically, header caching *can* happen in real applications (i.e., repeatedly sending short messages with the same MPI signature), but repeatedly sending to the same peer with exactly the same signature *and* exactly the same "long-enough" data (i.e., more than a small number of ints that an app could use for its own message data caching) is indicative of a poorly-written MPI application IMHO.

> But don't complain if your Linpack run fails.

I assume you're talking about bugs in the implementation; not a problem with the approach, right?

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
On Sun, 12 Aug 2007, Gleb Natapov wrote:

> > Any objections? We can discuss what approaches we want to take
> > (there's going to be some complications because of the PML driver,
> > etc.); perhaps in the Tuesday Mellanox teleconf...?
>
> My main objection is that the only reason you propose to do this is
> some bogus benchmark? Is there any other reason to implement header
> caching? I also hope you don't propose to break layering and somehow
> cache PML headers in BTL.

Gleb is hitting the main points I wanted to bring up. We had examined this header caching in the context of PSM a little while ago. 0.5us is much more than we had observed -- at 3GHz, 0.5us would be about 1500 cycles of code that has few branches. For us, with a much bigger header and more fields to fetch from different structures, it was more like 350 cycles, which is on the order of 0.1us and not worth the effort (in code complexity, readability and, frankly, motivation for performance).

Maybe there's more to it than just "code caching" -- like sending from pre-pinned headers or using RDMA with immediate, etc. But I'd be surprised to find out that the openib btl doesn't do the best thing here.

I have pretty good evidence that for CM, the latency difference comes from the receive side (in particular opal_progress). Doesn't the openib btl receive side do something similar with opal_progress, i.e. register a callback function? It probably does something different, like check a few RDMA mailboxes (or per-peer landing pads), but anything that gets called before or after it as part of opal_progress is cause for slowdown.

. . christian

-- christian.b...@qlogic.com (QLogic Host Solutions Group, formerly Pathscale)
Re: [OMPI devel] openib btl header caching
You want a dirtier trick for benchmarks ... Here it is ... Implement a compression-like algorithm based on checksums. The data-type engine can compute a checksum for each fragment, and if the checksum matches one in the peer's [limited] history (so we can claim our communication protocol is adaptive), then we replace the actual message content with the matched id in the common history. Checksums are fairly cheap, and lookup in a balanced tree is cheap too, so we will end up with a lot of improvement (as instead of sending a full fragment we will end up sending one int). Based on the way most of the benchmarks initialize the user data (when they don't, everything is mostly 0), this trick might work in all cases for the benchmarks ... But don't complain if your Linpack run fails.

george.

On Aug 13, 2007, at 10:39 AM, Gleb Natapov wrote:
> On Mon, Aug 13, 2007 at 10:36:19AM -0400, Jeff Squyres wrote:
>> In short: it's an even dirtier trick than header caching (for
>> example), and we'd get beat up about it.
> That was a joke :) (But 3D drivers really do such things :( )
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 10:36:19AM -0400, Jeff Squyres wrote:
> On Aug 13, 2007, at 6:36 AM, Gleb Natapov wrote:
>
> >> Pallas, Presta (as I know) also use a static rank. So let's start
> >> to fix all "bogus" benchmarks :-) ?
> >>
> > All benchmarks are bogus. I have a better optimization. Check the
> > name of the executable and if it is some known benchmark, send one
> > byte instead of the real message. 3D drivers do this, why can't we?
>
> Because we'd end up in an arms race of benchmark argv[0] names and
> what is hard-coded in Open MPI. Users/customers/partners would soon
> enough figure out that this is what we're doing and either use "mv"
> or "ln -s" to get around our hack and see the real numbers anyway.
>
> In short: it's an even dirtier trick than header caching (for
> example), and we'd get beat up about it.
>
That was a joke :) (But 3D drivers really do such things :( )

-- Gleb.
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 10:34 AM, Jeff Squyres wrote: All this being said -- is there another reason to lower our latency? My main goal here is to lower the latency. If header caching is unattractive, then another method would be fine. Oops: s/reason/way/. That makes my sentence make much more sense. :-) -- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 6:36 AM, Gleb Natapov wrote:

>> Pallas, Presta (as I know) also use a static rank. So let's start to
>> fix all "bogus" benchmarks :-) ?
>
> All benchmarks are bogus. I have a better optimization. Check the
> name of the executable and if it is some known benchmark, send one
> byte instead of the real message. 3D drivers do this, why can't we?

Because we'd end up in an arms race of benchmark argv[0] names and what is hard-coded in Open MPI. Users/customers/partners would soon enough figure out that this is what we're doing and either use "mv" or "ln -s" to get around our hack and see the real numbers anyway.

In short: it's an even dirtier trick than header caching (for example), and we'd get beat up about it.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
On Aug 12, 2007, at 3:49 PM, Gleb Natapov wrote:

>> - Mellanox tested MVAPICH with the header caching; latency was
>>   around 1.4us
>> - Mellanox tested MVAPICH without the header caching; latency was
>>   around 1.9us
>
> As far as I remember the Mellanox results, and according to our
> testing, the difference between MVAPICH with header caching and OMPI
> is 0.2-0.3us, not 0.5us. And MVAPICH without header caching is
> actually worse than OMPI for small messages.

I guess reading the graph that Pasha sent is difficult; Pasha -- can you send the actual numbers?

>> Given that OMPI is the lone outlier around 1.9us, I think we have no
>> choice except to implement the header caching and/or examine our
>> header to see if we can shrink it. Mellanox has volunteered to
>> implement header caching in the openib btl.
>
> I think we have a choice. Not implement header caching, but just
> change the osu_latency benchmark to send each message with a
> different tag :)

If only. :-) But that misses the point (and the fact that all the common ping-pong benchmarks use a single tag: NetPIPE, IMB, osu_latency, etc.). *All other MPI's* give us latency around 1.4us, but Open MPI is around 1.9us. So we need to do something. Are we optimizing for a benchmark? Yes. But we have to do it. Many people know that these benchmarks are fairly useless, but not enough -- too many customers do not, and education is not enough. "Sure, this MPI looks slower but, really, it isn't. Trust me; my name is Joe Isuzu." That's a hard sell.

> I am not against header caching per se, but if it will complicate the
> code even a little bit I don't think we should implement it just to
> benefit one fabricated benchmark (AFAIR before header caching was
> implemented in MVAPICH, mpi_latency actually sent messages with
> different tags).

That may be true and a reason for us to wail and gnash our teeth, but it doesn't change the current reality.

> Also there is really nothing to cache in the openib BTL. The openib
> BTL header is 4 bytes long. The caching will have to be done in OB1
> and there it will affect every other interconnect.

Surely there is *something* we can do -- what, exactly, is the objection to peeking inside the PML header down in the btl? Is it really so horrible for a btl to look inside the upper layer's header? I agree that the PML looking into a btl header would [obviously] be Bad.

All this being said -- is there another reason to lower our latency? My main goal here is to lower the latency. If header caching is unattractive, then another method would be fine.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] Problem in mpool rdma finalize
Jeff Squyres wrote:
> FWIW: we fixed this recently in the openib BTL by ensuring that all
> registered memory is freed during the BTL finalize (vs. the mpool
> finalize). This is a new issue because the mpool finalize was just
> recently expanded to un-register all of its memory as part of the
> NIC-restart effort (and will likely also be needed for
> checkpoint/restart...?).

mpool rdma finalize was an empty function. I changed it to do the "finalize" job - go over all registered segments in the mpool and release them one by one. The mpool uses a reference counter for each memory region, and that prevents us from a double-free bug. In the openib btl, all memory that was registered with the mpool will also be unregistered with the mpool at the finalize stage. So maybe in gm the memory (that was registered with the mpool) is released directly (not via the mpool) and that causes the segfault.

Pasha

> On Aug 13, 2007, at 9:11 AM, Tim Prins wrote:
>
> Hi folks, I have run into a problem with mca_mpool_rdma_finalize as
> implemented in r15557. With the t_win onesided test, running over gm,
> it segfaults. What appears to be happening is that some memory is
> registered with gm, and then gets freed by mca_mpool_rdma_finalize.
> But the free function that it is using is in the gm btl, and the btls
> are unloaded before the mpool is shut down. So the function call
> segfaults. If I change the code so we never unload the btls (and we
> don't free the gm port), it works fine. Note that the openib btl
> works just fine. Forgive me if this is a known problem, I am trying
> to catch up from my vacation... Tim
>
> --- If anyone cares, here is the callstack:
>
> (gdb) bt
> #0  0x404de825 in ?? () from /lib/libgcc_s.so.1
> #1  0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690) at mpool_rdma_module.c:431
> #2  0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
> #3  0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
> #4  0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
> #5  0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
> (gdb)
>
> gdb shows that at this point the gm btl is no longer loaded.
Re: [OMPI devel] Problem in mpool rdma finalize
FWIW: we fixed this recently in the openib BTL by ensuring that all registered memory is freed during the BTL finalize (vs. the mpool finalize). This is a new issue because the mpool finalize was just recently expanded to un-register all of its memory as part of the NIC-restart effort (and will likely also be needed for checkpoint/restart...?).

On Aug 13, 2007, at 9:11 AM, Tim Prins wrote:

> Hi folks, I have run into a problem with mca_mpool_rdma_finalize as
> implemented in r15557. With the t_win onesided test, running over gm,
> it segfaults. What appears to be happening is that some memory is
> registered with gm, and then gets freed by mca_mpool_rdma_finalize.
> But the free function that it is using is in the gm btl, and the btls
> are unloaded before the mpool is shut down. So the function call
> segfaults. If I change the code so we never unload the btls (and we
> don't free the gm port), it works fine. Note that the openib btl
> works just fine. Forgive me if this is a known problem, I am trying
> to catch up from my vacation... Tim
>
> --- If anyone cares, here is the callstack:
>
> (gdb) bt
> #0  0x404de825 in ?? () from /lib/libgcc_s.so.1
> #1  0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690) at mpool_rdma_module.c:431
> #2  0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
> #3  0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
> #4  0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
> #5  0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
> (gdb)
>
> gdb shows that at this point the gm btl is no longer loaded.

-- Jeff Squyres Cisco Systems
[OMPI devel] Problem in mpool rdma finalize
Hi folks,

I have run into a problem with mca_mpool_rdma_finalize as implemented in r15557. With the t_win onesided test, running over gm, it segfaults. What appears to be happening is that some memory is registered with gm, and then gets freed by mca_mpool_rdma_finalize. But the free function that it is using is in the gm btl, and the btls are unloaded before the mpool is shut down. So the function call segfaults. If I change the code so we never unload the btls (and we don't free the gm port), it works fine. Note that the openib btl works just fine.

Forgive me if this is a known problem, I am trying to catch up from my vacation...

Tim

--- If anyone cares, here is the callstack:

(gdb) bt
#0  0x404de825 in ?? () from /lib/libgcc_s.so.1
#1  0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690) at mpool_rdma_module.c:431
#2  0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
#3  0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
#4  0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
#5  0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
(gdb)

gdb shows that at this point the gm btl is no longer loaded.
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 4:06 AM, Pavel Shamis (Pasha) wrote:

>>> Any objections? We can discuss what approaches we want to take (there's
>>> going to be some complications because of the PML driver, etc.); perhaps
>>> in the Tuesday Mellanox teleconf...?
>>
>> My main objection is that the only reason you propose to do this is some
>> bogus benchmark?
>
> Pallas, Presta (as I know) also use a static rank. So let's start to fix
> all "bogus" benchmarks :-) ?
>
> Pasha.

Why not:

    for (i = 0; i < ITERATIONS; i++) {
        tag = i % MPI_TAG_UB;
        ...
    }

On a related note, we have often discussed the fact that benchmarks only give an upper bound on performance. I would expect that some users would also want to know the lower bound. For example, set a flag that causes the benchmark to use a different buffer each time in order to cause the registration cache to miss. I am sure we could come up with some other cases.

Scott
Re: [OMPI devel] openib btl header caching
Jeff Squyres wrote:

> With Mellanox's new HCA (ConnectX), extremely low latencies are possible
> for short messages between two MPI processes. Currently, OMPI's latency
> is around 1.9us while all the other MPIs (HP MPI, Intel MPI, MVAPICH[2],
> etc.) are around 1.4us. A big reason for this difference is that, at
> least with MVAPICH[2], they are doing wire-protocol header caching where
> the openib BTL does not. Specifically:
>
> - Mellanox tested MVAPICH with the header caching; latency was around 1.4us
> - Mellanox tested MVAPICH without the header caching; latency was around 1.9us
>
> Given that OMPI is the lone outlier around 1.9us, I think we have no
> choice except to implement the header caching and/or examine our header
> to see if we can shrink it. Mellanox has volunteered to implement header
> caching in the openib BTL.
>
> Any objections? We can discuss what approaches we want to take (there's
> going to be some complications because of the PML driver, etc.); perhaps
> in the Tuesday Mellanox teleconf...?

This sounds great. Sun would like to hear how things are being done so we can possibly port the solution to the udapl BTL.

--td
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 11:06:00AM +0300, Pavel Shamis (Pasha) wrote:
> >> Any objections? We can discuss what approaches we want to take
> >> (there's going to be some complications because of the PML driver,
> >> etc.); perhaps in the Tuesday Mellanox teleconf...?
> >
> > My main objection is that the only reason you propose to do this is some
> > bogus benchmark?
>
> Pallas, Presta (as I know) also use a static rank. So let's start to fix
> all "bogus" benchmarks :-) ?

All benchmarks are bogus. I have a better optimization: check the name of the executable, and if it is some known benchmark, send one byte instead of the real message. 3D drivers do this, why can't we?

--
Gleb.
Re: [OMPI devel] openib btl header caching
>> Any objections? We can discuss what approaches we want to take (there's
>> going to be some complications because of the PML driver, etc.); perhaps
>> in the Tuesday Mellanox teleconf...?
>
> My main objection is that the only reason you propose to do this is some
> bogus benchmark?

Pallas, Presta (as I know) also use a static rank. So let's start to fix all "bogus" benchmarks :-) ?

Pasha.