Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
(adding pkg-openmpi-maintain...@lists.alioth.debian.org which I should have added earlier, sorry! --Dirk)

On 14 August 2007 at 00:08, Adrian Knoth wrote:
| On Mon, Aug 13, 2007 at 04:26:31PM -0500, Dirk Eddelbuettel wrote:
|
| > > I'll now compile the 1.2.3 release tarball and see if I can reproduce
|
| The 1.2.3 release also works fine:
|
| adi@debian:~$ ./ompi123/bin/mpirun -np 2 ring
| 0: sending message (0) to 1
| 0: sent message
| 1: waiting for message
| 1: got message (1) from 0, sending to 0
| 0: got message (1) from 1

Now I'm even more confused. I thought the bug was that it segfaulted when used on a Debian-on-FreeBSD-kernel system?

| adi@debian:~$ ./ompi123/bin/ompi_info
|                Open MPI: 1.2.3
|   Open MPI SVN revision: r15136
|                Open RTE: 1.2.3
|   Open RTE SVN revision: r15136
|                    OPAL: 1.2.3
|       OPAL SVN revision: r15136
|                  Prefix: /home/adi/ompi123
| Configured architecture: x86_64-unknown-kfreebsd6.2-gnu
|
| > > the segfaults. On the other hand, I guess nobody is using OMPI on
| > > GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot
| > > would also fix the problem (think of "fixed in experimental").
|
| > Well, I generally prefer to follow upstream releases, and Jeff from the
| > upstream team echoed that. Let's wait for 1.2.4, shall we?
|
| That's fine, v1.2 is the production release.
|
| > | JFTR: It's currently not possible to compile OMPI on amd64 (out of the
| > | box). Though it compiles on i386
| > |
| > | http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-i386&stamp=1187000200&file=log&as=raw
| > |
| > | it fails on amd64:
| > |
| > | http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-amd64&stamp=1186969782&file=log&as=raw
| > |
| > | stacktrace.c: In function 'opal_show_stackframe':
| > | stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this function)
| > | stacktrace.c:145: error: (Each undeclared identifier is reported only once
| > | stacktrace.c:145: error: for each function it appears in.)
| > | stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this function)
| > | stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this function)
| > | make[4]: *** [stacktrace.lo] Error 1
| > | make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'
| > |
| > | This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h: the
| > | relevant #define's are placed in an #ifdef __i386__ condition. After
| > | extending this for __x86_64__, everything works fine.
| > |
| > | Should I file a bug report against libc0.1-dev or will you take care?
|
| > I'm confused. What is libc0.1-dev?
|
| http://packages.debian.org/unstable/libdevel/libc0.1-dev
|
| It's the "libc6-dev" for GNU/kFreeBSD, at least that's how I understand it.

I see, thanks. Well, if the bug is in the header files supplied by that package, please go ahead and file a bug report.

| > Also note that I happened to have uploaded a third Debian revision of 1.2.3
| > yesterday, and that Debian release 1.2.3-3 built fine on amd64 as per:
| >
| > http://buildd.debian.org/build.php?&pkg=openmpi&ver=1.2.3-3&arch=amd64&file=log
| >
| > So are we sure there's a bug?
|
| Yes, absolutely. I was a little bit imprecise: with amd64, I meant
| kfreebsd-amd64, not Linux-amd64.

Ack.

| If you follow my two links and read their headlines, you can see that
| these are the build logs of 1.2.3-3 on kfreebsd, working for i386 but
| failing for amd64.
|
| This is caused by "wrong" libc headers on kfreebsd; that's why I thought
| Uwe might want to have a look at it.

Ok. Back to the initial bug of Open MPI on Debian/kFreeBSD. What exactly is the status now?

Thanks, Dirk

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
On Mon, Aug 13, 2007 at 04:26:31PM -0500, Dirk Eddelbuettel wrote:

> > I'll now compile the 1.2.3 release tarball and see if I can reproduce

The 1.2.3 release also works fine:

adi@debian:~$ ./ompi123/bin/mpirun -np 2 ring
0: sending message (0) to 1
0: sent message
1: waiting for message
1: got message (1) from 0, sending to 0
0: got message (1) from 1

adi@debian:~$ ./ompi123/bin/ompi_info
               Open MPI: 1.2.3
  Open MPI SVN revision: r15136
               Open RTE: 1.2.3
  Open RTE SVN revision: r15136
                   OPAL: 1.2.3
      OPAL SVN revision: r15136
                 Prefix: /home/adi/ompi123
Configured architecture: x86_64-unknown-kfreebsd6.2-gnu

> > the segfaults. On the other hand, I guess nobody is using OMPI on
> > GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot
> > would also fix the problem (think of "fixed in experimental").

> Well, I generally prefer to follow upstream releases, and Jeff from the
> upstream team echoed that. Let's wait for 1.2.4, shall we?

That's fine, v1.2 is the production release.

> | JFTR: It's currently not possible to compile OMPI on amd64 (out of the
> | box). Though it compiles on i386
> |
> | http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-i386&stamp=1187000200&file=log&as=raw
> |
> | it fails on amd64:
> |
> | http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-amd64&stamp=1186969782&file=log&as=raw
> |
> | stacktrace.c: In function 'opal_show_stackframe':
> | stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this function)
> | stacktrace.c:145: error: (Each undeclared identifier is reported only once
> | stacktrace.c:145: error: for each function it appears in.)
> | stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this function)
> | stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this function)
> | make[4]: *** [stacktrace.lo] Error 1
> | make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'
> |
> | This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h: the
> | relevant #define's are placed in an #ifdef __i386__ condition. After
> | extending this for __x86_64__, everything works fine.
> |
> | Should I file a bug report against libc0.1-dev or will you take care?

> I'm confused. What is libc0.1-dev?

http://packages.debian.org/unstable/libdevel/libc0.1-dev

It's the "libc6-dev" for GNU/kFreeBSD, at least that's how I understand it.

> Also note that I happened to have uploaded a third Debian revision of 1.2.3
> yesterday, and that Debian release 1.2.3-3 built fine on amd64 as per:
>
> http://buildd.debian.org/build.php?&pkg=openmpi&ver=1.2.3-3&arch=amd64&file=log
>
> So are we sure there's a bug?

Yes, absolutely. I was a little bit imprecise: with amd64, I meant kfreebsd-amd64, not Linux-amd64.

If you follow my two links and read their headlines, you can see that these are the build logs of 1.2.3-3 on kfreebsd, working for i386 but failing for amd64.

This is caused by "wrong" libc headers on kfreebsd; that's why I thought Uwe might want to have a look at it.

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
Adrian,

On 13 August 2007 at 22:28, Adrian Knoth wrote:
| On Thu, Aug 02, 2007 at 10:51:13AM +0200, Adrian Knoth wrote:
|
| > > We (as in the Debian maintainer for Open MPI) got this bug report from
| > > Uwe who sees mpi apps segfault on Debian systems with the FreeBSD
| > > kernel.
| > > Any input would be greatly appreciated!
|
| > I'll follow the QEMU instructions on your website and investigate on
| > my own ;)
|
| I was able to get OMPI running on kfreebsd-amd64. I used a nightly
| snapshot from the trunk, so the problem is "more or less fixed by
| upstream" ;)
|
| adi@debian:~$ ./ompi/bin/mpirun -np 2 ring
| 0: sending message (0) to 1
| 0: sent message
| 1: waiting for message
| 1: got message (1) from 0, sending to 0
| 0: got message (1) from 1
|
| adi@debian:~$ ./ompi/bin/ompi_info
|                Open MPI: 1.3a1r15820
|   Open MPI SVN revision: r15820
|                Open RTE: 1.3a1r15820
|   Open RTE SVN revision: r15820
|                    OPAL: 1.3a1r15820
|       OPAL SVN revision: r15820
|                  Prefix: /home/adi/ompi
| Configured architecture: x86_64-unknown-kfreebsd6.2-gnu
|
| I'll now compile the 1.2.3 release tarball and see if I can reproduce

I really appreciate the help.

| the segfaults. On the other hand, I guess nobody is using OMPI on
| GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot
| would also fix the problem (think of "fixed in experimental").

Well, I generally prefer to follow upstream releases, and Jeff from the upstream team echoed that. Let's wait for 1.2.4, shall we? OTOH, if you can back out a patch for 1.2.3, I'd apply that.

| JFTR: It's currently not possible to compile OMPI on amd64 (out of the
| box). Though it compiles on i386
|
| http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-i386&stamp=1187000200&file=log&as=raw
|
| it fails on amd64:
|
| http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-amd64&stamp=1186969782&file=log&as=raw
|
| stacktrace.c: In function 'opal_show_stackframe':
| stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this function)
| stacktrace.c:145: error: (Each undeclared identifier is reported only once
| stacktrace.c:145: error: for each function it appears in.)
| stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this function)
| stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this function)
| make[4]: *** [stacktrace.lo] Error 1
| make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'
|
| This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h: the
| relevant #define's are placed in an #ifdef __i386__ condition. After
| extending this for __x86_64__, everything works fine.
|
| Should I file a bug report against libc0.1-dev or will you take care?

I'm confused. What is libc0.1-dev?

Also note that I happened to have uploaded a third Debian revision of 1.2.3 yesterday, and that Debian release 1.2.3-3 built fine on amd64 as per:

http://buildd.debian.org/build.php?&pkg=openmpi&ver=1.2.3-3&arch=amd64&file=log

So are we sure there's a bug? Maybe you were just bitten by something in SVN that is not yet deemed release quality?

| I'll keep you posted...

I appreciate that.

Cheers, Dirk

| --
| Cluster and Metacomputing Working Group
| Friedrich-Schiller-Universität Jena, Germany
|
| private: http://adi.thur.de
| ___
| devel mailing list
| de...@open-mpi.org
| http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Three out of two people have difficulties with fractions.
Re: [OMPI devel] Collectives interface change
On Thu, 2007-08-09 at 14:49 -0600, Brian Barrett wrote:
> Hi all -
>
> There was significant discussion this week at the collectives meeting
> about improving the selection logic for collective components. While
> we'd like the automated collectives selection logic laid out in the
> Collv2 document, it was decided that as a first step, we would allow
> more than one component (plus basic) to be used for a given communicator.
>
> This mandated the change of a couple of things in the collectives
> interface, namely how collectives module data is found (passed into a
> function, rather than a static pointer on the component) and a bit of
> the initialization sequence.
>
> The revised interface and the rest of the code is available in an svn
> temp branch:
>
> https://svn.open-mpi.org/svn/ompi/tmp/bwb-coll-select
>
> Thus far, most of the components in common use have been updated.
> The notable exception is the tuned collectives routine, which Ollie
> is updating in the near future.
>
> If you have any comments on the changes, please let me know. If not,
> the changes will move to the trunk once Ollie has finished updating
> the tuned component.

Done.

Ollie
Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
On Aug 13, 2007, at 4:28 PM, Adrian Knoth wrote:

> I'll now compile the 1.2.3 release tarball and see if I can reproduce
> the segfaults. On the other hand, I guess nobody is using OMPI on
> GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot
> would also fix the problem (think of "fixed in experimental").

FWIW, the OMPI subversion trunk has diverged quite a bit from the v1.2 branch; you might want to wait until the fixes get moved over to the v1.2 branch and take a snapshot from there (i.e., what will eventually become the v1.2.4 release).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] [u...@hermann-uwe.de: [Pkg-openmpi-maintainers] Bug#435581: openmpi-bin: Segfault on Debian GNU/kFreeBSD]
On Thu, Aug 02, 2007 at 10:51:13AM +0200, Adrian Knoth wrote:

> > We (as in the Debian maintainer for Open MPI) got this bug report from
> > Uwe who sees mpi apps segfault on Debian systems with the FreeBSD
> > kernel.
> > Any input would be greatly appreciated!

> I'll follow the QEMU instructions on your website and investigate on
> my own ;)

I was able to get OMPI running on kfreebsd-amd64. I used a nightly snapshot from the trunk, so the problem is "more or less fixed by upstream" ;)

adi@debian:~$ ./ompi/bin/mpirun -np 2 ring
0: sending message (0) to 1
0: sent message
1: waiting for message
1: got message (1) from 0, sending to 0
0: got message (1) from 1

adi@debian:~$ ./ompi/bin/ompi_info
               Open MPI: 1.3a1r15820
  Open MPI SVN revision: r15820
               Open RTE: 1.3a1r15820
  Open RTE SVN revision: r15820
                   OPAL: 1.3a1r15820
      OPAL SVN revision: r15820
                 Prefix: /home/adi/ompi
Configured architecture: x86_64-unknown-kfreebsd6.2-gnu

I'll now compile the 1.2.3 release tarball and see if I can reproduce the segfaults. On the other hand, I guess nobody is using OMPI on GNU/kFreeBSD, so upgrading the openmpi-package to a subversion snapshot would also fix the problem (think of "fixed in experimental").

JFTR: It's currently not possible to compile OMPI on amd64 (out of the box). Though it compiles on i386

http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-i386&stamp=1187000200&file=log&as=raw

it fails on amd64:

http://experimental.debian.net/fetch.php?&pkg=openmpi&ver=1.2.3-3&arch=kfreebsd-amd64&stamp=1186969782&file=log&as=raw

stacktrace.c: In function 'opal_show_stackframe':
stacktrace.c:145: error: 'FPE_FLTDIV' undeclared (first use in this function)
stacktrace.c:145: error: (Each undeclared identifier is reported only once
stacktrace.c:145: error: for each function it appears in.)
stacktrace.c:146: error: 'FPE_FLTOVF' undeclared (first use in this function)
stacktrace.c:147: error: 'FPE_FLTUND' undeclared (first use in this function)
make[4]: *** [stacktrace.lo] Error 1
make[4]: Leaving directory `/build/buildd/openmpi-1.2.3/opal/util'

This is caused by libc0.1-dev in /usr/include/bits/sigcontext.h: the relevant #define's are placed in an #ifdef __i386__ condition. After extending this for __x86_64__, everything works fine.

Should I file a bug report against libc0.1-dev or will you take care?

I'll keep you posted...

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany

private: http://adi.thur.de
Re: [OMPI devel] Problem in mpool rdma finalize
On Aug 13, 2007, at 4:04 PM, Gleb Natapov wrote:

> > mpool rdma finalize was an empty function. I changed it to do the
> > "finalize" job: go over all registered segments in the mpool and
> > release them one by one. The mpool uses a reference counter for each
> > memory region, which prevents us from a double-free bug. In the openib
> > btl, all memory that was registered with the mpool will also be
> > unregistered with the mpool at finalize. So maybe in gm the memory
> > (that was registered with the mpool) is released directly (not via the
> > mpool) and that causes the segfault.
>
> As far as I understand, the problem Tim sees is much more serious.
> During finalize the gm BTL is unloaded, and only after that is mpool
> finalize called. The mpool uses callbacks into the gm BTL to
> register/unregister memory, but the BTL is not there anymore.

Right. We had the same problem in the openib btl, too. See
https://svn.open-mpi.org/trac/ompi/changeset/15735.

I don't know if this is the exact same scenario Tim is running into, but the end result is the same (the openib btl was being destroyed and still leaving memory registered in the mpool).

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 03:59:28PM -0400, Richard Graham wrote:
> On 8/13/07 3:52 PM, "Gleb Natapov" wrote:
> > On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > > Here are the items we have identified:
> >
> > All those things sound very promising. Is there a tmp branch where you
> > are going to work on this?
>
> tmp/latency
>
> Some changes have already gone in - mainly trying to remove as much as
> possible from the isend/send path, before moving on to the list below.
> Do you have cycles to help with this?

I am very interested, not sure about cycles though. I'll get back from my vacation next week and look over this list one more time to see where I can help.

> Rich
>
> > > 1) remove 0 byte optimization of not initializing the convertor
> > > This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> > > "if" in mca_pml_ob1_send_request_start_copy
> > > +++
> > > Measure the convertor initialization before taking any other action.
> > >
> > > 2) get rid of mca_pml_ob1_send_request_start_prepare and
> > > mca_pml_ob1_send_request_start_copy by removing the
> > > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> > > return OMPI_SUCCESS if the fragment can be marked as completed and
> > > OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> > > solves another problem: with IB, if there are a bunch of isends
> > > outstanding, we end up buffering them all in the btl, marking
> > > completion, and never getting them on the wire because the BTL runs
> > > out of credits; we never get credits back until finalize because we
> > > never call progress, since the requests are complete. There is one
> > > issue here: start_prepare calls prepare_src and start_copy calls
> > > alloc. I think we can work around this by just always using
> > > prepare_src; the OpenIB BTL will give a fragment off the free list
> > > anyway because the fragment is less than the eager limit.
> > > +++
> > > Make the BTL return different return codes for the send. If the
> > > fragment is gone, then the PML is responsible for marking the MPI
> > > request as completed and so on. Only the updated BTLs will get any
> > > benefit from this feature. Add a flag into the descriptor to allow
> > > or not the BTL to free the fragment.
> > >
> > > Add a 3-level flag:
> > > - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
> > >   the send, and then it reports back a special return to the PML
> > > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
> > >   by the BTL once the completion callback was triggered.
> > > - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the
> > >   fragment at all (the PML is responsible for this).
> > >
> > > Return codes:
> > > - done and there will be no callbacks
> > > - not done, wait for a callback later
> > > - error state
> > >
> > > 3) Change the remote callback function (and tag value based on what
> > > data we are sending), don't use mca_pml_ob1_recv_frag_callback for
> > > everything! I think we need:
> > >
> > > mca_pml_ob1_recv_frag_match
> > > mca_pml_ob1_recv_frag_rndv
> > > mca_pml_ob1_recv_frag_rget
> > >
> > > mca_pml_ob1_recv_match_ack_copy
> > > mca_pml_ob1_recv_match_ack_pipeline
> > >
> > > mca_pml_ob1_recv_copy_frag
> > > mca_pml_ob1_recv_put_request
> > > mca_pml_ob1_recv_put_fin
> > > +++
> > > Passing the callback as a parameter to the match function will save
> > > us 2 switches. Add more registrations in the BTL in order to jump
> > > directly to the correct function (the first 3 require a match while
> > > the others don't). 4 & 4 bits on the tag, so each layer will have 4
> > > bits of tags [i.e. the first 4 bits for the protocol tag and the
> > > lower 4 bits are up to the protocol] and the registration table will
> > > still be local to each component.
> > >
> > > 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> > > switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> > > I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> > > take a function pointer for what it should call on a successful
> > > match. So based on the receive callback we can pass the correct
> > > scheduling function to invoke into the generic
> > > mca_pml_ob1_recv_frag_match
Re: [OMPI devel] Problem in mpool rdma finalize
On Mon, Aug 13, 2007 at 05:00:37PM +0300, Pavel Shamis (Pasha) wrote:
> Jeff Squyres wrote:
> > FWIW: we fixed this recently in the openib BTL by ensuring that all
> > registered memory is freed during the BTL finalize (vs. the mpool
> > finalize).
> >
> > This is a new issue because the mpool finalize was just recently
> > expanded to un-register all of its memory as part of the NIC-restart
> > effort (and will likely also be needed for checkpoint/restart...?).
>
> mpool rdma finalize was an empty function. I changed it to do the
> "finalize" job: go over all registered segments in the mpool and release
> them one by one. The mpool uses a reference counter for each memory
> region, which prevents us from a double-free bug. In the openib btl,
> all memory that was registered with the mpool will also be unregistered
> with the mpool at finalize. So maybe in gm the memory (that was
> registered with the mpool) is released directly (not via the mpool) and
> that causes the segfault.

As far as I understand, the problem Tim sees is much more serious. During finalize the gm BTL is unloaded, and only after that is mpool finalize called. The mpool uses callbacks into the gm BTL to register/unregister memory, but the BTL is not there anymore.

> Pasha
>
> > On Aug 13, 2007, at 9:11 AM, Tim Prins wrote:
> >
> >> Hi folks,
> >>
> >> I have run into a problem with mca_mpool_rdma_finalize as implemented
> >> in r15557. With the t_win onesided test, running over gm, it
> >> segfaults. What appears to be happening is that some memory is
> >> registered with gm, and then gets freed by mca_mpool_rdma_finalize.
> >> But the free function that it is using is in the gm btl, and the btls
> >> are unloaded before the mpool is shut down. So the function call
> >> segfaults.
> >>
> >> If I change the code so we never unload the btls (and we don't free
> >> the gm port), it works fine.
> >>
> >> Note that the openib btl works just fine.
> >>
> >> Forgive me if this is a known problem, I am trying to catch up from my
> >> vacation...
> >>
> >> Tim
> >>
> >> ---
> >> If anyone cares, here is the callstack:
> >> (gdb) bt
> >> #0  0x404de825 in ?? () from /lib/libgcc_s.so.1
> >> #1  0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690)
> >>     at mpool_rdma_module.c:431
> >> #2  0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
> >> #3  0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
> >> #4  0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
> >> #5  0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
> >> (gdb)
> >> gdb shows that at this point the gm btl is no longer loaded.

--
Gleb.
Re: [OMPI devel] openib btl header caching
On 8/13/07 3:52 PM, "Gleb Natapov" wrote:

> On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> > Here are the items we have identified:
>
> All those things sound very promising. Is there a tmp branch where you
> are going to work on this?

tmp/latency

Some changes have already gone in - mainly trying to remove as much as possible from the isend/send path, before moving on to the list below. Do you have cycles to help with this?

Rich

> > 1) remove 0 byte optimization of not initializing the convertor
> > This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> > "if" in mca_pml_ob1_send_request_start_copy
> > +++
> > Measure the convertor initialization before taking any other action.
> >
> > 2) get rid of mca_pml_ob1_send_request_start_prepare and
> > mca_pml_ob1_send_request_start_copy by removing the
> > MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> > return OMPI_SUCCESS if the fragment can be marked as completed and
> > OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> > solves another problem: with IB, if there are a bunch of isends
> > outstanding, we end up buffering them all in the btl, marking
> > completion, and never getting them on the wire because the BTL runs
> > out of credits; we never get credits back until finalize because we
> > never call progress, since the requests are complete. There is one
> > issue here: start_prepare calls prepare_src and start_copy calls
> > alloc. I think we can work around this by just always using
> > prepare_src; the OpenIB BTL will give a fragment off the free list
> > anyway because the fragment is less than the eager limit.
> > +++
> > Make the BTL return different return codes for the send. If the
> > fragment is gone, then the PML is responsible for marking the MPI
> > request as completed and so on. Only the updated BTLs will get any
> > benefit from this feature. Add a flag into the descriptor to allow
> > or not the BTL to free the fragment.
> >
> > Add a 3-level flag:
> > - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
> >   the send, and then it reports back a special return to the PML
> > - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
> >   by the BTL once the completion callback was triggered.
> > - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
> >   at all (the PML is responsible for this).
> >
> > Return codes:
> > - done and there will be no callbacks
> > - not done, wait for a callback later
> > - error state
> >
> > 3) Change the remote callback function (and tag value based on what
> > data we are sending), don't use mca_pml_ob1_recv_frag_callback for
> > everything! I think we need:
> >
> > mca_pml_ob1_recv_frag_match
> > mca_pml_ob1_recv_frag_rndv
> > mca_pml_ob1_recv_frag_rget
> >
> > mca_pml_ob1_recv_match_ack_copy
> > mca_pml_ob1_recv_match_ack_pipeline
> >
> > mca_pml_ob1_recv_copy_frag
> > mca_pml_ob1_recv_put_request
> > mca_pml_ob1_recv_put_fin
> > +++
> > Passing the callback as a parameter to the match function will save
> > us 2 switches. Add more registrations in the BTL in order to jump
> > directly to the correct function (the first 3 require a match while
> > the others don't). 4 & 4 bits on the tag, so each layer will have 4
> > bits of tags [i.e. the first 4 bits for the protocol tag and the
> > lower 4 bits are up to the protocol] and the registration table will
> > still be local to each component.
> >
> > 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> > switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> > I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> > take a function pointer for what it should call on a successful match.
> > So based on the receive callback we can pass the correct scheduling
> > function to invoke into the generic mca_pml_ob1_recv_frag_match
> >
> > Recv_request progress is called in a generic way from multiple places,
> > and we do a big switch inside. In the match function we might want to
> > pass a function pointer to the successful match progress function.
> > This way we will be able to specialize what happens after the match,
> > in a more optimized way. Or the recv_request_match can return the
> > match and then the caller will have to specialize its action.
>
> ---
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 09:12:33AM -0600, Galen Shipman wrote:
> Here are the items we have identified:

All those things sound very promising. Is there a tmp branch where you are going to work on this?

> 1) remove 0 byte optimization of not initializing the convertor
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> "if" in mca_pml_ob1_send_request_start_copy
> +++
> Measure the convertor initialization before taking any other action.
>
> 2) get rid of mca_pml_ob1_send_request_start_prepare and
> mca_pml_ob1_send_request_start_copy by removing the
> MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send
> return OMPI_SUCCESS if the fragment can be marked as completed and
> OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This
> solves another problem: with IB, if there are a bunch of isends
> outstanding, we end up buffering them all in the btl, marking
> completion, and never getting them on the wire because the BTL runs
> out of credits; we never get credits back until finalize because we
> never call progress, since the requests are complete. There is one
> issue here: start_prepare calls prepare_src and start_copy calls
> alloc. I think we can work around this by just always using
> prepare_src; the OpenIB BTL will give a fragment off the free list
> anyway because the fragment is less than the eager limit.
> +++
> Make the BTL return different return codes for the send. If the
> fragment is gone, then the PML is responsible for marking the MPI
> request as completed and so on. Only the updated BTLs will get any
> benefit from this feature. Add a flag into the descriptor to allow
> or not the BTL to free the fragment.
>
> Add a 3-level flag:
> - BTL_HAVE_OWNERSHIP : the fragment can be released by the BTL after
>   the send, and then it reports back a special return to the PML
> - BTL_HAVE_OWNERSHIP_AFTER_CALLBACK : the fragment will be released
>   by the BTL once the completion callback was triggered.
> - PML_HAVE_OWNERSHIP : the BTL is not allowed to release the fragment
>   at all (the PML is responsible for this).
>
> Return codes:
> - done and there will be no callbacks
> - not done, wait for a callback later
> - error state
>
> 3) Change the remote callback function (and tag value based on what
> data we are sending), don't use mca_pml_ob1_recv_frag_callback for
> everything! I think we need:
>
> mca_pml_ob1_recv_frag_match
> mca_pml_ob1_recv_frag_rndv
> mca_pml_ob1_recv_frag_rget
>
> mca_pml_ob1_recv_match_ack_copy
> mca_pml_ob1_recv_match_ack_pipeline
>
> mca_pml_ob1_recv_copy_frag
> mca_pml_ob1_recv_put_request
> mca_pml_ob1_recv_put_fin
> +++
> Passing the callback as a parameter to the match function will save
> us 2 switches. Add more registrations in the BTL in order to jump
> directly to the correct function (the first 3 require a match while
> the others don't). 4 & 4 bits on the tag, so each layer will have 4
> bits of tags [i.e. the first 4 bits for the protocol tag and the
> lower 4 bits are up to the protocol] and the registration table will
> still be local to each component.
>
> 4) Get rid of mca_pml_ob1_recv_request_progress; this does the same
> switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback!
> I think what we can do here is modify mca_pml_ob1_recv_frag_match to
> take a function pointer for what it should call on a successful match.
> So based on the receive callback we can pass the correct scheduling
> function to invoke into the generic mca_pml_ob1_recv_frag_match
>
> Recv_request progress is called in a generic way from multiple places,
> and we do a big switch inside. In the match function we might want to
> pass a function pointer to the successful match progress function.
> This way we will be able to specialize what happens after the match,
> in a more optimized way. Or the recv_request_match can return the
> match and then the caller will have to specialize its action.
>
> 5) Don't initialize the entire request. We can use item 2 below (if
> we get back OMPI_SUCCESS from btl_send) then we don't need to fully
Re: [OMPI devel] openib btl header caching
On 8/13/07 12:34 PM, "Galen Shipman" wrote:

>> Ok, here are the numbers on my machines:
>>
>> 0 bytes:
>>   mvapich with header caching: 1.56
>>   mvapich without header caching: 1.79
>>   ompi 1.2: 1.59
>>
>> So on zero bytes ompi is not so bad. Also we can see that header
>> caching decreases the mvapich latency by 0.23.
>>
>> 1 byte:
>>   mvapich with header caching: 1.58
>>   mvapich without header caching: 1.83
>>   ompi 1.2: 1.73
>
> Is this just convertor initialization cost?

Last night I measured the cost of the convertor initialization in ob1 on my dual-processor Mac, using ompi-tests/simple/ping/mpi-ping, and it costs 0.02 to 0.03 microseconds. To be specific, I commented out the check for 0-byte message size, and the latency went up from about 0.59 usec (this is with modified code in tmp/latency) to about 0.62 usec.

Rich

> - Galen
>
>> And here ompi makes some latency jump. In mvapich the header caching
>> decreases the header size from 56 bytes to 12 bytes. What is the
>> header size (pml + btl) in ompi?
>>> The match header size is 16 bytes, so it looks like ours is already
>>> optimized ...
>> So for 0-byte messages we are sending only 16 bytes on the wire, is
>> that correct?
>>
>> Pasha.
>>> george.

___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:12 AM, Galen Shipman wrote:

> 1) remove 0 byte optimization of not initializing the convertor
> This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an
> "if" in mca_pml_ob1_send_request_start_copy
> +++
> Measure the convertor initialization before taking any other action.

I talked with Galen and then with Pasha; Pasha will look into this. Specifically:

- Investigate ob1 and find all the places we're doing 0-byte optimizations (I don't think that there are any in the openib btl...?).
- Selectively remove each of the zero-byte optimizations and measure what the cost is, both in terms of time and cycles (using the RDTSC macro/inline function that's somewhere already in OMPI). If possible, it would be best to measure these individually rather than removing all of them and looking at the aggregate.
- Do all of this with and without heterogeneous support enabled to measure what the cost of heterogeneity is.

This will enable us to find out where the time is being spent. Clearly, there are some differences between zero- and nonzero-byte messages, so it would be a good first step to understand exactly what they are.

> 2) get rid of mca_pml_ob1_send_request_start_prepare and [...]

This is also all good stuff; let's look into the zero-byte optimizations first and then tackle the rest of these after that. Good?

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:28 AM, George Bosilca wrote:

>> Such a scheme is certainly possible, but I see even less use for it
>> than use cases for the existing microbenchmarks. Specifically, header
>> caching *can* happen in real applications (i.e., repeatedly send
>> short messages with the same MPI signature), but repeatedly sending
>> to the same peer with exactly the same signature *and* exactly the
>> same "long-enough" data (i.e., more than a small number of ints that
>> an app could use for its own message data caching) is indicative of a
>> poorly-written MPI application IMHO.
>
> If you look at the message size distribution for most of the HPC
> applications (at least the ones that get investigated in the papers)
> you will see that very small messages are only an insignificant
> percentage of messages.

This would be different from what Patrick has told us about Myricom's analysis of real-world MPI applications, and one of the strong points of QLogic's HCAs (that it's all about short-message latency / injection rate; bandwidth issues are [at least currently] secondary). :-)

> As this "optimization" only addresses these kinds of messages, I
> doubt there is any real benefit from the application's point of view
> (obviously there will be a few exceptions as usual). The header
> caching only makes sense for very small messages (MVAPICH only
> implements header caching for messages up to 155 bytes [that's less
> than 20 doubles] if I remember well), which makes it a real benchmark
> optimization.

I don't have enough data to say. But I'm sure there are at least *some* applications out there that would benefit from it. Probably somewhere between 1 and 99%. ;-)

But just to reiterate/be clear: my goal here is to reduce latency. If header caching is not the way to go, then so be it.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
Brian Barrett wrote:
> On Aug 13, 2007, at 9:33 AM, George Bosilca wrote:
>> On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:
>>> Jeff Squyres wrote:
>>>> I guess reading the graph that Pasha sent is difficult; Pasha --
>>>> can you send the actual numbers?
>>> Ok, here are the numbers on my machines:
>>> 0 bytes: mvapich with header caching: 1.56; mvapich without header
>>> caching: 1.79; ompi 1.2: 1.59
>>> So on zero bytes ompi is not so bad. Also we can see that header
>>> caching decreases the mvapich latency by 0.23.
>>> 1 byte: mvapich with header caching: 1.58; mvapich without header
>>> caching: 1.83; ompi 1.2: 1.73
>>> And here ompi makes some latency jump. In mvapich the header
>>> caching decreases the header size from 56 bytes to 12 bytes. What
>>> is the header size (pml + btl) in ompi?
>> The match header size is 16 bytes, so it looks like ours is already
>> optimized ...
>
> Pasha -- Is your build of Open MPI built with --disable-heterogeneous?
> If not, our headers all grow slightly to support heterogeneous
> operations. For the heterogeneous case, a 1 byte message includes:

I didn't build with "--disable-heterogeneous". So heterogeneous support was enabled in the build.

>   16 bytes for the match header
>    4 bytes for the Open IB header
>    1 byte for the payload
>   21 bytes total
>
> If you are using eager RDMA, there's an extra 4 bytes for the RDMA
> length in the footer. Without heterogeneous support, 2 bytes get
> knocked off the size of the match header, so the whole thing will be
> 19 bytes (+ 4 for the eager RDMA footer).

I used eager RDMA - it is faster than send. So the message size on the wire for 1 byte in my case was 25 bytes vs. 13 bytes in mvapich. And if I use --disable-heterogeneous it will decrease by 2 bytes. So it sounds like we are pretty optimized.

> There are also considerably more ifs in the code if heterogeneous is
> used, especially on x86 machines.
>
> Brian
Re: [OMPI devel] openib btl header caching
>> Ok, here are the numbers on my machines:
>>
>> 0 bytes:
>>   mvapich with header caching: 1.56
>>   mvapich without header caching: 1.79
>>   ompi 1.2: 1.59
>>
>> So on zero bytes ompi is not so bad. Also we can see that header
>> caching decreases the mvapich latency by 0.23.
>>
>> 1 byte:
>>   mvapich with header caching: 1.58
>>   mvapich without header caching: 1.83
>>   ompi 1.2: 1.73
>
> Is this just convertor initialization cost?
>
> - Galen

>> And here ompi makes some latency jump. In mvapich the header caching
>> decreases the header size from 56 bytes to 12 bytes. What is the
>> header size (pml + btl) in ompi?
>>> The match header size is 16 bytes, so it looks like ours is already
>>> optimized ...

So for 0-byte messages we are sending only 16 bytes on the wire, is that correct?

Pasha.

>>> george.
Re: [OMPI devel] openib btl header caching
George Bosilca wrote: On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote: Jeff Squyres wrote: I guess reading the graph that Pasha sent is difficult; Pasha -- can you send the actual numbers? Ok here is the numbers on my machines: 0 bytes mvapich with header caching: 1.56 mvapich without header caching: 1.79 ompi 1.2: 1.59 So on zero bytes ompi not so bad. Also we can see that header caching decrease the mvapich latency on 0.23 1 bytes mvapich with header caching: 1.58 mvapich without header caching: 1.83 ompi 1.2: 1.73 And here ompi make some latency jump. In mvapich the header caching decrease the header size from 56bytes to 12bytes. What is the header size (pml + btl) in ompi ? The match header size is 16 bytes, so it looks like ours is already optimized ... So for 0 bytes message we are sending only 16bytes on the wire , is it correct ? Pasha. george. Pasha
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 9:33 AM, George Bosilca wrote:
> On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote:
>> Jeff Squyres wrote:
>>> I guess reading the graph that Pasha sent is difficult; Pasha --
>>> can you send the actual numbers?
>> Ok, here are the numbers on my machines:
>> 0 bytes: mvapich with header caching: 1.56; mvapich without header
>> caching: 1.79; ompi 1.2: 1.59
>> So on zero bytes ompi is not so bad. Also we can see that header
>> caching decreases the mvapich latency by 0.23.
>> 1 byte: mvapich with header caching: 1.58; mvapich without header
>> caching: 1.83; ompi 1.2: 1.73
>> And here ompi makes some latency jump. In mvapich the header caching
>> decreases the header size from 56 bytes to 12 bytes. What is the
>> header size (pml + btl) in ompi?
> The match header size is 16 bytes, so it looks like ours is already
> optimized ...

Pasha -- Is your build of Open MPI built with --disable-heterogeneous? If not, our headers all grow slightly to support heterogeneous operations. For the heterogeneous case, a 1-byte message includes:

  16 bytes for the match header
   4 bytes for the Open IB header
   1 byte for the payload
  21 bytes total

If you are using eager RDMA, there's an extra 4 bytes for the RDMA length in the footer. Without heterogeneous support, 2 bytes get knocked off the size of the match header, so the whole thing will be 19 bytes (+ 4 for the eager RDMA footer).

There are also considerably more ifs in the code if heterogeneous is used, especially on x86 machines.

Brian
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote: Jeff Squyres wrote: I guess reading the graph that Pasha sent is difficult; Pasha -- can you send the actual numbers? Ok here is the numbers on my machines: 0 bytes mvapich with header caching: 1.56 mvapich without header caching: 1.79 ompi 1.2: 1.59 So on zero bytes ompi not so bad. Also we can see that header caching decrease the mvapich latency on 0.23 1 bytes mvapich with header caching: 1.58 mvapich without header caching: 1.83 ompi 1.2: 1.73 And here ompi make some latency jump. In mvapich the header caching decrease the header size from 56bytes to 12bytes. What is the header size (pml + btl) in ompi ? The match header size is 16 bytes, so it looks like ours is already optimized ... george. Pasha
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 11:07 AM, Jeff Squyres wrote:

> Such a scheme is certainly possible, but I see even less use for it
> than use cases for the existing microbenchmarks. Specifically, header
> caching *can* happen in real applications (i.e., repeatedly send
> short messages with the same MPI signature), but repeatedly sending
> to the same peer with exactly the same signature *and* exactly the
> same "long-enough" data (i.e., more than a small number of ints that
> an app could use for its own message data caching) is indicative of a
> poorly-written MPI application IMHO.

If you look at the message size distribution for most of the HPC applications (at least the ones that get investigated in the papers) you will see that very small messages are only an insignificant percentage of messages. As this "optimization" only addresses these kinds of messages, I doubt there is any real benefit from the application's point of view (obviously there will be a few exceptions as usual). The header caching only makes sense for very small messages (MVAPICH only implements header caching for messages up to 155 bytes [that's less than 20 doubles] if I remember well), which makes it a real benchmark optimization.

>> But don't complain if your Linpack run fails.
>
> I assume you're talking about bugs in the implementation; not a
> problem with the approach, right?

Of course, there is no apparent problem with my approach :) It is called an educated guess based on repetitive human-behavior analysis.

george.
Re: [OMPI devel] openib btl header caching
Jeff Squyres wrote:
> I guess reading the graph that Pasha sent is difficult; Pasha -- can
> you send the actual numbers?

Ok, here are the numbers on my machines:

0 bytes:
  mvapich with header caching: 1.56
  mvapich without header caching: 1.79
  ompi 1.2: 1.59

So on zero bytes ompi is not so bad. Also we can see that header caching decreases the mvapich latency by 0.23.

1 byte:
  mvapich with header caching: 1.58
  mvapich without header caching: 1.83
  ompi 1.2: 1.73

And here ompi makes some latency jump. In mvapich the header caching decreases the header size from 56 bytes to 12 bytes. What is the header size (pml + btl) in ompi?

Pasha
Re: [OMPI devel] openib btl header caching
I think we need to take a step back from micro-optimizations such as header caching. Rich, George, Brian and I are currently looking into latency improvements. We came up with several areas of performance enhancement that can be done with minimal disruption. The progress issue that Christian and others have pointed out does appear to be a problem, but will take a bit more work. I would like to see progress in these areas first, as I really don't like the idea of caching more endpoint state in OMPI for micro-benchmark latency improvements until we are certain we have done the groundwork for improving latency in the general case.

Here are the items we have identified:

1) Remove the 0-byte optimization of not initializing the convertor. This costs us an "if" in MCA_PML_BASE_SEND_REQUEST_INIT and an "if" in mca_pml_ob1_send_request_start_copy.
+++ Measure the convertor initialization before taking any other action.

2) Get rid of mca_pml_ob1_send_request_start_prepare and mca_pml_ob1_send_request_start_copy by removing the MCA_BTL_FLAGS_SEND_INPLACE flag. Instead we can simply have btl_send return OMPI_SUCCESS if the fragment can be marked as completed and OMPI_NOT_ON_WIRE if the fragment cannot be marked as complete. This solves another problem: with IB, if there are a bunch of isends outstanding we end up buffering them all in the btl, marking completion, and never getting them on the wire because the BTL runs out of credits; we never get credits back until finalize because we never call progress since the requests are complete. There is one issue here: start_prepare calls prepare_src and start_copy calls alloc. I think we can work around this by just always using prepare_src; the OpenIB BTL will give a fragment off the free list anyway because the fragment is less than the eager limit.
+++ Make the BTL return different return codes for the send. If the fragment is gone, then the PML is responsible for marking the MPI request as completed and so on. Only the updated BTLs will get any benefit from this feature. Add a flag into the descriptor to allow (or not) the BTL to free the fragment.

Add a 3-level flag:
- BTL_HAVE_OWNERSHIP: the fragment can be released by the BTL after the send, and then it reports back a special return code to the PML
- BTL_HAVE_OWNERSHIP_AFTER_CALLBACK: the fragment will be released by the BTL once the completion callback has been triggered
- PML_HAVE_OWNERSHIP: the BTL is not allowed to release the fragment at all (the PML is responsible for this)

Return codes:
- done, and there will be no callbacks
- not done, wait for a callback later
- error state

3) Change the remote callback function (and tag value based on what data we are sending); don't use mca_pml_ob1_recv_frag_callback for everything! I think we need:

mca_pml_ob1_recv_frag_match
mca_pml_ob1_recv_frag_rndv
mca_pml_ob1_recv_frag_rget

mca_pml_ob1_recv_match_ack_copy
mca_pml_ob1_recv_match_ack_pipeline

mca_pml_ob1_recv_copy_frag
mca_pml_ob1_recv_put_request
mca_pml_ob1_recv_put_fin

+++ Passing the callback as a parameter to the match function will save us 2 switches. Add more registrations in the BTL in order to jump directly into the correct function (the first 3 require a match while the others don't). Split the tag 4 & 4 bits so each layer has 4 bits of tags [i.e. the first 4 bits are the protocol tag and the lower 4 bits are up to the protocol], and the registration table will still be local to each component.

4) Get rid of mca_pml_ob1_recv_request_progress; this does the same switch on hdr->hdr_common.hdr_type as mca_pml_ob1_recv_frag_callback! I think what we can do here is modify mca_pml_ob1_recv_frag_match to take a function pointer for what it should call on a successful match. So based on the receive callback we can pass the correct scheduling function to invoke into the generic mca_pml_ob1_recv_frag_match.

recv_request_progress is called in a generic way from multiple places, and we do a big switch inside. In the match function we might want to pass a function pointer to the successful-match progress function. This way we will be able to specialize what happens after the match, in a more optimized way. Or recv_request_match can return the match and then the caller will have to specialize its action.

---
Re: [OMPI devel] openib btl header caching
We're working on it. Give us a few weeks to finish implementing all the planned optimizations/cleanups in the PML and then we can talk about tricks. We're expecting/hoping to slim down the PML layer by more than 0.5, so this header caching optimization might not make any sense at that point.

Thanks,
george.

On Aug 13, 2007, at 10:38 AM, Jeff Squyres wrote:
> On Aug 13, 2007, at 10:34 AM, Jeff Squyres wrote:
>> All this being said -- is there another reason to lower our latency?
>> My main goal here is to lower the latency. If header caching is
>> unattractive, then another method would be fine.
> Oops: s/reason/way/. That makes my sentence make much more sense. :-)
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 10:49 AM, George Bosilca wrote:

> You want a dirtier trick for benchmarks ... Here it is ... Implement
> a compression-like algorithm based on checksums. The data-type engine
> can compute a checksum for each fragment, and if the checksum matches
> one in the peer's [limited] history (so we can claim our
> communication protocol is adaptive), then we replace the actual
> message content with the matched id in the common history. Checksums
> are fairly cheap, and lookup in a balanced tree is cheap too, so we
> will end up with a lot of improvement (as instead of sending a full
> fragment we will end up sending one int). Based on the way most of
> the benchmarks initialize the user data (when they don't, everything
> is mostly 0), this trick might work in all cases for the benchmarks ...

Are you sure you didn't want to publish a paper about this before you sent it across a public list? Now someone else is likely to "invent" this scheme and get credit for it. ;-)

Such a scheme is certainly possible, but I see even less use for it than use cases for the existing microbenchmarks. Specifically, header caching *can* happen in real applications (i.e., repeatedly sending short messages with the same MPI signature), but repeatedly sending to the same peer with exactly the same signature *and* exactly the same "long-enough" data (i.e., more than a small number of ints that an app could use for its own message data caching) is indicative of a poorly-written MPI application IMHO.

> But don't complain if your Linpack run fails.

I assume you're talking about bugs in the implementation; not a problem with the approach, right?

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
On Sun, 12 Aug 2007, Gleb Natapov wrote:

> > Any objections? We can discuss what approaches we want to take
> > (there's going to be some complications because of the PML driver,
> > etc.); perhaps in the Tuesday Mellanox teleconf...?
>
> My main objection is that the only reason you propose to do this is
> some bogus benchmark? Is there any other reason to implement header
> caching? I also hope you don't propose to break layering and somehow
> cache PML headers in BTL.

Gleb is hitting the main points I wanted to bring up. We had examined this header caching in the context of PSM a little while ago. 0.5us is much more than we had observed -- at 3GHz, 0.5us would be about 1500 cycles of code that has few branches. For us, with a much bigger header and more fields to fetch from different structures, it was more like 350 cycles, which is on the order of 0.1us and not worth the effort (in code complexity, readability and, frankly, motivation for performance).

Maybe there's more to it than just "code caching" -- like sending from pre-pinned headers or using RDMA with immediate, etc. But I'd be surprised to find out that the openib btl doesn't do the best thing here.

I have pretty good evidence that for CM, the latency difference comes from the receive side (in particular opal_progress). Doesn't the openib btl receive side do something similar with opal_progress, i.e. register a callback function? It probably does something different, like check a few RDMA mailboxes (or per-peer landing pads), but anything that gets called before or after it as part of opal_progress is cause for slowdown.

. . christian

-- christian.b...@qlogic.com (QLogic Host Solutions Group, formerly Pathscale)
Re: [OMPI devel] openib btl header caching
You want a dirtier trick for benchmarks ... Here it is ... Implement a compression-like algorithm based on checksums. The data-type engine can compute a checksum for each fragment, and if the checksum matches one in the peer's [limited] history (so we can claim our communication protocol is adaptive), then we replace the actual message content with the matched id in the common history. Checksums are fairly cheap, and lookup in a balanced tree is cheap too, so we will end up with a lot of improvement (as instead of sending a full fragment we will end up sending one int). Based on the way most of the benchmarks initialize the user data (when they don't, everything is mostly 0), this trick might work in all cases for the benchmarks ... But don't complain if your Linpack run fails.

george.

On Aug 13, 2007, at 10:39 AM, Gleb Natapov wrote:
> On Mon, Aug 13, 2007 at 10:36:19AM -0400, Jeff Squyres wrote:
>> In short: it's an even dirtier trick than header caching (for
>> example), and we'd get beat up about it.
> That was a joke :) (But 3D drivers really do such things :( )
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 10:36:19AM -0400, Jeff Squyres wrote:
> On Aug 13, 2007, at 6:36 AM, Gleb Natapov wrote:
>
> >> Pallas, Presta (as I know) also use a static rank. So let's start
> >> to fix all "bogus" benchmarks :-) ?
> >>
> > All benchmarks are bogus. I have a better optimization. Check the
> > name of the executable and if it is some known benchmark, send one
> > byte instead of the real message. 3D drivers do this, why can't we?
>
> Because we'd end up in an arms race of benchmark argv[0] names and
> what is hard-coded in Open MPI. Users/customers/partners would soon
> enough figure out that this is what we're doing and either use "mv"
> or "ln -s" to get around our hack and see the real numbers anyway.
>
> In short: it's an even dirtier trick than header caching (for
> example), and we'd get beat up about it.
>
That was a joke :) (But 3D drivers really do such things :( )

-- Gleb.
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 10:34 AM, Jeff Squyres wrote: All this being said -- is there another reason to lower our latency? My main goal here is to lower the latency. If header caching is unattractive, then another method would be fine. Oops: s/reason/way/. That makes my sentence make much more sense. :-) -- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 6:36 AM, Gleb Natapov wrote:

>> Pallas, Presta (as I know) also use a static rank. So let's start to
>> fix all "bogus" benchmarks :-) ?
>
> All benchmarks are bogus. I have a better optimization. Check the
> name of the executable and if it is some known benchmark, send one
> byte instead of the real message. 3D drivers do this, why can't we?

Because we'd end up in an arms race of benchmark argv[0] names and what is hard-coded in Open MPI. Users/customers/partners would soon enough figure out that this is what we're doing and either use "mv" or "ln -s" to get around our hack and see the real numbers anyway.

In short: it's an even dirtier trick than header caching (for example), and we'd get beat up about it.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] openib btl header caching
On Aug 12, 2007, at 3:49 PM, Gleb Natapov wrote:

>> - Mellanox tested MVAPICH with the header caching; latency was
>>   around 1.4us
>> - Mellanox tested MVAPICH without the header caching; latency was
>>   around 1.9us
>
> As far as I remember the Mellanox results, and according to our
> testing, the difference between MVAPICH with header caching and OMPI
> is 0.2-0.3us, not 0.5us. And MVAPICH without header caching is
> actually worse than OMPI for small messages.

I guess reading the graph that Pasha sent is difficult; Pasha -- can you send the actual numbers?

>> Given that OMPI is the lone outlier around 1.9us, I think we have no
>> choice except to implement the header caching and/or examine our
>> header to see if we can shrink it. Mellanox has volunteered to
>> implement header caching in the openib btl.
>
> I think we have a choice. Not implement header caching, but just
> change the osu_latency benchmark to send each message with a
> different tag :)

If only. :-) But that misses the point (and the fact that all the common ping-pong benchmarks use a single tag: NetPIPE, IMB, osu_latency, etc.). *All other MPI's* give us latency around 1.4us, but Open MPI is around 1.9us. So we need to do something. Are we optimizing for a benchmark? Yes. But we have to do it. Many people know that these benchmarks are fairly useless, but not enough -- too many customers do not, and education is not enough. "Sure, this MPI looks slower but, really, it isn't. Trust me; my name is Joe Isuzu." That's a hard sell.

> I am not against header caching per se, but if it will complicate the
> code even a little bit I don't think we should implement it just to
> benefit one fabricated benchmark (AFAIR before header caching was
> implemented in MVAPICH, mpi_latency actually sent messages with
> different tags).

That may be true and a reason for us to wail and gnash our teeth, but it doesn't change the current reality.

> Also there is really nothing to cache in the openib BTL. The openib
> BTL header is 4 bytes long. The caching will have to be done in OB1
> and there it will affect every other interconnect.

Surely there is *something* we can do -- what, exactly, is the objection to peeking inside the PML header down in the btl? Is it really so horrible for a btl to look inside the upper layer's header? I agree that the PML looking into a btl header would [obviously] be Bad.

All this being said -- is there another reason to lower our latency? My main goal here is to lower the latency. If header caching is unattractive, then another method would be fine.

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] Problem in mpool rdma finalize
Jeff Squyres wrote:
> FWIW: we fixed this recently in the openib BTL by ensuring that all
> registered memory is freed during the BTL finalize (vs. the mpool
> finalize). This is a new issue because the mpool finalize was just
> recently expanded to un-register all of its memory as part of the
> NIC-restart effort (and will likely also be needed for
> checkpoint/restart...?).

mpool rdma finalize was an empty function. I changed it to do the "finalize" job - go over all registered segments in the mpool and release them one by one. The mpool uses a reference counter for each memory region, and that prevents us from a double-free bug. In the openib btl, all memory that was registered with the mpool will also be unregistered with the mpool at the finalize stage. So maybe in gm the memory (that was registered with the mpool) is released directly (not via the mpool) and that causes the segfault.

Pasha

> On Aug 13, 2007, at 9:11 AM, Tim Prins wrote:
>
> Hi folks, I have run into a problem with mca_mpool_rdma_finalize as
> implemented in r15557. With the t_win onesided test, running over gm,
> it segfaults. What appears to be happening is that some memory is
> registered with gm, and then gets freed by mca_mpool_rdma_finalize.
> But the free function that it is using is in the gm btl, and the btls
> are unloaded before the mpool is shut down. So the function call
> segfaults. If I change the code so we never unload the btls (and we
> don't free the gm port), it works fine. Note that the openib btl
> works just fine. Forgive me if this is a known problem, I am trying
> to catch up from my vacation... Tim
>
> --- If anyone cares, here is the callstack:
>
> (gdb) bt
> #0  0x404de825 in ?? () from /lib/libgcc_s.so.1
> #1  0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690) at mpool_rdma_module.c:431
> #2  0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
> #3  0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
> #4  0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
> #5  0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
> (gdb)
>
> gdb shows that at this point the gm btl is no longer loaded.
Re: [OMPI devel] Problem in mpool rdma finalize
FWIW: we fixed this recently in the openib BTL by ensuring that all registered memory is freed during the BTL finalize (vs. the mpool finalize). This is a new issue because the mpool finalize was just recently expanded to un-register all of its memory as part of the NIC-restart effort (and will likely also be needed for checkpoint/restart...?).

On Aug 13, 2007, at 9:11 AM, Tim Prins wrote:

> Hi folks, I have run into a problem with mca_mpool_rdma_finalize as
> implemented in r15557. With the t_win onesided test, running over gm,
> it segfaults. What appears to be happening is that some memory is
> registered with gm, and then gets freed by mca_mpool_rdma_finalize.
> But the free function that it is using is in the gm btl, and the btls
> are unloaded before the mpool is shut down. So the function call
> segfaults. If I change the code so we never unload the btls (and we
> don't free the gm port), it works fine. Note that the openib btl
> works just fine. Forgive me if this is a known problem, I am trying
> to catch up from my vacation... Tim
>
> --- If anyone cares, here is the callstack:
>
> (gdb) bt
> #0  0x404de825 in ?? () from /lib/libgcc_s.so.1
> #1  0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690) at mpool_rdma_module.c:431
> #2  0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
> #3  0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
> #4  0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
> #5  0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
> (gdb)
>
> gdb shows that at this point the gm btl is no longer loaded.

-- Jeff Squyres Cisco Systems
[OMPI devel] Problem in mpool rdma finalize
Hi folks,

I have run into a problem with mca_mpool_rdma_finalize as implemented in r15557. With the t_win onesided test, running over gm, it segfaults. What appears to be happening is that some memory is registered with gm, and then gets freed by mca_mpool_rdma_finalize. But the free function that it is using is in the gm btl, and the btls are unloaded before the mpool is shut down. So the function call segfaults. If I change the code so we never unload the btls (and we don't free the gm port), it works fine. Note that the openib btl works just fine.

Forgive me if this is a known problem, I am trying to catch up from my vacation...

Tim

--- If anyone cares, here is the callstack:

(gdb) bt
#0  0x404de825 in ?? () from /lib/libgcc_s.so.1
#1  0x4048081a in mca_mpool_rdma_finalize (mpool=0x925b690) at mpool_rdma_module.c:431
#2  0x400caca9 in mca_mpool_base_close () at base/mpool_base_close.c:57
#3  0x40060094 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:304
#4  0x4009a4c9 in PMPI_Finalize () at pfinalize.c:44
#5  0x08049946 in main (argc=1, argv=0xbfe16924) at t_win.c:214
(gdb)

gdb shows that at this point the gm btl is no longer loaded.
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 4:06 AM, Pavel Shamis (Pasha) wrote:

>>> Any objections? We can discuss what approaches we want to take (there's
>>> going to be some complications because of the PML driver, etc.); perhaps
>>> in the Tuesday Mellanox teleconf...?
>>
>> My main objection is that the only reason you propose to do this is some
>> bogus benchmark?
>
> Pallas, Presta (as I know) also use a static rank. So let's start to fix
> all "bogus" benchmarks :-) ?
>
> Pasha.

Why not:

    for (i = 0; i < ITERATIONS; i++) {
        tag = i % MPI_TAG_UB;
        ...
    }

On a related note, we have often discussed the fact that benchmarks only give an upper bound on performance. I would expect that some users would also want to know the lower bound. For example, set a flag that causes the benchmark to use a different buffer each time in order to cause the registration cache to miss. I am sure we could come up with some other cases.

Scott
Re: [OMPI devel] openib btl header caching
Jeff Squyres wrote:

> With Mellanox's new HCA (ConnectX), extremely low latencies are possible
> for short messages between two MPI processes. Currently, OMPI's latency
> is around 1.9us while all the other MPIs (HP MPI, Intel MPI, MVAPICH[2],
> etc.) are around 1.4us. A big reason for this difference is that, at
> least with MVAPICH[2], they are doing wire-protocol header caching where
> the openib BTL does not. Specifically:
>
> - Mellanox tested MVAPICH with the header caching; latency was around 1.4us
> - Mellanox tested MVAPICH without the header caching; latency was around 1.9us
>
> Given that OMPI is the lone outlier around 1.9us, I think we have no
> choice except to implement the header caching and/or examine our header
> to see if we can shrink it. Mellanox has volunteered to implement header
> caching in the openib BTL.
>
> Any objections? We can discuss what approaches we want to take (there's
> going to be some complications because of the PML driver, etc.); perhaps
> in the Tuesday Mellanox teleconf...?

This sounds great. Sun would like to hear how things are being done so we can possibly port the solution to the udapl BTL.

--td
Re: [OMPI devel] openib btl header caching
On Mon, Aug 13, 2007 at 11:06:00AM +0300, Pavel Shamis (Pasha) wrote:
> >> Any objections? We can discuss what approaches we want to take
> >> (there's going to be some complications because of the PML driver,
> >> etc.); perhaps in the Tuesday Mellanox teleconf...?
> >
> > My main objection is that the only reason you propose to do this is some
> > bogus benchmark?
>
> Pallas, Presta (as I know) also use a static rank. So let's start to fix
> all "bogus" benchmarks :-) ?

All benchmarks are bogus. I have a better optimization: check the name of the executable, and if it is some known benchmark, send one byte instead of the real message. 3D drivers do this, why can't we?

--
Gleb.
Re: [OMPI devel] openib btl header caching
>> Any objections? We can discuss what approaches we want to take (there's
>> going to be some complications because of the PML driver, etc.); perhaps
>> in the Tuesday Mellanox teleconf...?
>
> My main objection is that the only reason you propose to do this is some
> bogus benchmark?

Pallas, Presta (as I know) also use a static rank. So let's start to fix all "bogus" benchmarks :-) ?

Pasha.