Re: [OMPI devel] oshmem test suite errors
On Feb 20, 2014, at 7:10 AM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> For all of these, I'm using the openshmem test suite that is now committed to
> the ompi-svn SVN repo. I don't know if the errors are with the tests or with
> oshmem itself.
>
> 1. I'm running the oshmem test suite at 32 processes across 2 16-core
> servers. I'm seeing a segv in "examples/shmem_2dheat.x 10 10". It seems to
> run fine at lower np values such as 2, 4, and 8; I didn't try to determine
> where the crossover to badness occurs.

My memory is bad and my notes are on a machine I no longer have access to, but I did this to the test suite run for Portals SHMEM:

Index: shmem_2dheat.c
===
--- shmem_2dheat.c (revision 270)
+++ shmem_2dheat.c (revision 271)
@@ -129,6 +129,11 @@
   p = _num_pes ();
   my_rank = _my_pe ();
 
+  if (p > 8) {
+      fprintf(stderr, "Ignoring test when run with more than 8 pes\n");
+      return 77;
+  }
+
   /* argument processing done by everyone */
   int c, errflg;
   extern char *optarg;

The commit comment was that there was a scaling issue in the code itself, I just wish I could remember exactly what it was.

> 2. "examples/adjacent_32bit_amo.x 10 10" seems to hang with both tcp and
> usnic BTLs, even when running at np=2 (I let it run for several minutes
> before killing it).

If atomics aren't fast, this test can run for a very long time (also, it takes no arguments, so the 10 10 is being ignored). It's essentially looking for a race by blasting 32-bit atomic ops at both parts of a 64 bit word.

> 3. Ditto for "example/ptp.x 10 10".
>
> 4. "examples/shmem_matrix.x 10 10" seems to run fine at np=32 on usnic, but
> hangs with TCP (i.e., I let it run for 8+ minutes before killing it --
> perhaps it would have finished eventually?).
>
> ...there's more results (more timeouts and more failures), but they're not
> yet complete, and I've got to keep working on my own features for v1.7.5, so
> I need to move to other things right now.

These start to sound like issues in the code; those last two are pretty decent tests.

> I think I have oshmem running well enough to add these to Cisco's nightly MTT
> runs now, so the results will start showing up there without needing my
> manual attention.

Woot.

Brian

--
Brian Barrett

There is an art . . . to flying. The knack lies in learning how to
throw yourself at the ground and miss.
Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
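To make the idea behind the adjacent 32-bit atomics test concrete, here is a minimal sketch of the same kind of check. It is not the test from the suite and does not use SHMEM at all: it assumes C11 atomics and POSIX threads as stand-ins for remote atomic operations, and every name in it (adjacent_pair_t, hammer_half, run_adjacent_amo_check, ITERATIONS) is hypothetical. Two threads each increment their own 32-bit half of one 64-bit word; if 32-bit atomics were implemented in a way that disturbs the neighboring half, the final counts would be wrong.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define ITERATIONS 100000 /* hypothetical iteration count */

/* Two 32-bit atomic counters laid out back to back so that, on common
 * platforms, together they occupy one aligned 64-bit word. */
typedef struct {
    _Alignas(8) _Atomic uint32_t half[2];
} adjacent_pair_t;

static adjacent_pair_t pair;

/* Each thread increments only the half it was handed. */
static void *hammer_half(void *arg)
{
    _Atomic uint32_t *target = arg;
    for (int i = 0; i < ITERATIONS; i++) {
        atomic_fetch_add(target, 1);
    }
    return NULL;
}

/* Returns 1 if both halves end up at exactly ITERATIONS,
 * 0 if an update was lost or clobbered, -1 if a thread could not start. */
int run_adjacent_amo_check(void)
{
    pthread_t t[2];

    atomic_store(&pair.half[0], 0);
    atomic_store(&pair.half[1], 0);

    if (pthread_create(&t[0], NULL, hammer_half, &pair.half[0]) != 0) {
        return -1;
    }
    if (pthread_create(&t[1], NULL, hammer_half, &pair.half[1]) != 0) {
        pthread_join(t[0], NULL);
        return -1;
    }
    pthread_join(t[0], NULL);
    pthread_join(t[1], NULL);

    return atomic_load(&pair.half[0]) == ITERATIONS &&
           atomic_load(&pair.half[1]) == ITERATIONS;
}
```

The run time is dominated by the atomic increments, which is consistent with the remark above that slow atomics make this kind of test take a very long time.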
Re: [OMPI devel] RFC: new OMPI RTE define:
And what will you do for RTE components that aren't ORTE? This really isn't a feature of a run-time, so it doesn't seem like it should be part of the RTE interface...

Brian

On Feb 17, 2014, at 3:03 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> WHAT: New OMPI_RTE_EVENT_BASE define
>
> WHY: The usnic BTL needs to run some events asynchronously; the ORTE event
> base already exists and is running asynchronously in MPI processes
>
> WHERE: in ompi/mca/rte/rte.h and rte_orte.h
>
> TIMEOUT: COB Friday, 21 Feb 2014
>
> MORE DETAIL:
>
> The WHY line described it pretty well: we want to run some things
> asynchronously in the usnic BTL and we don't really want to re-invent the
> wheel (or add yet another thread in each MPI process). The ORTE event base
> is already there, there's already a thread servicing it, and Ralph tells me
> that it is safe to add our own events on to it.
>
> The patch below adds the new OMPI_RTE_EVENT_BASE #define.
>
>
> diff --git a/ompi/mca/rte/orte/rte_orte.h b/ompi/mca/rte/orte/rte_orte.h
> index 3c88c6d..3ceadb8 100644
> --- a/ompi/mca/rte/orte/rte_orte.h
> +++ b/ompi/mca/rte/orte/rte_orte.h
> @@ -142,6 +142,9 @@ typedef struct {
>  } ompi_orte_tracker_t;
>  OBJ_CLASS_DECLARATION(ompi_orte_tracker_t);
>
> +/* define the event base that the RTE exports */
> +#define OMPI_RTE_EVENT_BASE orte_event_base
> +
>  END_C_DECLS
>
>  #endif /* MCA_OMPI_RTE_ORTE_H */
> diff --git a/ompi/mca/rte/rte.h b/ompi/mca/rte/rte.h
> index 69ad488..de10dff 100644
> --- a/ompi/mca/rte/rte.h
> +++ b/ompi/mca/rte/rte.h
> @@ -150,7 +150,9 @@
>  *a. OMPI_DB_HOSTNAME
>  *b. OMPI_DB_LOCALITY
>  *
> - * (g) Communication support
> + * (g) Asynchronous / event support
> + * 1. OMPI_RTE_EVENT_BASE - the libevent base that executes in a
> + *separate thread
>  *
>  */
>
> @@ -162,6 +164,7 @@
>  #include "opal/dss/dss_types.h"
>  #include "opal/mca/mca.h"
>  #include "opal/mca/base/base.h"
> +#include "opal/mca/event/event.h"
>
>  BEGIN_C_DECLS
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>

--
Brian Barrett

There is an art . . . to flying. The knack lies in learning how to
throw yourself at the ground and miss.
Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
Re: [OMPI devel] Updating the trunk
On Jun 30, 2009, at 5:57 PM, Ralph Castain wrote:

If you are updating a prior checkout of the OMPI trunk with r21568, please be aware that there is an additional step required to make it build. Due to a quirk of the build system, you will need to do:

rm ompi/tools/ompi_info/.deps/*

and then re-run autogen/configure in order to build.

The reason this is required is that the new ompi_info implementation generates .o files of the same name as the prior C++ implementation. As a result, the .deps files do not get updated - and therefore insist on looking for the old .cc files. Removing the .deps and re-running autogen/configure will resolve the problem.

If you are doing a fresh checkout of the OMPI trunk, this will not affect you.

Slightly less safe, but you can also do the rm command Ralph gave, followed by a "make -k", which will regenerate just that makefile, then update the .deps files, then build the sources. You probably want to do a plain-old make after that to make sure nothing failed in the build, as make will then report any error that occurred during the "make -k" run.

Brian
Re: [OMPI devel] trac ticket 1944 and pending sends
Or go to what I proposed and USE A LINKED LIST! (as I said before, not an original idea, but one I think has merit) Then you don't have to size the fifo, because there isn't a fifo. Limit the number of send fragments any one proc can allocate and the only place memory can grow without bound is the OB1 unexpected list. Then use SEND_COMPLETE instead of SEND_NORMAL in the collectives without barrier semantics (bcast, reduce, gather, scatter) and you effectively limit how far ahead any one proc can get to something that we can handle, with no performance hit.

Brian

On Jun 24, 2009, at 12:46 AM, George Bosilca wrote:

In other words, as long as a queue is peer based (peer not peers), the management of the pending send list was doing what it was supposed to, and there was no possibility of deadlock. With the new code, as a third party can fill up a remote queue, getting a fragment back [as you stated] became a poor indicator for retry.

I don't see how the proposed solution will solve the issue without a significant overhead. As we only call MCA_BTL_SM_FIFO_WRITE once before the fragment gets into the pending list, reordering the fragments will not solve the issue. When the peer is overloaded, the fragments will end up in the pending list, and there is nothing to get them out of there except a message from the peer. In some cases, such a message might never be delivered, simply because the peer doesn't have any data to send us.

The other solution is to always check all pending lists. While this might work, it will certainly add undesirable overhead to the send path.

Your last patch was doing the right thing. Globally decreasing the size of the memory used by the MPI library is _the right_ way to go. Unfortunately, your patch only addresses this at the level of the shared memory file. Now, instead of using less memory we use even more because we have to store that data somewhere ... in the fragments returned by the btl_sm_alloc function. These fragments are allocated on demand and by default there is no limit to the number of such fragments.

Here is a simple fix for both problems. Enforce a reasonable limit on the number of fragments in the BTL free list (1K should be more than enough), and make sure the fifo has a size equal to p * number_of_allowed_fragments_in_the_free_list, where p is the number of local processes. While this solution will certainly again increase the size of the mapped file, it will do so by a small margin compared with what is happening today in the code. This is without talking about the fact that it will solve the deadlock problem, by removing the inability to return a fragment. In addition, the PML is capable of handling such situations, so we're getting back to a deadlock-free sm BTL.

george.

On Jun 23, 2009, at 11:04 , Eugene Loh wrote:

The sm BTL used to have two mechanisms for dealing with congested FIFOs. One was to grow the FIFOs. Another was to queue pending sends locally (on the sender's side). I think the grow-FIFO mechanism was typically invoked and the pending-send mechanism used only under extreme circumstances (no more memory).

With the sm makeover of 1.3.2, we dropped the ability to grow FIFOs. The code added complexity and there seemed to be no need to have two mechanisms to deal with congested FIFOs. In ticket 1944, however, we see that repeated collectives can produce hangs, and this seems to be due to the pending-send code not adequately dealing with congested FIFOs.

Today, when a process tries to write to a remote FIFO and fails, it queues the write as a pending send. The only condition under which it retries pending sends is when it gets a fragment back from a remote process.

I think the logic must have been that the FIFO got congested because we issued too many sends. Getting a fragment back indicates that the remote process has made progress digesting those sends. In ticket 1944, we see that a FIFO can also get congested from too many returning fragments. Further, with shared FIFOs, a FIFO could become congested due to the activity of a third-party process.

In sum, getting a fragment back from a remote process is a poor indicator that it's time to retry pending sends.

Maybe the real way to know when to retry pending sends is just to check if there's room on the FIFO.

So, I'll try modifying MCA_BTL_SM_FIFO_WRITE. It'll start by checking if there are pending sends. If so, it'll retry them before performing the requested write. This should also help preserve ordering a little better. I'm guessing this will not hurt our message latency in any meaningful way, but I'll check this out.

Meanwhile, I wanted to check in with y'all for any guidance you might have.

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
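The "retry pending sends before performing the requested write" idea can be sketched in a few lines of C. This is only an illustration under simplifying assumptions, not the sm BTL code: it is single-threaded, the FIFO is a plain ring buffer of ints rather than shared memory, and every name (fifo_t, pending_list_t, send_fragment, FIFO_SLOTS, MAX_PENDING) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_SLOTS  4   /* hypothetical capacity of the receiver's FIFO */
#define MAX_PENDING 16  /* hypothetical cap on locally queued sends */

typedef struct {
    int    slots[FIFO_SLOTS];
    size_t head, count;
} fifo_t;

typedef struct {
    int    items[MAX_PENDING];
    size_t head, count;
} pending_list_t;

/* Returns 1 if the fragment was written, 0 if the FIFO is full. */
int fifo_try_write(fifo_t *f, int frag)
{
    if (f->count == FIFO_SLOTS) {
        return 0;
    }
    f->slots[(f->head + f->count) % FIFO_SLOTS] = frag;
    f->count++;
    return 1;
}

/* Receiver side: returns 1 and stores the oldest fragment, or 0 if empty. */
int fifo_read(fifo_t *f, int *frag)
{
    if (f->count == 0) {
        return 0;
    }
    *frag = f->slots[f->head];
    f->head = (f->head + 1) % FIFO_SLOTS;
    f->count--;
    return 1;
}

/* Sender side: drain older pending sends first, then attempt the new write.
 * Returns 1 if frag reached the FIFO, 0 if it was queued as pending,
 * -1 if the pending list itself is full. */
int send_fragment(fifo_t *f, pending_list_t *p, int frag)
{
    /* Retry pending sends in order for as long as the FIFO has room. */
    while (p->count > 0 && fifo_try_write(f, p->items[p->head])) {
        p->head = (p->head + 1) % MAX_PENDING;
        p->count--;
    }
    /* Write directly only if nothing older is still waiting. */
    if (p->count == 0 && fifo_try_write(f, frag)) {
        return 1;
    }
    if (p->count == MAX_PENDING) {
        return -1;
    }
    p->items[(p->head + p->count) % MAX_PENDING] = frag;
    p->count++;
    return 0;
}
```

In this sketch the retry is triggered by the next send attempt rather than by a returned fragment, and a new fragment never overtakes an older pending one, which is the ordering benefit mentioned above. It does not address the case raised in the reply, where a sender with pending fragments makes no further send calls.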
Re: [OMPI devel] CMR one-sided changes? (r21134)
Yeah, putting together a CMR is on the todo list :). Brian -- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. On May 20, 2009, at 12:41, Jeff Squyres <jsquy...@cisco.com> wrote: Brian: can we CMR over your OSD changes from 30 Apr (r21134)? I have noticed an enormous performance difference between the pt2pt and rdma osc components when running the IMB-EXT benchmark over IB: - pt2pt: 11+ minutes - rdma: 43 seconds rdma is the default on the trunk, since r21134 (https://svn.open-mpi.org/trac/ompi/changeset/21134 ). pt2pt is still the default on v1.3. There's a conflict in ompi/mca/osc/rdma/osc_rdma_sync.c, so I don't quite know how to proceed... -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] RFC: Warn user about deprecated MPI functionality and "wrong" compiler usage
I think care must be taken on this front. While I know we don't like to admit it, there is no reason the C compilers have to match, and indeed good reasons they might not. For example, at LANL, we frequently compiled OMPI with GCC, then fixed up the wrapper compilers to use icc or whatever, to work around optimizer bugs. This is functionality I don't think should be lost just to warn about deprecated functions.

Brian

--
Brian Barrett

There is an art . . . to flying. The knack lies in learning how to
throw yourself at the ground and miss.

On May 18, 2009, at 1:34, Rainer Keller <kel...@ornl.gov> wrote:

What: Warn user about deprecated MPI functionality and "wrong" compiler usage

Why: Because deprecated MPI functions are ... deprecated

Where: On trunk

When: Apply on trunk before branching for v1.5 (it is user-visible)

Timeout: 1 week - May 26, 2009 after the teleconf.

-

I'd like to propose a patch that addresses two issues:
- Users shoot themselves in the foot compiling with a different compiler than what was used to compile OMPI (think ABI)
- The MPI-2.1 std. defines several functions to be deprecated.

This will warn Open MPI users when accessing deprecated functions, even giving a proper warning such as:
"MPI_TYPE_HVECTOR is superseded by MPI_TYPE_CREATE_HVECTOR"
Also, now we may _warn_ when using a different compiler (gcc vs. intel vs. pgcc)

This is achieved using __opal_attribute_deprecated__ and obviously needs to be added into mpi.h, therefore being a user-visible change.

This however has a few caveats:
1.) Having Open MPI compiled with gcc and having users compiling with another compiler, which does not support __attribute__((deprecated)), is going to be a problem
2.) The attribute is most useful when it has a proper description (as above) -- which requires support for the optional argument to __deprecated__. This feature is offered only in gcc>4.4 (see http://gcc.gnu.org/ml/gcc-patches/2009-04/msg00087.html).

Therefore, I added a configure-check for the compiler's support of the optional argument. And we need to store which compiler is used to compile Open MPI and, at (user-app) compile-time, again check (within mpi.h) which compiler (and version!) is being used. This is then compared at user-level compile-time.

To prevent users from getting swamped with error messages, this is controlled by the configure option:
--enable-mpi-interface-warning
which turns on OMPI_WANT_MPI_INTERFACE_WARNING (default: DISabled), as suggested by Jeff.

The user can however override that with (check mpi2basic_tests):
mpicc -DOMPI_WANT_MPI_INTERFACE_WARNING -c lalala.c
lots of warnings follow

Please take a look into:
http://bitbucket.org/jsquyres/ompi-deprecated/

With best regards,
Rainer

PS: Also, we need to disable the warning when building Open MPI itself ;-)

PPS: Thanks to Paul Hargrove and Dan Bonachea for the GASnet file portable_platform.h, which offers the CPP magic to figure out compilers and esp. compiler-versions.

--
Rainer Keller, PhD                  Tel: +1 (865) 241-6293
Oak Ridge National Lab              Fax: +1 (865) 241-4811
PO Box 2008 MS 6164                 Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008            AIM/Skype: rusraink

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
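As a rough illustration of the two caveats in the RFC, a header can pick between the deprecated attribute with a message, the plain attribute, and nothing at all. The sketch below is not the proposed patch: it uses a compile-time version test instead of the configure check the RFC describes, and SKETCH_DEPRECATED, sketch_type_hvector and sketch_type_create_hvector are hypothetical stand-ins, not Open MPI names. It assumes the message argument is available from GCC 4.5 onward, in line with the "gcc>4.4" remark above.

```c
#include <assert.h>

/* Choose the strongest form of the attribute the compiler is known to take. */
#if defined(__GNUC__) && \
    (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5))
#  define SKETCH_DEPRECATED(msg) __attribute__((__deprecated__(msg)))
#elif defined(__GNUC__)
   /* Older GNU-compatible compilers: attribute without the message. */
#  define SKETCH_DEPRECATED(msg) __attribute__((__deprecated__))
#else
   /* Unknown compiler: expand to nothing so the header still compiles. */
#  define SKETCH_DEPRECATED(msg)
#endif

/* Replacement entry point (hypothetical stand-in). */
int sketch_type_create_hvector(int count);

/* Old entry point: still usable, but callers get a compile-time warning. */
SKETCH_DEPRECATED("sketch_type_hvector is superseded by "
                  "sketch_type_create_hvector")
int sketch_type_hvector(int count);

int sketch_type_create_hvector(int count)
{
    return count;
}

int sketch_type_hvector(int count)
{
    /* The old name simply forwards to the new one. */
    return sketch_type_create_hvector(count);
}
```

The empty fallback in the last branch is what keeps a header like this usable when the application is built with a different compiler than the library, which is the situation described in the reply above.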
Re: [OMPI devel] Revise paffinity method?
Jumping in late (travelling this morning). I think this is the right answer :). Brian -- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. On May 8, 2009, at 9:45, Ralph Castain <r...@open-mpi.org> wrote: I think that's the way to go then - it also follows our "the user is always right - even when they are wrong" philosophy. I'll probably have to draw on others to help ensure that the paffinity modules all report appropriately. Think I have enough now to start on this - probably middle of next week. Thanks! On May 8, 2009, at 8:37 AM, Jeff Squyres wrote: On May 8, 2009, at 10:32 AM, Ralph Castain wrote: Actually, I was wondering (hot tub thought for the night) if the paffinity system can't just tell us if the proc has been bound or not? That would remove the need for YAP (i.e., yet another param). Yes, it can. What it can't tell, though, is who set it. So a user may have overridden the paffinity after main() starts but before MPI_INIT is invoked. But perhaps that's not a crime -- users can override the paffinity at their own risk (we actually have no way to preventing them from doing so). So perhaps just checking if affinity is already set is a "good enough" mechanism for the MPI_INIT-set-paffinity logic to determine whether it should set affinity itself or not. -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] require newer autoconf?
I'd rather not. I have a couple of platforms with 2.59 installed, but not 2.60+. I really don't want to have to install my own autotools because of some bug that doesn't affect me.

I don't, however, have a problem with forcing users to upgrade in order to get support for build-related issues. The version of Autoconf used is in config.log, so it's not hard to find which version the user actually used.

Brian

On Mar 17, 2009, at 7:00 PM, Jeff Squyres wrote:

Per this thread:

http://www.open-mpi.org/community/lists/users/2009/03/8402.php

It took a *lng* time to figure out that an outdated Autoconf install was the culprit of the "restrict" mess. The issue is that somewhere between v2.61 and v2.63, Autoconf changed the order of looking for "restrict"-like keywords -- AC 2.63 has the "good" order; AC 2.61 has the "bad" order (hence, PGI worked for me with AC 2.63, but barfed for Mostyn with AC 2.61).

Should we have our autogen.sh force the use of AC 2.63 and above? (currently, it forces 2.59 and above)

--
Jeff Squyres
Cisco Systems

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] RFC: Move of ompi_bitmap_t
we need further discussions. --td Brian Barrett wrote: So once again, I bring up my objection of this entire line of moving until such time as the entire process is properly mapped out. I believe it's premature to being moving around code in preparation for a move that hasn't been proven viable yet. Until there is concrete evidence that such a move is possible, won't degrade application performance, and does not make the code totally unmaintainable, I believe that any related code changes should not be brought into the trunk. Brian On Jan 30, 2009, at 12:30 PM, Rainer Keller wrote: On behalf of Laurent Broto RFC: Move of ompi_bitmap_t WHAT: Move ompi_bitmap_t into opal or onet-layer WHY: Remove dependency on ompi-layer. WHERE: ompi/class WHEN: Open MPI-1.4 TIMEOUT: February 3, 2009. - Details: WHY: The ompi_bitmap_t is being used in various places within opal/orte/ompi. With the proposed splitting of BTLs into a separate library, we are currently investigating several of the differences between ompi/class/* and opal/class/* One of the items is the ompi_bitmap_t which is quite similar to the opal_bitmap_t. The question is, whether we can remove favoring a solution just in opal. WHAT: The data structures in the opal-version are the same, so is the interface, the implementation is *almost* the same The difference is the Fortran handles ;-]! Maybe we're missing something but could we have a discussion, on why Fortran sizes are playing a role here, and if this is a hard requirement, how we could settle that into that current interface (possibly without a notion of Fortran, but rather, set some upper limit that the bitmap may grow to?) 
With best regards, Laurent and Rainer

--
Rainer Keller, PhD                  Tel: (865) 241-6293
Oak Ridge National Lab              Fax: (865) 241-4811
PO Box 2008 MS 6164                 Email: kel...@ornl.gov
Oak Ridge, TN 37831-2008            AIM/Skype: rusraink

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems

--
Brian Barrett
Open MPI developer
http://www.open-mpi.org/
Re: [OMPI devel] RFC: Move of ompi_bitmap_t
In that case, I remove my objection to this particular RFC. It remains for all other RFCs related to moving any of the BTL move code to the trunk before the critical issues with the BTL move have been sorted out in a temporary branch. This includes renaming functions and such. Perhaps we should have a discussion about those issues during the Forum in a couple weeks? Brian On Feb 1, 2009, at 5:37 AM, Jeff Squyres wrote: I just looked through both opal_bitmap_t and ompi_bitmap_t and I think that the only real difference is that in the ompi version, we check (in various places) that the size of the bitmap never grows beyond OMPI_FORTRAN_HANDLE_MAX; the opal version doesn't do these kind of size checks. I think it would be fairly straightforward to: - add generic checks into the opal version, perhaps by adding a new API call (opal_bitmap_set_max_size()) - if the max size has been set, then ensure that the bitmap never grows beyond that size, otherwise let it have the same behavior as today (grow without bound -- assumedly until malloc() fails) It'll take a little care to ensure to merge the functionality correctly, but it is possible. Once that is done, you can: - remove the ompi_bitmap_t class - s/ompi_bitmap/opal_bitmap/g in the OMPI layer - add new calls to opal_bitmap_set_max_size(, OMPI_FORTRAN_HANDLE_MAX) in the OMPI layer (should only be in a few places -- probably one for each MPI handle type...? It's been so long since I've looked at that code that I don't remember offhand) I'd generally be in favor of this because, although this is not a lot of repeated code, it *is* repeated code -- so cleaning it up and consolidating the non-Fortran stuff down in opal is not a Bad Thing. On Jan 30, 2009, at 4:59 PM, Ralph Castain wrote: The history is simple. Originally, there was one bitmap_t in orte that was also used in ompi. Then the folks working on Fortran found that they had to put a limit in the bitmap code to avoid getting values outside of Fortran's range. 
However, this introduced a problem - if we had the limit in the orte version, then we limited ourselves unnecessarily, and introduced some abstraction questions since orte knows nothing about Fortran. So two were created. Then the orte_bitmap_t was blown away at a later time when we removed the GPR as George felt it wasn't necessary (which was true). It was later reborn when we needed it in the routed system, but this time it was done in opal as others indicated a potential more general use for that capability. The problem with uniting the two is that you either have to introduce Fortran-based limits into opal (which messes up the non- ompi uses), or deal with the Fortran limits in some other fashion. Neither is particularly pleasant, though it could be done. I think it primarily is a question for the Fortran folks to address - can they deal with Fortran limits in some other manner without making the code unmanageable and/or taking a performance hit? Ralph On Jan 30, 2009, at 2:40 PM, Richard Graham wrote: This should really be viewed as a code maintenance RFC. The reason this came up in the first place is because we are investigating the btl move, but these are really two very distinct issues. There are two bits of code that have virtually the same functionality - they do have the same interface I am told. The question is, is there a good reason to keep two different versions in the repository ? Not knowing the history of why a second version was created this is an inquiry. Is there some performance advantage, or some other advantage to having these two versions ? Rich On 1/30/09 3:23 PM, "Terry D. Dontje" <terry.don...@sun.com> wrote: I second Brian's concern. So unless this is just an announcement that this is being done on a tmp branch only until everything is in order I think we need further discussions. --td Brian Barrett wrote: So once again, I bring up my objection of this entire line of moving until such time as the entire process is properly mapped out. 
I believe it's premature to being moving around code in preparation for a move that hasn't been proven viable yet. Until there is concrete evidence that such a move is possible, won't degrade application performance, and does not make the code totally unmaintainable, I believe that any related code changes should not be brought into the trunk. Brian On Jan 30, 2009, at 12:30 PM, Rainer Keller wrote: On behalf of Laurent Broto RFC: Move of ompi_bitmap_t WHAT: Move ompi_bitmap_t into opal or onet-layer WHY: Remove dependency on ompi-layer. WHERE: ompi/class WHEN: Open MPI-1.4 TIMEOUT: February 3, 2009. - Details: WHY: The ompi_bitmap_t is being used in various places within opal/orte/ompi. With the proposed splitting of BTLs into a separate library, we are currently investigating several of
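The "optional upper limit" idea from the reply above (a generic bitmap that grows on demand unless a maximum size has been set) might look roughly like the following. This is a sketch under stated assumptions, not the opal_bitmap_t implementation: the type and the function names (sketch_bitmap_t, sketch_bitmap_set_max_size, and so on) are hypothetical, and the real class interface is not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    unsigned char *bits;
    int array_size; /* current size in bytes */
    int max_size;   /* largest allowed size in bits, or -1 for no limit */
} sketch_bitmap_t;

/* Allocate room for at least nbits bits; returns 0 on success, -1 on error. */
int sketch_bitmap_init(sketch_bitmap_t *bm, int nbits)
{
    if (nbits <= 0) {
        return -1;
    }
    bm->array_size = (nbits + 7) / 8;
    bm->max_size = -1;
    bm->bits = calloc((size_t)bm->array_size, 1);
    return bm->bits ? 0 : -1;
}

/* Layers that need a cap (e.g. a Fortran handle limit) set it here;
 * layers that never call this keep the grow-without-bound behavior. */
void sketch_bitmap_set_max_size(sketch_bitmap_t *bm, int max_bits)
{
    bm->max_size = max_bits;
}

/* Set one bit, growing the array if needed; -1 if out of range or no memory. */
int sketch_bitmap_set_bit(sketch_bitmap_t *bm, int bit)
{
    if (bit < 0 || (bm->max_size >= 0 && bit >= bm->max_size)) {
        return -1;
    }
    if (bit >= bm->array_size * 8) {
        int new_size = bm->array_size;
        while (bit >= new_size * 8) {
            new_size *= 2;
        }
        if (bm->max_size >= 0 && new_size > (bm->max_size + 7) / 8) {
            new_size = (bm->max_size + 7) / 8;
        }
        unsigned char *grown = realloc(bm->bits, (size_t)new_size);
        if (grown == NULL) {
            return -1;
        }
        /* Newly added bytes start out cleared. */
        memset(grown + bm->array_size, 0, (size_t)(new_size - bm->array_size));
        bm->bits = grown;
        bm->array_size = new_size;
    }
    bm->bits[bit / 8] |= (unsigned char)(1u << (bit % 8));
    return 0;
}

/* Returns 1 if the bit is set, 0 otherwise (including out-of-range bits). */
int sketch_bitmap_is_set(const sketch_bitmap_t *bm, int bit)
{
    if (bit < 0 || bit >= bm->array_size * 8) {
        return 0;
    }
    return (bm->bits[bit / 8] >> (bit % 8)) & 1;
}

void sketch_bitmap_free(sketch_bitmap_t *bm)
{
    free(bm->bits);
    bm->bits = NULL;
    bm->array_size = 0;
}
```

With this shape the generic layer carries no notion of Fortran; only the caller that knows about a handle limit passes it in, which is the separation the thread is asking for.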
Re: [OMPI devel] RFC: Move of ompi_bitmap_t
So once again, I bring up my objection of this entire line of moving until such time as the entire process is properly mapped out. I believe it's premature to being moving around code in preparation for a move that hasn't been proven viable yet. Until there is concrete evidence that such a move is possible, won't degrade application performance, and does not make the code totally unmaintainable, I believe that any related code changes should not be brought into the trunk. Brian On Jan 30, 2009, at 12:30 PM, Rainer Keller wrote: On behalf of Laurent Broto RFC: Move of ompi_bitmap_t WHAT: Move ompi_bitmap_t into opal or onet-layer WHY: Remove dependency on ompi-layer. WHERE: ompi/class WHEN: Open MPI-1.4 TIMEOUT: February 3, 2009. - Details: WHY: The ompi_bitmap_t is being used in various places within opal/orte/ ompi. With the proposed splitting of BTLs into a separate library, we are currently investigating several of the differences between ompi/class/* and opal/class/* One of the items is the ompi_bitmap_t which is quite similar to the opal_bitmap_t. The question is, whether we can remove favoring a solution just in opal. WHAT: The data structures in the opal-version are the same, so is the interface, the implementation is *almost* the same The difference is the Fortran handles ;-]! Maybe we're missing something but could we have a discussion, on why Fortran sizes are playing a role here, and if this is a hard requirement, how we could settle that into that current interface (possibly without a notion of Fortran, but rather, set some upper limit that the bitmap may grow to?) With best regards, Laurent and Rainer -- Rainer Keller, PhD Tel: (865) 241-6293 Oak Ridge National Lab Fax: (865) 241-4811 PO Box 2008 MS 6164 Email: kel...@ornl.gov Oak Ridge, TN 37831-2008AIM/Skype: rusraink ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI devel] RFC: sm Latency
I unfortunately don't have time to look in depth at the patch. But my concern is that currently (today, not at some made up time in the future, maybe), we use the BTLs for more than just MPI point-to- point. The rdma one-sided component (which was added for 1.3 and hopefully will be the default for 1.4) sends messages directly over the btls. It would be interesting to know how that is handled. Brian On Jan 20, 2009, at 6:53 PM, Jeff Squyres wrote: This all sounds really great to me. I agree with most of what has been said -- e.g., benchmarks *are* important. Improving them can even sometimes have the side effect of improving real applications. ;-) My one big concern is the moving of architectural boundaries of making the btl understand MPI match headers. But even there, I'm torn: 1. I understand why it is better -- performance-wise -- to do this. And the performance improvement results are hard to argue with. We took a similar approach with ORTE; ORTE is now OMPI-specific, and many, many things have become better (from the OMPI perspective, at least). 2. We all have the knee-jerk reaction that we don't want to have the BTLs know anything about MPI semantics because they've always been that way and it has been a useful abstraction barrier. Now there's even a project afoot to move the BTLs out into a separate later that cannot know about MPI (so that other things can be built upon it). But are we sacrificing potential MPI performance here? I think that's one important question. Eugene: you mentioned that there are other possibilities to having the BTL understand match headers, such as a callback into the PML. Have you tried this approach to see what the performance cost would be, perchance? I'd like to see George's reaction to this RFC, and Brian's (if he has time). 
On Jan 20, 2009, at 8:04 PM, Eugene Loh wrote: Patrick Geoffray wrote: Eugene Loh wrote: replace the fifo’s with a single link list per process in shared memory, with senders to this process adding match envelopes atomically, with each process reading its own link list (multiple *) Doesn't strike me as a "simple" change. Actually, it's much simpler than trying to optimize/scale the N^2 implementation, IMHO. 1) The version I talk about is already done. Check my putbacks. "Already done" is easier! :^) 2) The two ideas are largely orthogonal. The RFC talks about a variety of things: cleaning up the sendi function, moving the sendi call up higher in the PML, bypassing the PML receive-request structure (similar to sendi), and stream-lining the data convertors in common cases. Only one part of the RFC (directed polling) overlaps with having a single FIFO per receiver. *) Not sure this addresses all-to-all well. E.g., let's say you post a receive for a particular source. Do you then wade through a long FIFO to look for your match? The tradeoff is between demultiplexing by the sender, which cost in time and in space, or by the receiver, which cost an atomic inc. ANY_TAG forces you to demultiplex on the receive side anyway. Regarding all-to-all, it won't be more expensive if the receives are pre- posted, and they should be. Not sure I understand this paragraph. I do, however, think there are great benefits to the single-receiver-queue model. It implies congestion on the receiver side in the many-to-one case, but if a single receiver is reading all those messages anyhow, message-processing is already going to throttle the message rate. The extra "bottleneck" at the FIFO might never be seen. What the RFC talks about is not the last SM development we'll ever need. It's only supposed to be one step forward from where we are today. The "single queue per receiver" approach has many advantages, but I think it's a different topic. 
But is this intermediate step worth it or should we (well, you :-) ) go directly for the single queue model ? To recap: 1) The work is already done. 2) The single-queue model addresses only one of the RFC's issues. 3) I'm a fan of the single-queue model, but it's just a separate discussion. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] 1.3 PML default choice
George - I don't care what we end up doing, but what you state is wrong. We do not use the CM for all other MTLs by default. PSM is the *ONLY* MTL that will cause CM to be used by default. Portals still falls back to OB1 by default. Again, don't care, don't want to change, just want the documentation and current behavior to match. Brian On Jan 13, 2009, at 6:27 PM, George Bosilca wrote: This topic was raised on the mailing list quite a few times. There is a major difference between the PSM and the MX support. For PSM there is just an MTL, which makes everything a lot simpler. The problem with MX is that we have an MTL and a BTL. In order to figure out which one to use, we have to call the init function and this function initialize MX. The MTL use the default values for this, while the BTL give some hints to the MX library (about how to behave based on the support level we want, i.e. such as who will deal with shared memory or self communications). As there can be only one MX initialization, as the MTL initialize first, the BTL will always get a wrongly initialized MX library (which can generate some performance problems). What Brian describe is the best compromise we manage to find few months ago. If you want to get the MX CM to run, you will have to clearly specify on the command line --mca pml cm. All other MTL will have the behavior described on the README. george. On Jan 13, 2009, at 20:18 , Brian Barrett wrote: On Jan 13, 2009, at 5:48 PM, Patrick Geoffray wrote: Jeff Squyres wrote: Gaah! I specifically asked Patrick and George about this and they said that the README text was fine. Grr... When I looked at that time, I vaguely remember that _both_ PMLs were initialized but CM was eventually used because it was the last one. It looked broken, but it worked in the end (MTL was used with CM PML). I don't know if that behavior changed since. I just tested 1.3rc4 with MX and it uses the btl by default. 
The reason is the cm init lowers the priority to 1 unless the MTL that loaded is psm, in which case it stays at the higher default of 30. It's a fairly easy fix, I think. But the last time this was discussed people in the group had objections to using the MTL by default with MX. Brian ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] 1.3 PML default choice
On Jan 13, 2009, at 5:48 PM, Patrick Geoffray wrote: Jeff Squyres wrote: Gaah! I specifically asked Patrick and George about this and they said that the README text was fine. Grr... When I looked at that time, I vaguely remember that _both_ PMLs were initialized but CM was eventually used because it was the last one. It looked broken, but it worked in the end (MTL was used with CM PML). I don't know if that behavior changed since. I just tested 1.3rc4 with MX and it uses the btl by default. The reason is the cm init lowers the priority to 1 unless the MTL that loaded is psm, in which case it stays at the higher default of 30. It's a fairly easy fix, I think. But the last time this was discussed people in the group had objections to using the MTL by default with MX. Brian
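The priority behaviour described in this thread can be sketched roughly as follows. This is only an illustration, not the actual cm PML source: the function name and the constant names are hypothetical, and the only facts taken from the thread are the two values (30 and 1) and the PSM special case.

```c
#include <string.h>

/* Hypothetical constants: only the values 30 and 1 come from the thread. */
enum { SKETCH_CM_DEFAULT_PRIORITY = 30, SKETCH_CM_LOWERED_PRIORITY = 1 };

/* Return the priority the cm PML would advertise for the MTL that loaded.
 * Per the thread, the priority stays at the default only for psm; for any
 * other MTL (MX, Portals, ...) it is lowered, so ob1 wins unless the user
 * passes "--mca pml cm" explicitly. */
int sketch_cm_priority_for_mtl(const char *mtl_name)
{
    if (mtl_name != NULL && strcmp(mtl_name, "psm") == 0) {
        return SKETCH_CM_DEFAULT_PRIORITY;
    }
    return SKETCH_CM_LOWERED_PRIORITY;
}
```

With this logic an MX or Portals MTL ends up at priority 1, which matches the observation that 1.3rc4 with MX picks the BTL path by default.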
Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
Sorry, I really won't have time to look until after Christmas. I'll put it on the to-do list, but that's as soon as it has a prayer of reaching the top. Brian On Dec 13, 2008, at 1:02 PM, George Bosilca wrote: Brian, I found a second problem with rebuilding the datatype on the remote. Originally, the displacement were wrongly computed. This is now fixed. However, the data at the end of the fence is still not correct on the remote. I can confirm that the packed message contains only 0 instead of the real value, but I couldn't figure out how these 0 got there. The pack function works correctly for the MPI_Send function, I don't see any reason not to do the same for the MPI_Put. As you're the one- sided guy in ompi, can you take a look at the MPI_Put to see why the data is incorrect? george. On Dec 11, 2008, at 19:14 , Brian Barrett wrote: I think that's a reasonable solution. However, the words "not it" come to mind. Sorry, but I have way too much on my plate this month. By the way, in case no one noticed, I had e-mailed my findings to devel. Someone might want to reply to Dorian's e-mail on users. Brian On Dec 11, 2008, at 2:31 PM, George Bosilca wrote: Brian, You're right, the datatype is being too cautious with the boundaries when detecting the overlap. There is no good solution to detect the overlap except parsing the whole memory layout to check the status of every predefined type. As one can imagine this is a very expensive operation. This is reason I preferred to use the true extent and the size of the data to try to detect the overlap. This approach is a lot faster, but has a poor accuracy. The best solution I can think of in short term is to remove completely the overlap check. This will have absolutely no impact on the way we pack the data, but can lead to unexpected results when we unpack and the data overlap. But I guess this can be considered as a user error, as the MPI standard clearly state that the result of such an operation is ... unexpected. 
george. On Dec 10, 2008, at 22:20 , Brian Barrett wrote: Hi all - I looked into this, and it appears to be datatype related. If the displacements are set t o 3, 2, 1, 0, there the datatype will fail the type checks for one-sided because is_overlapped() returns 1 for the datatype. My reading of the standard seems to indicate this should not be. I haven't looked into the problems with displacement set to 0, 1, 2, 3, but I'm guessing it has something to do with the reverse problem. This looks like a datatype issue, so it's out of my realm of expertise. Can someone else take a look? Brian Begin forwarded message: From: doriankrause <doriankra...@web.de> Date: December 10, 2008 4:07:55 PM MST To: us...@open-mpi.org Subject: [OMPI users] Onesided + derived datatypes Reply-To: Open MPI Users <us...@open-mpi.org> Hi List, I have a MPI program which uses one sided communication with derived datatypes (MPI_Type_create_indexed_block). I developed the code with MPICH2 and unfortunately didn't thought about trying it out with OpenMPI. Now that I'm "porting" the Application to OpenMPI I'm facing some problems. On the most machines I get an SIGSEGV in MPI_Win_fence, sometimes an invalid datatype shows up. I ran the program in Valgrind and didn't get anything valuable. Since I can't see a reason for this problem (at least if I understand the standard correctly), I wrote the attached testprogram. Here are my experiences: * If I compile without ONESIDED defined, everything works and V1 and V2 give the same results * If I compile with ONESIDED and V2 defined (MPI_Type_contiguous) it works. * ONESIDED + V1 + O2: No errors but obviously nothing is send? (Am I in assuming that V1+O2 and V2 should be equivalent?) 
* ONESIDED + V1 + O1: [m02:03115] *** An error occurred in MPI_Put [m02:03115] *** on win [m02:03115] *** MPI_ERR_TYPE: invalid datatype [m02:03115] *** MPI_ERRORS_ARE_FATAL (goodbye) I didn't get a segfault as in the "real life example" but if ompitest.cc is correct it means that OpenMPI is buggy when it comes to onesided communication and (some) derived datatypes, so that it is probably not of problem in my code. I'm using OpenMPI-1.2.8 with the newest gcc 4.3.2 but the same behaviour can be be seen with gcc-3.3.1 and intel 10.1. Please correct me if ompitest.cc contains errors. Otherwise I would be glad to hear how I should report these problems to the develepors (if they don't read this). Thanks + best regards Dorian ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list
Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
I think that's a reasonable solution. However, the words "not it" come to mind. Sorry, but I have way too much on my plate this month. By the way, in case no one noticed, I had e-mailed my findings to devel. Someone might want to reply to Dorian's e-mail on users. Brian On Dec 11, 2008, at 2:31 PM, George Bosilca wrote: Brian, You're right, the datatype is being too cautious with the boundaries when detecting the overlap. There is no good solution to detect the overlap except parsing the whole memory layout to check the status of every predefined type. As one can imagine this is a very expensive operation. This is reason I preferred to use the true extent and the size of the data to try to detect the overlap. This approach is a lot faster, but has a poor accuracy. The best solution I can think of in short term is to remove completely the overlap check. This will have absolutely no impact on the way we pack the data, but can lead to unexpected results when we unpack and the data overlap. But I guess this can be considered as a user error, as the MPI standard clearly state that the result of such an operation is ... unexpected. george. On Dec 10, 2008, at 22:20 , Brian Barrett wrote: Hi all - I looked into this, and it appears to be datatype related. If the displacements are set t o 3, 2, 1, 0, there the datatype will fail the type checks for one-sided because is_overlapped() returns 1 for the datatype. My reading of the standard seems to indicate this should not be. I haven't looked into the problems with displacement set to 0, 1, 2, 3, but I'm guessing it has something to do with the reverse problem. This looks like a datatype issue, so it's out of my realm of expertise. Can someone else take a look? 
Brian Begin forwarded message: From: doriankrause <doriankra...@web.de> Date: December 10, 2008 4:07:55 PM MST To: us...@open-mpi.org Subject: [OMPI users] Onesided + derived datatypes Reply-To: Open MPI Users <us...@open-mpi.org> Hi List, I have a MPI program which uses one sided communication with derived datatypes (MPI_Type_create_indexed_block). I developed the code with MPICH2 and unfortunately didn't thought about trying it out with OpenMPI. Now that I'm "porting" the Application to OpenMPI I'm facing some problems. On the most machines I get an SIGSEGV in MPI_Win_fence, sometimes an invalid datatype shows up. I ran the program in Valgrind and didn't get anything valuable. Since I can't see a reason for this problem (at least if I understand the standard correctly), I wrote the attached testprogram. Here are my experiences: * If I compile without ONESIDED defined, everything works and V1 and V2 give the same results * If I compile with ONESIDED and V2 defined (MPI_Type_contiguous) it works. * ONESIDED + V1 + O2: No errors but obviously nothing is send? (Am I in assuming that V1+O2 and V2 should be equivalent?) * ONESIDED + V1 + O1: [m02:03115] *** An error occurred in MPI_Put [m02:03115] *** on win [m02:03115] *** MPI_ERR_TYPE: invalid datatype [m02:03115] *** MPI_ERRORS_ARE_FATAL (goodbye) I didn't get a segfault as in the "real life example" but if ompitest.cc is correct it means that OpenMPI is buggy when it comes to onesided communication and (some) derived datatypes, so that it is probably not of problem in my code. I'm using OpenMPI-1.2.8 with the newest gcc 4.3.2 but the same behaviour can be be seen with gcc-3.3.1 and intel 10.1. Please correct me if ompitest.cc contains errors. Otherwise I would be glad to hear how I should report these problems to the develepors (if they don't read this). 
Thanks + best regards Dorian ___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
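For the specific case in this thread (an indexed-block type with equal block lengths), an exact overlap test is cheap. The sketch below is not Open MPI's is_overlapped() and does not handle general datatypes, where, as George notes, an exact answer means walking the whole memory layout. The function name is hypothetical, and displacements are assumed to be in units of the old type's extent, as in MPI_Type_create_indexed_block.

```c
#include <stdlib.h>

/* qsort comparator for ints, written to avoid overflow on subtraction. */
static int sketch_cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Exact overlap test for `count` blocks of `blocklen` elements each,
 * placed at the given displacements.
 * Returns 1 if two blocks overlap, 0 if none do, -1 if allocation fails. */
int sketch_indexed_block_overlaps(int count, int blocklen, const int *displs)
{
    if (count < 2 || blocklen <= 0) {
        return 0;
    }
    int *sorted = malloc((size_t)count * sizeof *sorted);
    if (sorted == NULL) {
        return -1;
    }
    for (int i = 0; i < count; i++) {
        sorted[i] = displs[i];
    }
    qsort(sorted, (size_t)count, sizeof *sorted, sketch_cmp_int);

    int overlap = 0;
    for (int i = 1; i < count; i++) {
        /* Neighbouring blocks overlap if the gap is shorter than a block. */
        if (sorted[i] - sorted[i - 1] < blocklen) {
            overlap = 1;
            break;
        }
    }
    free(sorted);
    return overlap;
}
```

Under this test, displacements 3, 2, 1, 0 with block length 1 are reported as non-overlapping, which is consistent with Brian's reading that the type should pass the one-sided checks.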
[OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
Hi all -

I looked into this, and it appears to be datatype related. If the displacements are set to 3, 2, 1, 0, then the datatype will fail the type checks for one-sided because is_overlapped() returns 1 for the datatype. My reading of the standard seems to indicate this should not be. I haven't looked into the problems with displacement set to 0, 1, 2, 3, but I'm guessing it has something to do with the reverse problem.

This looks like a datatype issue, so it's out of my realm of expertise. Can someone else take a look?

Brian

Begin forwarded message:

From: doriankrause <doriankra...@web.de>
Date: December 10, 2008 4:07:55 PM MST
To: us...@open-mpi.org
Subject: [OMPI users] Onesided + derived datatypes
Reply-To: Open MPI Users <us...@open-mpi.org>

Hi List,

I have an MPI program which uses one-sided communication with derived datatypes (MPI_Type_create_indexed_block). I developed the code with MPICH2 and unfortunately didn't think about trying it out with OpenMPI. Now that I'm "porting" the application to OpenMPI I'm facing some problems. On most machines I get a SIGSEGV in MPI_Win_fence, sometimes an invalid datatype shows up. I ran the program in Valgrind and didn't get anything valuable.

Since I can't see a reason for this problem (at least if I understand the standard correctly), I wrote the attached test program. Here are my experiences:

* If I compile without ONESIDED defined, everything works and V1 and V2 give the same results
* If I compile with ONESIDED and V2 defined (MPI_Type_contiguous) it works.
* ONESIDED + V1 + O2: No errors but obviously nothing is sent? (Am I in assuming that V1+O2 and V2 should be equivalent?)
* ONESIDED + V1 + O1:
[m02:03115] *** An error occurred in MPI_Put
[m02:03115] *** on win
[m02:03115] *** MPI_ERR_TYPE: invalid datatype
[m02:03115] *** MPI_ERRORS_ARE_FATAL (goodbye)

I didn't get a segfault as in the "real life example" but if ompitest.cc is correct it means that OpenMPI is buggy when it comes to one-sided communication and (some) derived datatypes, so it is probably not a problem in my code.

I'm using OpenMPI-1.2.8 with the newest gcc 4.3.2 but the same behaviour can be seen with gcc-3.3.1 and intel 10.1.

Please correct me if ompitest.cc contains errors. Otherwise I would be glad to hear how I should report these problems to the developers (if they don't read this).

Thanks + best regards
Dorian

ompitest.tar.gz Description: GNU Zip compressed data
___ users mailing list us...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI devel] memcpy MCA framework
I obviously won't be in Dublin (I'll be in a fishing boat in the middle of nowhere Canada -- much better), so I'm going to chime in now. The m4 part actually isn't too bad and is pretty simple. I'm not sure other than looking at some variables set by ompi_config_asm that there is much to check. The hard parts are dealing with the finer grained instruction set requirements. On x86 in particular, many of the operations in the memcpy are part of SSE, SSE2, or SSE3. Currently, we don't have any finer concept of a processor than x86 and most compilers target an instruction set that will run on anything considered 686, which is almost everything out there. We'd have to decide how to handle instruction streams which are no longer going to work on every chip. Since we know we have a number of users with heterogeneous x86 clusters, this is something to think about. Brian On Aug 17, 2008, at 7:57 AM, Jeff Squyres wrote: Let's talk about this in Dublin. I can probably help with the m4 magic, but I need to understand exactly what needs to be done first. On Aug 16, 2008, at 11:51 AM, Terry Dontje wrote: George Bosilca wrote: The intent of the memcpy framework is to allow a selection between several memcpy at runtime. Of course, there will be a preselection at compile time, but all versions that can compile on a given architecture will be benchmarked at runtime and the best one will be selected. There is a file with several versions of memcpy for x86 (32 and 64) somewhere around (I should have one if interested), that can be used as a starting point. Ok, I guess I need to look at this code. I wonder if there may be cases for Sun's machines in which this benchmark could end up picking the wrong memcpy? The only thing we need is a volunteer to build the m4 magic. Figuring out what we can compile if kind of tricky, as some of the functions are in assembly, some others in C, and some others a mixture (the MMX headers). Isn't the atomic code very similar? 
If I get to this point before anyone else I probably will volunteer. --td george. On Aug 16, 2008, at 3:19 PM, Terry Dontje wrote: Hi Tim, Thanks for bringing the below up and asking for a redirection to the devel list. I think looking/using the MCA memcpy framework would be a good thing to do and maybe we can work on this together once I get out from under some commitments. However, some of the challenges that originally scared me away from looking at the memcpy MCA is whether we really want all the OMPI memcpy's to be replaced or just specific ones. Also, I was concerned on trying to figure out which version of memcpy I should be using. I believe currently things are done such that you get one version based on which system you compile on. For Sun there may be several different SPARC platforms that would need to use different memcpy code but we would like to just ship one set of bits. Not saying the above not doable under the memcpy MCA framework just that it somewhat scared me away from thinking about it at first glance. --td Date: Fri, 15 Aug 2008 12:08:18 -0400 From: "Tim Mattox"Subject: Re: [OMPI users] SM btl slows down bandwidth? To: "Open MPI Users" Message-ID: Content-Type: text/plain; charset=ISO-8859-1 Hi Terry (and others), I have previously explored this some on Linux/X86-64 and concluded that Open MPI needs to supply it's own memcpy routine to get good sm performance, since the memcpy supplied by glibc is not even close to optimal. We have an unused MCA framework already set up to supply an opal_memcpy. AFAIK, George and Brian did the original work to set up that framework. It has been on my to-do list for awhile to start implementing opal_memcpy components for the architectures I have access to, and to modify OMPI to actually use opal_memcpy where ti makes sense. Terry, I presume what you suggest could be dealt with similarly when we are running/building on SPARC. Any followup discussion on this should probably happen on the developer mailing list. 
On Thu, Aug 14, 2008 at 12:19 PM, Terry Dontje wrote: > Interestingly enough on the SPARC platform the Solaris memcpy's actually use > non-temporal stores for copies >= 64KB. By default some of the mca > parameters to the sm BTL stop at 32KB. I've done experimentations of > bumping the sm segment sizes to above 64K and seen incredible speedup on our > M9000 platforms. I am looking for some nice way to integrate a memcpy that > lowers this boundary to 32KB or lower into Open MPI. > I have not looked into whether Solaris x86/x64 memcpy's use the non-temporal > stores or not. > > --td >> >> Message: 1 >> Date: Thu, 14 Aug 2008 09:28:59 -0400 >> From: Jeff Squyres >> Subject: Re: [OMPI users] SM btl slows
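The runtime selection George describes (benchmark every memcpy variant that compiled, keep the fastest) could look roughly like the sketch below. It is not the opal memcpy framework: all names are hypothetical, the two candidates are stand-ins for real tuned variants, and timing with clock() over one message size is a deliberate simplification.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* All candidates share the memcpy signature. */
typedef void *(*sketch_copy_fn)(void *dst, const void *src, size_t n);

/* Candidate 1: whatever the C library provides. */
void *sketch_libc_copy(void *dst, const void *src, size_t n)
{
    return memcpy(dst, src, n);
}

/* Candidate 2: a naive byte loop, standing in for a hand-tuned variant. */
void *sketch_bytewise_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];
    }
    return dst;
}

/* CPU time consumed by `reps` copies of `n` bytes with one candidate. */
static clock_t sketch_time_copy(sketch_copy_fn fn, void *dst,
                                const void *src, size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; r++) {
        fn(dst, src, n);
    }
    return clock() - start;
}

/* Benchmark every candidate and return the fastest; ties keep the
 * earlier entry. Returns NULL if the candidate list is empty. */
sketch_copy_fn sketch_select_fastest_copy(const sketch_copy_fn *candidates,
                                          int ncand, void *dst,
                                          const void *src, size_t n, int reps)
{
    sketch_copy_fn best = NULL;
    clock_t best_time = 0;
    for (int i = 0; i < ncand; i++) {
        clock_t t = sketch_time_copy(candidates[i], dst, src, n, reps);
        if (best == NULL || t < best_time) {
            best = candidates[i];
            best_time = t;
        }
    }
    return best;
}
```

Terry's worry applies directly to a scheme like this: a benchmark at a single size can pick the wrong routine for other sizes (for example around the 32KB and 64KB boundaries he mentions), so a real component would probably need to benchmark several sizes.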
Re: [OMPI devel] if btl->add_procs() fails...?
On Aug 4, 2008, at 9:40 AM, Jeff Squyres wrote:

On Aug 2, 2008, at 2:34 PM, Brian Barrett wrote:

I am curious how all of the above affects client/server or spawned jobs. If you finalize a BTL and then do a connect to a process that would use that BTL, would it reinitialize itself?

To deal with all the dynamics issues, I wouldn't finalize the BTL. The BML should handle the progress stuff, just as if the add_procs succeeded but returned no active peers. But I'd guess that's part of the bit that doesn't work today. I would further suspect that a BTL will need to have a working progress function in the face of add_procs failures to cope with all the dynamics options. I'm travelling this weekend, so I can't verify any of this at the moment.

This seems a little different than the rest of the code base -- we're talking about having the BTL return an error but have the upper level not treat it as a fatal error. I think we actually have a few different situations ("fail" means "not returning OMPI_SUCCESS"):

1. btl component init fails (only during MPI_INIT): the API supports no notion of failure -- it either returns modules or not (i.e., valid pointers or NULL). If NULL is returned, the component is ignored and unloaded.
2. btl add_procs during MPI_INIT fails: this is under debate
3. btl add_procs during MPI-2 dynamics fails: this is under debate

For #2 and #3, I suspect that only the BTL knows if it can continue or not. For example, a failure in #3 may cause the entire BTL to be hosed such that it can no longer communicate with procs that it previously successfully added (e.g., in MPI_INIT). So we really need add_procs to be able to return multiple things:

A. OMPI_SUCCESS / all was good
B. a non-fatal error occurred such that this BTL cannot communicate with the desired peers, but the upper level PML can continue
C. a fatal error has occurred such that the upper level should abort (or, more specifically, do whatever the error manager says)

I think that for B in both #2 and #3, we can just have the BTL set all the reachability bits to 0 and return OMPI_SUCCESS. But for C, the BTL should return != OMPI_SUCCESS. The PML should treat it as a fatal error and therefore call the error manager. I think that this is in-line with Brian's original comments, right?

I suppose, but that's a pain when you just want to say "I don't support calling add_procs a second time" :). But I'm not going to fix all the BTLs to make that work right, so I suppose in the end I really don't have a strong opinion.

Brian
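Jeff's three outcomes (A, B, C) can be expressed with just the return code plus the reachability bits. The sketch below shows how an upper layer might interpret them. It is not BML/PML code: the enum, the function name, and the byte-per-peer reachability array are all hypothetical stand-ins.

```c
#include <stddef.h>

/* Hypothetical stand-ins for OMPI_SUCCESS and a generic error code. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

/* What the upper layer does with a BTL after add_procs returns. */
typedef enum {
    SKETCH_USE_BTL,    /* case A: at least one peer is reachable      */
    SKETCH_IGNORE_BTL, /* case B: success, but no reachability bits   */
    SKETCH_ABORT       /* case C: error code, call the error manager  */
} sketch_upper_action;

/* reachable[i] != 0 means the BTL claims it can reach peer i. */
sketch_upper_action sketch_handle_add_procs(int rc,
                                            const unsigned char *reachable,
                                            size_t nprocs)
{
    if (rc != SKETCH_SUCCESS) {
        return SKETCH_ABORT;
    }
    for (size_t i = 0; i < nprocs; i++) {
        if (reachable[i]) {
            return SKETCH_USE_BTL;
        }
    }
    return SKETCH_IGNORE_BTL;
}
```

Brian's objection maps onto the last branch: a BTL that merely cannot handle a second add_procs call has to clear every bit and still report success, rather than returning an error that says so.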
Re: [OMPI devel] if btl->add_procs() fails...?
My thought is that if add_procs fails, then that BTL should be removed (as if init failed) and things should continue on. If that BTL was the only way to reach another process, we'll catch that later and abort. There are always going to be errors that can't be detected until the device is actually used, so I think that add_procs errors should be treated the same as init errors. The error_cb is a red herring, as that's supposed to be used in situations where an error can't directly be returned to the upper layers (like the progress function). In this case, we can directly return an error, so we should do so (and I believe we do, it's the BML/PML that's the problem). Brian On Aug 1, 2008, at 8:03 PM, Jeff Squyres wrote: I wasted a bunch of time today debugging a scenario where openib- >add_procs() was (legitimately) failing during MPI_INIT. Specifically: an openib BTL module had successfully been initialized, but then was failing during add_procs(). I say "legitimately" failing because something external was causing add_procs to fail (i.e., a misconfiguration on my cluster). By "fail", I mean add_procs() returned != OMPI_SUCCESS. The problem is that OMPI does not handle this situation gracefully; every MPI process dumped core. My question is: what exactly should happen when BTL add_procs() fails? Is the BTL expected to recover? What if the BTL has no procs as a result of this failure; should the PML (or BML) remove it from progress loops? Or should the BTL be able to handle if progress is called on its component? (which seems kinda wasteful) Or should the failure of add_procs() be a fatal error? If so, what should the BTL do? The PML's error_cb has not yet been registered, and returning != OMPI_SUCCESS does not [currently] cause the PML to abort. This fact seems to indicate to me that the PML/BTL designers envisioned that the MPI process should be able to continue. 
But I'm not sure that I agree with that assessment: we have a successfully initialized BTL module, but an error occurred during add_procs(). Shouldn't we gracefully abort? My $0.02: - if the BTL returns != OMPI_SUCCESS from add_procs(), the PML should gracefully abort. - if a BTL fails add_procs() in a non-fatal way, it can set all reachable bits to 0 and return OMPI_SUCCESS. The PML will therefore effectively ignore it. Comments? I'd like to fix the openib btl's add_procs() one way or another for v1.3. -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Need v1.3 RM ruling (was: Help on building openmpi with "-Wl, --as-needed -Wl, --no-undefined")
George -

When I looked at the same problem in LAM, I couldn't get the dependencies right between libraries when only one makefile was used. It worked up until I would do a parallel build, then would die because the libraries weren't ready at the right time. There's probably a way, but I ended up with Jeff's approach.

Brian

-- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss.

On Jul 24, 2008, at 2:23, George Bosilca <bosi...@eecs.utk.edu> wrote:

I tend to agree with Brian's comments, I would like to see this pushed into the 1.3 branch asap. However, I'm concerned with the impact on the code base of the required modifications as described on the TRAC ticket and on the email thread.

I wonder if we cannot use the same technique that we use for improving the build time, i.e. getting information from the Makefile.am in the subdirs and adding it in the upper level Makefile.am. As an example for the F77 build tree:

- if we create the following directory structure:
  -> ompi
    -> mpi
      -> f77
        -> global (this is new and will contain the 5 files actually in the f77 base)
        -> profiling
- then we include in the ompi/Makefile.am: include mpi/f77/global/Makefile.am
- and in the mpi/f77/global/Makefile.am we add the 5 C files in the SOURCES.
- the compiling of the f77 bindings and profiling information will then depend on the libmpi, as long as we enforce the building of the f77 library after the libmpi.so.

With this approach, all files related to f77 will stay in the f77 directory (and the same will apply for cxx and f90), and the required modifications to the makefiles are minimal. Auto* gurus, would such a solution work?

Thanks, george.

On Jul 23, 2008, at 6:52 PM, Brian Barrett wrote:

First, sorry about the previous message - I'm incapable of using my e-mail apparently.
Based on discusions with people when this came up for LAM, it sounds like this will become common for the next set of major releases from the distros. The feature is fairly new to GNU ld, but has some nice features for the OS, which I don't totally understand. Because this problem will only become more common during the lifespan of 1.3.x , it makes sense to put this in v1.3, in my opinion. Brian -- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. On Jul 23, 2008, at 9:32, Jeff Squyres <jsquy...@cisco.com> wrote: Release managers: I have created ticket 1409 for this issue. I need a ruling: do you want this fixed for v1.3? https://svn.open-mpi.org/trac/ompi/ticket/1409 PRO: It's not too heinous to fix, but it does require moving some code around. CON: This is the first time anyone has ever run into this issue. ???: I don't know if this is a trend where distros will start wanting to compile with -Wl,--no-undefined. On Jul 23, 2008, at 10:15 AM, Jeff Squyres wrote: On Jul 23, 2008, at 10:08 AM, Ralf Wildenhues wrote: Is the attached patch what you're talking about? If so, I'll commit to trunk, v1.2, and v1.3. Can you verify that it work with a pristine build? The dependencies as such look sane to me, also the cruft removal, but I fail to see how your directory ordering can work: You're right; I tested only in an already-built tree. I also didn't run "make install" to an empty tree, which also shows the problem. Let me twonk around with this a bit... -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Need v1.3 RM ruling (was: Help on building openmpi with "-Wl, --as-needed -Wl, --no-undefined")
First, sorry about the previous message - I'm incapable of using my e- mail apparently. Based on discusions with people when this came up for LAM, it sounds like this will become common for the next set of major releases from the distros. The feature is fairly new to GNU ld, but has some nice features for the OS, which I don't totally understand. Because this problem will only become more common during the lifespan of 1.3.x , it makes sense to put this in v1.3, in my opinion. Brian -- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. On Jul 23, 2008, at 9:32, Jeff Squyres <jsquy...@cisco.com> wrote: Release managers: I have created ticket 1409 for this issue. I need a ruling: do you want this fixed for v1.3? https://svn.open-mpi.org/trac/ompi/ticket/1409 PRO: It's not too heinous to fix, but it does require moving some code around. CON: This is the first time anyone has ever run into this issue. ???: I don't know if this is a trend where distros will start wanting to compile with -Wl,--no-undefined. On Jul 23, 2008, at 10:15 AM, Jeff Squyres wrote: On Jul 23, 2008, at 10:08 AM, Ralf Wildenhues wrote: Is the attached patch what you're talking about? If so, I'll commit to trunk, v1.2, and v1.3. Can you verify that it work with a pristine build? The dependencies as such look sane to me, also the cruft removal, but I fail to see how your directory ordering can work: You're right; I tested only in an already-built tree. I also didn't run "make install" to an empty tree, which also shows the problem. Let me twonk around with this a bit... -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Need v1.3 RM ruling (was: Help on building openmpi with "-Wl, --as-needed -Wl, --no-undefined")
Brian -- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. On Jul 23, 2008, at 9:32, Jeff Squyres <jsquy...@cisco.com> wrote: Release managers: I have created ticket 1409 for this issue. I need a ruling: do you want this fixed for v1.3? https://svn.open-mpi.org/trac/ompi/ticket/1409 PRO: It's not too heinous to fix, but it does require moving some code around. CON: This is the first time anyone has ever run into this issue. ???: I don't know if this is a trend where distros will start wanting to compile with -Wl,--no-undefined. On Jul 23, 2008, at 10:15 AM, Jeff Squyres wrote: On Jul 23, 2008, at 10:08 AM, Ralf Wildenhues wrote: Is the attached patch what you're talking about? If so, I'll commit to trunk, v1.2, and v1.3. Can you verify that it work with a pristine build? The dependencies as such look sane to me, also the cruft removal, but I fail to see how your directory ordering can work: You're right; I tested only in an already-built tree. I also didn't run "make install" to an empty tree, which also shows the problem. Let me twonk around with this a bit... -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] Two large patches in trunk
Hi all -

I just pushed two rather large patches into the trunk for v1.3 (with George and Brad's blessing).

First, the ptmalloc2 changes are in the trunk. So going forward, ptmalloc2 will not be linked into the application binary by default. You will have to set libs to -lopenmpi-malloc to get ptmalloc2. However, leave_pinned will turn on mallopt by default now, so for most users there will be no visible change. There is also a configure flag if users really want the old behavior. Nothing substantial has changed on this front since my more detailed e-mail last week.

Second, Open MPI now provides the option of using Perl-based wrapper compilers instead of the traditional C-based ones. The Perl-based ones do not have nearly as much functionality as the C-based ones, lacking multilib, installdirs, and multi-project (i.e., opalcc/ortecc) support (in addition to not supporting many of the -showme options). The C versions are still the default and are intended to remain that way for the foreseeable future. The Perl compilers are intended to be used for cross-compile installs, which seems to be the bulk of my use of Open MPI these days. Specifying --enable-script-wrapper-compilers will disable the C-based compilers and enable the Perl-based compilers. Finally, --enable-script-wrapper-compilers combined with --disable-binaries will still result in the Perl-based wrapper compilers being installed.

As always, let me know what I broke.

Brian

-- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
Re: [OMPI devel] Memory hooks change testing
Did anyone get a chance to test (or think about testing) this? I'd like to commit the changes on Friday evening, if I haven't heard any negative feedback.

Brian

On Jun 9, 2008, at 8:56 PM, Brian Barrett wrote:

Hi all -

Per the RFC I sent out last week, I've implemented a revised behavior of the memory hooks for high speed networks. It's a bit different than the RFC proposed, but still very minor and fairly straightforward.

The default is to build ptmalloc2 support, but as an almost complete standalone library. If the user wants to use ptmalloc2, he only has to add -lopenmpi-malloc to his link line. Even when standalone and openmpi-malloc is not linked in, we'll still intercept munmap as it's needed for mallopt (below) and we've never had any trouble with that part of ptmalloc2 (it's easy to intercept).

As a *CHANGE* in behavior, if leave_pinned support is turned on and there's no ptmalloc2 support, we will automatically enable mallopt. As a *CHANGE* in behavior, if the user disables mallopt or mallopt is not available and leave pinned is requested, we'll abort. I think these both make sense and are closest to expected behavior, but wanted to point them out. It is possible for the user to disable mallopt and enable leave_pinned, but that will *only* work if there is some other mechanism for intercepting free (basically, it allows a way to ensure you're using ptmalloc2 instead of mallopt).

There is also a new memory component, mallopt, which only intercepts munmap and exists only to allow users to enable mallopt while not building the ptmalloc2 component at all. Previously, our mallopt support was lacking in that it didn't cover the case where users explicitly called munmap in their applications. Now, it does.

The changes are fairly small and can be seen/tested in the HG repository bwb/mem-hooks, URL below. I think this would be a good thing to push to 1.3, as it will solve an ongoing problem on Linux (basically, users getting screwed by our ptmalloc2 implementation).
http://www.open-mpi.org/hg/hgwebdir.cgi/bwb/mem-hooks/ Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
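The selection rules described in the message above can be sketched as a small decision function. This is only an illustration of the stated behavior: the enum and function names are hypothetical and are not taken from the bwb/mem-hooks tree, and the real code may order or name things differently.

```c
#include <stdbool.h>

/* Hypothetical outcomes of the memory-hook selection (illustrative names). */
typedef enum {
    MEM_HOOKS_NONE,      /* leave_pinned not requested; nothing is needed   */
    MEM_HOOKS_FREE_HOOK, /* free() is intercepted, e.g. via -lopenmpi-malloc */
    MEM_HOOKS_MALLOPT,   /* fall back to mallopt plus the munmap intercept  */
    MEM_HOOKS_ABORT      /* leave_pinned requested but nothing supports it  */
} mem_hooks_choice_t;

/*
 * Assumed inputs:
 *   leave_pinned     - the user asked for leave_pinned behavior
 *   free_intercepted - some mechanism (such as ptmalloc2) intercepts free
 *   mallopt_usable   - mallopt is available and was not disabled by the user
 */
mem_hooks_choice_t choose_mem_hooks(bool leave_pinned,
                                    bool free_intercepted,
                                    bool mallopt_usable)
{
    if (!leave_pinned) {
        return MEM_HOOKS_NONE;
    }
    if (free_intercepted) {
        return MEM_HOOKS_FREE_HOOK;
    }
    if (mallopt_usable) {
        return MEM_HOOKS_MALLOPT;  /* the automatic fallback */
    }
    return MEM_HOOKS_ABORT;        /* no way to honor leave_pinned safely */
}
```

Under these assumptions, leave_pinned without ptmalloc2 yields the mallopt fallback, and leave_pinned with neither mechanism yields an abort, matching the two behavior changes called out in the message.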
Re: [OMPI devel] Notes from mem hooks call today
On May 28, 2008, at 5:09 PM, Roland Dreier wrote: I think Patrick's point is that it's not too much more expensive to do the syscall on Linux vs just doing the cache lookup, particularly in the context of a long message. And it means that upper layer protocols like MPI don't have to deal with caches (and since MPI implementors hate registration caches only slightly less than we hate MPI_CANCEL, that will make us happy). Stick in a separate library then? I don't think we want the complexity in the kernel -- I personally would argue against merging it upstream; and given that the userspace solution is actually faster, it becomes pretty hard to justify. If someone would like to pull registration cache into OFED, that would be great. But something tells me they won't want to. It's a pain, it screws up users, and it only works about 50% of the time. It's a support issue -- pushing it in a separate library doesn't help anyone unless someone's willing to handle the support. I sure as heck don't want to do the support anymore, particularly since OFED is the *ONLY* major software stack that requires such evil hacks. MX handles it at the lower layer. Portals is specified such that the hardware and/or Portals library must handle it (by specifying semantics that require registration per message). Quadrics (with tports) handles it in a combination of the kernel and library. TCP doesn't require pinning and/or registration. Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI devel] undefined references forrdma_get_peer_addr & rdma_get_local_addr
hat others are aware) >>> Crud. Can you send me your config.log? I don't know why it's able >>> to find rdma_get_peer_addr() in configure, but then later not able >>> to find it during the build - I'd like to see what happened >>> during configure. >>> On May 2, 2008, at 7:09 PM, Pak Lui wrote: >>>> Hi Jeff, >>>> >>>> It seems that the cpc3 merge causes my Ranger build to break. I >>>> believe it is using OFED 1.2 but I don't know how to check. It >>>> passes the ompi_check_openib.m4 that you added in for the >>>> rdma_get_peer_addr. Is there a missing #include for openib/ofed >>>> related somewhere? >>>> >>>> >>>> 1236 checking rdma/rdma_cma.h usability... yes >>>> 1237 checking rdma/rdma_cma.h presence... yes >>>> 1238 checking for rdma/rdma_cma.h... yes >>>> 1239 checking for rdma_create_id in -lrdmacm... yes >>>> 1240 checking for rdma_get_peer_addr... yes >>>> >>>> >>>> pgCC -DHAVE_CONFIG_H -I. -I../../../../ompi/tools/ompi_info - >>>> I../../../opal/include -I../../../orte/include -I../../../ompi/ >>>> include -I../../../opal/mca/paffinity/linux/plpa/src/libplpa - >>>> DOMPI_CONFIGURE_USER="\"paklui\"" - >>>> DOMPI_CONFIGURE_HOST="\"login4.ranger.tacc.utexas.edu\"" - >>>> DOMPI_CONFIGURE_DATE="\"Fri May 2 17:07:01 CDT 2008\"" - >>>> DOMPI_BUILD_USER="\"$USER\"" -DOMPI_BUILD_HOST="\"`hostname`\"" - >>>> DOMPI_BUILD_DATE="\"`date`\"" -DOMPI_BUILD_CFLAGS="\"-O -DNDEBUG >>>> \"" -DOMPI_BUILD_CPPFLAGS="\"-I../../../.. -I../../.. - >>>> I../../../../ opal/include -I../../../../orte/include - >>>> I../../../../ompi/include - D_REENTRANT\"" - >>>> DOMPI_BUILD_CXXFLAGS="\"-O -DNDEBUG \"" - >>>> DOMPI_BUILD_CXXCPPFLAGS="\"-I../../../.. -I../../.. 
- I../../../../ >>>> opal/include -I../../../../orte/include -I../../../../ompi/ >>>> include - D_REENTRANT\"" -DOMPI_BUILD_FFLAGS="\"\"" - >>>> DOMPI_BUILD_FCFLAGS="\"\"" -DOMPI_BUILD_LDFLAGS="\" \"" - >>>> DOMPI_BUILD_LIBS="\"-lnsl -lutil -lpthread\"" - >>>> DOMPI_CC_ABSOLUTE="\"/opt/apps/pgi/7.1/linux86-64/7.1-2/bin/pgcc >>>> \"" - DOMPI_CXX_ABSOLUTE="\"/opt/apps/pgi/7.1/linux86-64/7.1-2/ bin/ >>>> pgCC\"" -DOMPI_F77_ABSOLUTE="\"/opt/apps/pgi/7.1/ linux86-64/7.1-2/ >>>> bin/ pgf77\"" -DOMPI_F90_ABSOLUTE="\"/opt/apps/pgi/7.1/ >>>> linux86-64/7.1-2/ bin/pgf95\"" -DOMPI_F90_BUILD_SIZE="\"small \"" - >>>> I../../../.. - I../../.. -I../../../../opal/include - I../../../../ >>>> orte/include - I../../../../ompi/include -D_REENTRANT -O - >>>> DNDEBUG -c -o version.o ../../../../ompi/tools/ompi_info/ >>>> version.cc >>>> /bin/sh ../../../libtool --tag=CXX --mode=link pgCC -O - DNDEBUG >>>> - o ompi_info components.o ompi_info.o output.o param.o >>>> version.o ../../../ompi/libmpi.la -lnsl -lutil -lpthread >>>> libtool: link: pgCC -O -DNDEBUG -o .libs/ompi_info components.o >>>> ompi_info.o output.o param.o version.o ../../../ompi/.libs/ >>>> libmpi.so -L/opt/ofed/lib64 -libcm -lrdmacm -libverbs -lrt / share/ >>>> home/00951/paklui/ompi-trunk5/config-data1/orte/.libs/libopen- >>>> rte.so /share/home/00951/paklui/ompi-trunk5/config-data1/ >>>> opal/.libs/ libopen-pal.so -lnuma -ldl -lnsl -lutil -lpthread - >>>> Wl,--rpath -Wl,/ share/home/00951/paklui/ompi-trunk5/shared- >>>> install1/lib >>>> >>>> [1]Exit 2make install >& >>>> make.install.log.0 >>>> ../../../ompi/.libs/libmpi.so: undefined reference to >>>> `rdma_get_peer_addr' >>>> ../../../ompi/.libs/libmpi.so: undefined reference to >>>> `rdma_get_local_addr' >>>> make[2]: *** [ompi_info] Error 2 >>>> make[2]: Leaving directory `/share/home/00951/paklui/ompi-trunk5/ >>>> config-data1/ompi/tools/ompi_info' >>>> make[1]: *** [install-recursive] Error 1 >>>> make[1]: Leaving directory 
`/share/home/00951/paklui/ompi-trunk5/ >>>> config-data1/ompi' >>>> make: *** [install-recursive] Error 1 >>>> >>>> -- >>>> - Pak Lui >>>> pak@sun.com ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI devel] FreeBSD timer_base_open error?
On Mar 25, 2008, at 6:16 PM, Jeff Squyres wrote:

"linux" is the name of the component. It looks like opal/mca/timer/linux/timer_linux_component.c is doing some checks during component open() and returning an error if it can't be used (e.g., if it's not on Linux). The timer components are a little different than normal MCA frameworks; they *must* be compiled in libopen-pal statically, and there will only be one of them built. In this case, I'm guessing that linux was built simply because nothing else was selected to be built, but then its component_open() function failed because it didn't find /proc/cpuinfo.

This is actually incorrect. The linux component looks for /proc/cpuinfo and is built if it finds that file. There's a base component that's built if nothing else is found. The configure logic for the linux component is probably not the right thing to do -- it should probably be modified to check both that the file is readable and that we're actually on Linux (there are systems that call themselves "linux" but don't have a /proc/cpuinfo).

Brian -- Brian Barrett There is an art . . . to flying. The knack lies in learning how to throw yourself at the ground and miss. Douglas Adams, 'The Hitchhikers Guide to the Galaxy'
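The two-part check suggested above (really on Linux, and /proc/cpuinfo readable) could look roughly like the following. The message is about configure-time logic; this sketch shows the same test at run time in C purely for illustration, and both function names are made up.

```c
#include <stdbool.h>
#include <string.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Pure decision: both conditions from the message must hold. */
bool timer_linux_usable(const char *sysname, bool cpuinfo_readable)
{
    return sysname != NULL
        && strcmp(sysname, "Linux") == 0
        && cpuinfo_readable;
}

/* Probe the running system with uname(2) and access(2). */
bool timer_linux_probe(void)
{
    struct utsname u;

    if (uname(&u) != 0) {
        return false;  /* cannot even identify the OS */
    }
    return timer_linux_usable(u.sysname,
                              access("/proc/cpuinfo", R_OK) == 0);
}
```

Splitting the decision from the probe keeps the rule itself easy to check: a FreeBSD host with a Linux-compatible /proc mounted would fail the first condition, and a Linux-like system without /proc/cpuinfo would fail the second.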
Re: [OMPI devel] [RFC] Non-blocking collectives (LibNBC) merge to trunk
Let me start by reminding everyone that I have no vote, so this should probably be sent to /dev/null. I think Ralph raised some good points. I'd like to raise another. Does it make sense to bring LibNBC into the release at this point, given the current standardization process of non-blocking collectives? My feeling is no, based on the long term support costs. We had this problem with a function in LAM/MPI -- MPIL_SPAWN, I believe it was -- that was almost but not quite MPI_COMM_SPAWN. It was added to allow spawn before the standard was finished for dynamics. The problem is, it wasn't quite MPI_COMM_SPAWN, so we were now stuck with yet another function to support (in a touchy piece of code) for infinity and beyond. I worry that we'll have the same with LibNBC -- a piece of code that solves an immediate problem (no non-blocking collectives in MPI) but will become a long-term support anchor. Since this is something we'll be encouraging users to write code to, it's not like support for mvapi, where we can just deprecate it and users won't really notice. It's one thing to tell them to update their cluster software stack -- it's another to tell them to rewrite their applications. Just my $0.02, Brian -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI devel] SDP support for OPEN-MPI
if (ORTE_SUCCESS != rc && + (EAFNOSUPPORT != opal_socket_errno || + mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_CONNECT)) { + opal_output(0, + "mca_oob_tcp_init: unable to create IPv6 listen socket: %s\n", + opal_strerror(rc)); + } #endif + } if (mca_oob_tcp_component.tcp_debug >= OOB_TCP_DEBUG_INFO) { opal_output(0, "%s accepting connections via event library", ORTE_NAME_PRINT(orte_process_info.my_name)); Index: orte/mca/oob/tcp/oob_tcp.h === --- orte/mca/oob/tcp/oob_tcp.h (revision 17027) +++ orte/mca/oob/tcp/oob_tcp.h (working copy) @@ -217,6 +217,9 @@ inttcp6_port_min;/**< Minimum allowed port for the OOB listen socket */ inttcp6_port_range; /**< Range of allowed TCP ports */ #endif /* OPAL_WANT_IPV6 */ +#if OPAL_WANT_SDP +intsdp_enable; /**< support for SDP */ +#endif /* OAP_WANT_SDP */ opal_mutex_t tcp_lock; /**< lock for accessing module state */ opal_list_ttcp_events; /**< list of pending events (accepts) */ opal_list_ttcp_msg_post; /**< list of recieves user has posted */ Thanks, Verkhovsky Lenny. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Brian Barrett Open MPI developer http://www.open-mpi.org/
Re: [OMPI devel] collective problems
As it stands today, the problem is that we can inject things into the BTL successfully that are not injected into the NIC (due to software flow control). Once a message is injected into the BTL, the PML marks completion on the MPI request. If it was a blocking send that got marked as complete, but the message isn't injected into the NIC/NIC library, and the user doesn't re-enter the MPI library for a considerable amount of time, then we have a problem. Personally, I'd rather just not mark MPI completion until a local completion callback from the BTL. But others don't like that idea, so we came up with a way for back pressure from the BTL to say "it's not on the wire yet". This is more complicated than just not marking MPI completion early, but why would we do something that helps real apps at the expense of benchmarks? That would just be silly! Brian On Nov 7, 2007, at 7:56 PM, Richard Graham wrote: Does this mean that we don’t have a queue to store btl level descriptors that are only partially complete ? Do we do an all or nothing with respect to btl level requests at this stage ? Seems to me like we want to mark things complete at the MPI level ASAP, and that this proposal is not to do that – is this correct ? Rich On 11/7/07 11:26 PM, "Jeff Squyres"wrote: On Nov 7, 2007, at 9:33 PM, Patrick Geoffray wrote: >> Remember that this is all in the context of Galen's proposal for >> btl_send() to be able to return NOT_ON_WIRE -- meaning that the send >> was successful, but it has not yet been sent (e.g., openib BTL >> buffered it because it ran out of credits). > > Sorry if I miss something obvious, but why does the PML has to be > aware > of the flow control situation of the BTL ? If the BTL cannot send > something right away for any reason, it should be the responsibility > of > the BTL to buffer it and to progress on it later. That's currently the way it is. But the BTL currently only has the option to say two things: 1. "ok, done!" 
-- then the PML will think that the request is complete 2. "doh -- error!" -- then the PML thinks that Something Bad Happened(tm) What we really need is for the BTL to have a third option: 3. "not done yet!" So that the PML knows that the request is not yet done, but will allow other things to progress while we're waiting for it to complete. Without this, the openib BTL currently replies "ok, done!", even when it has only buffered a message (rather than actually sending it out). This optimization works great (yeah, I know...) except for apps that don't dip into the MPI library frequently. :-\ -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
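The three-way reply discussed in this thread can be sketched as follows. NOT_ON_WIRE is the name used in the discussion; every other identifier here is hypothetical, and this is not the actual BTL or PML interface, only a toy model of the proposed semantics.

```c
#include <stdbool.h>

/* Hypothetical three-way result of a BTL send. */
typedef enum {
    BTL_SEND_DONE,        /* "ok, done!": handed to the NIC            */
    BTL_SEND_NOT_ON_WIRE, /* "not done yet!": accepted, only buffered  */
    BTL_SEND_ERROR        /* "doh -- error!"                           */
} btl_send_result_t;

/* Toy stand-in for the PML's view of one send request. */
typedef struct {
    bool complete; /* MPI-level completion visible to the user       */
    bool pending;  /* waiting for a later local-completion callback  */
    bool failed;
} pml_request_t;

/* PML side: mark completion only when the BTL says the data left. */
void pml_handle_send_result(pml_request_t *req, btl_send_result_t rc)
{
    req->complete = (rc == BTL_SEND_DONE);
    req->pending  = (rc == BTL_SEND_NOT_ON_WIRE);
    req->failed   = (rc == BTL_SEND_ERROR);
}

/* Called later, once a buffered send has really gone out. */
void pml_on_local_completion(pml_request_t *req)
{
    if (req->pending) {
        req->pending  = false;
        req->complete = true;
    }
}
```

With only the first and third results available, a buffered send has to be reported as done, which is exactly the case that hurts applications that rarely re-enter the MPI library.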
Re: [OMPI devel] bml_btl->btl_alloc() instead of mca_bml_base_alloc() in OSC
Can't think of any good reason -- the patch should be fine.

Thanks,

Brian

On Oct 28, 2007, at 7:13 AM, Gleb Natapov wrote:

Hi Brian,

Is there a special reason why you call btl functions directly instead of using bml wrappers? What about applying this patch?

diff --git a/ompi/mca/osc/rdma/osc_rdma_component.c b/ompi/mca/osc/rdma/osc_rdma_component.c
index 2d0dc06..302dd9e 100644
--- a/ompi/mca/osc/rdma/osc_rdma_component.c
+++ b/ompi/mca/osc/rdma/osc_rdma_component.c
@@ -1044,9 +1044,8 @@ rdma_send_info_send(ompi_osc_rdma_module_t *module,
     ompi_osc_rdma_rdma_info_header_t *header = NULL;
 
     bml_btl = peer_send_info->bml_btl;
-    descriptor = bml_btl->btl_alloc(bml_btl->btl,
-                                    MCA_BTL_NO_ORDER,
-                                    sizeof(ompi_osc_rdma_rdma_info_header_t));
+    mca_bml_base_alloc(bml_btl, &descriptor, MCA_BTL_NO_ORDER,
+                       sizeof(ompi_osc_rdma_rdma_info_header_t));
     if (NULL == descriptor) {
         ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
         goto cleanup;
diff --git a/ompi/mca/osc/rdma/osc_rdma_data_move.c b/ompi/mca/osc/rdma/osc_rdma_data_move.c
index e9fd17c..e7b5813 100644
--- a/ompi/mca/osc/rdma/osc_rdma_data_move.c
+++ b/ompi/mca/osc/rdma/osc_rdma_data_move.c
@@ -454,10 +454,10 @@ ompi_osc_rdma_sendreq_send(ompi_osc_rdma_module_t *module,
     /* get a buffer... */
     endpoint = (mca_bml_base_endpoint_t*) sendreq->req_target_proc->proc_bml;
     bml_btl = mca_bml_base_btl_array_get_next(&endpoint->btl_eager);
-    descriptor = bml_btl->btl_alloc(bml_btl->btl,
-                                    MCA_BTL_NO_ORDER,
-                                    module->m_use_buffers ? bml_btl->btl_eager_limit : needed_len < bml_btl->btl_eager_limit ? needed_len :
-                                    bml_btl->btl_eager_limit);
+    mca_bml_base_alloc(bml_btl, &descriptor, MCA_BTL_NO_ORDER,
+                       module->m_use_buffers ? bml_btl->btl_eager_limit :
+                       needed_len < bml_btl->btl_eager_limit ? needed_len :
+                       bml_btl->btl_eager_limit);
     if (NULL == descriptor) {
         ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
         goto cleanup;
@@ -698,9 +698,8 @@ ompi_osc_rdma_replyreq_send(ompi_osc_rdma_module_t *module,
     /* Get a BTL and a fragment to go with it */
     endpoint = (mca_bml_base_endpoint_t*) replyreq->rep_origin_proc->proc_bml;
     bml_btl = mca_bml_base_btl_array_get_next(&endpoint->btl_eager);
-    descriptor = bml_btl->btl_alloc(bml_btl->btl,
-                                    MCA_BTL_NO_ORDER,
-                                    bml_btl->btl_eager_limit);
+    mca_bml_base_alloc(bml_btl, &descriptor, MCA_BTL_NO_ORDER,
+                       bml_btl->btl_eager_limit);
     if (NULL == descriptor) {
         ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
         goto cleanup;
@@ -1260,9 +1259,8 @@ ompi_osc_rdma_control_send(ompi_osc_rdma_module_t *module,
     /* Get a BTL and a fragment to go with it */
     endpoint = (mca_bml_base_endpoint_t*) proc->proc_bml;
     bml_btl = mca_bml_base_btl_array_get_next(&endpoint->btl_eager);
-    descriptor = bml_btl->btl_alloc(bml_btl->btl,
-                                    MCA_BTL_NO_ORDER,
-                                    sizeof(ompi_osc_rdma_control_header_t));
+    mca_bml_base_alloc(bml_btl, &descriptor, MCA_BTL_NO_ORDER,
+                       sizeof(ompi_osc_rdma_control_header_t));
     if (NULL == descriptor) {
         ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
         goto cleanup;
@@ -1322,9 +1320,8 @@ ompi_osc_rdma_rdma_ack_send(ompi_osc_rdma_module_t *module,
     ompi_osc_rdma_control_header_t *header = NULL;
 
     /* Get a BTL and a fragment to go with it */
-    descriptor = bml_btl->btl_alloc(bml_btl->btl,
-                                    rdma_btl->rdma_order,
-                                    sizeof(ompi_osc_rdma_control_header_t));
+    mca_bml_base_alloc(bml_btl, &descriptor, rdma_btl->rdma_order,
+                       sizeof(ompi_osc_rdma_control_header_t));
     if (NULL == descriptor) {
         ret = OMPI_ERR_TEMP_OUT_OF_RESOURCE;
         goto cleanup;

--
Gleb.
Re: [OMPI devel] problem in runing MPI job through XGrid
XGrid does not forward X11 credentials, so you would have to set up an X11 environment yourself. Using ssh or a local starter does forward X11 credentials, which is why it works in that case.

Brian

On Oct 25, 2007, at 10:23 PM, Jinhui Qin wrote:

Hi Brian,

I got another problem in running an MPI job through XGrid. During execution, the job calls Xlib functions (i.e. XOpenDisplay()) to open an X window. The XOpenDisplay() call failed (returned NULL); it cannot open a display no matter how many processors I request.

However, when I turned off the XGrid controller and used "mpirun -n 4 " to start the job again, four X windows opened properly, but all four processes were running on the local machine instead of on any remote nodes.

I have also tested using "ssh -X" from a terminal on my local machine to log in to any other node in the cluster and run the job (I have copies of the same job on all nodes, in the same path); the X window displays on my local machine properly. I know it is the "-X" option that sets up the environment properly for starting the X window. If I use only "ssh" without the "-X" option, it won't work.

I am wondering why the X window cannot open if the job is started through XGrid. How does the XGrid controller contact each agent node? Has anyone seen a similar problem?

I have installed X11 and Open MPI on all 8 Mac mini nodes in my cluster, and have also tested running an MPI job that has no X window function calls through XGrid; it worked perfectly fine on all nodes.

Thanks a lot for any suggestions!

Jane
Re: [OMPI devel] PML cm and heterogeneous support
No, it's because the CM PML was never designed to be used in a heterogeneous environment :). While the MX BTL does support heterogeneous operations (at one point, I believe I even had it working), none of the MTLs have ever been tested in heterogeneous environments and it's known the datatype usage in the CM PML won't support heterogeneous operation.

Brian

On Oct 24, 2007, at 6:21 PM, Jeff Squyres wrote:

George / Patrick / Rich / Christian -- Any idea why that's there? Is that because portals, MX, and PSM all require homogeneous environments?

On Oct 18, 2007, at 3:59 PM, Sajjad Tabib wrote:

Hi,

I tried to run an MPI program in a heterogeneous environment using the pml cm component. However, Open MPI returned an error message indicating that PML add procs returned "Not supported". I dived into the cm code to see what was wrong and came upon the code below, which basically shows that if the processes are running on different architectures, it returns "not supported". Now, I'm wondering whether my interpretation is correct or not. Is it true that the cm component does not support a heterogeneous environment? If so, will the developers support this in the future? How could I get around this while still using the cm component? What will happen if I rebuild Open MPI without these statements? I would appreciate your help.

Code:

    mca_pml_cm_add_procs(){
    #if OMPI_ENABLE_HETEROGENEOUS_SUPPORT
        for (i = 0 ; i < nprocs ; ++i) {
            if (procs[i]->proc_arch != ompi_proc_local()->proc_arch) {
                return OMPI_ERR_NOT_SUPPORTED;
            }
        }
    #endif
        . . .
    }

Sajjad Tabib

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Oct 23, 2007, at 10:58 AM, Patrick Geoffray wrote:

Bogdan Costescu wrote:

I made some progress: if I configure with "--without-memory-manager" (along with all other options that I mentioned before), then it works. This was inspired by the fact that the segmentation fault occurred in ptmalloc2. I had previously tried removing the MX support without any effect; with ptmalloc2 out of the picture I have had test runs over MX and TCP without problems.

We have had portability problems using ptmalloc2 in MPICH-GM, especially with respect to threads. In MX, we chose to use dlmalloc instead. It is not as optimized and its thread-safety has a coarser grain, but it is much more portable. Disabling the memory manager in Open MPI is not a bad thing for MX, as its own dlmalloc-based registration cache will operate transparently with MX_RCACHE=1 (default).

If you're not packaging Open MPI with MX support, I'd configure Open MPI with the extra parameters:

    --without-memory-manager --enable-mca-static=btl-mx,mtl-mx

This will provide the least possibility of something getting in the way of MX doing its thing with its memory hooks. It causes libmpi.so to depend on libmyriexpress.so, which is both a good and bad thing. Good because the malloc hooks in libmyriexpress aren't "seen" when we dlopen the OMPI MX drivers to suck in libmyriexpress, but they would be with this configuration. Bad in that libmpi.so now depends on libmyriexpress, so packaging for multiple machines could be more difficult.

Brian
Re: [OMPI devel] RFC: versioning OMPI libraries
BTW, here's the documentation I was referring to:

http://www.gnu.org/software/libtool/manual.html#Versioning

Now, the problem Open MPI faces is that while our MPI interface rarely changes (and almost never in a backwards-incompatible way), the interface between components and libraries does. So that could cause some interesting heartaches.

Good luck,

Brian

On Oct 15, 2007, at 1:56 PM, Jeff Squyres wrote:

Ok, having read the libtool docs now, I see why the release number is a bad idea. :-) I'm assuming that:

- The libmpi interface will rarely change, but we may add to it over time (there's a specific point about this in the libtool docs -- no problem)
- The libopen-rte interface historically has had major changes between major releases and may have interface changes between minor releases, too
- The libopen-pal interface is relatively stable -- I actually haven't been checking how often it changes

So if we do this, I think the RMs would need to be responsible for marshaling this information and setting the appropriate values. I can convert the build system to use this kind of information; the real question is whether the community wants to utilize it or not (and whether the RMs will take on the responsibility of gathering this data for each release).

On Oct 15, 2007, at 1:16 PM, Christian Bell wrote:

On Mon, 15 Oct 2007, Brian Barrett wrote:

No! :) It would be good for everyone to read the Libtool documentation to see why versioning on the release number would be a really bad idea. Then comment. But my opinion would be that you should change based on interface changes, not based on release numbers.

Yes, I second Brian. Notwithstanding what the popular vote is on MPI ABI compatibility across MPI implementations, I think that major/minor numbering within an implementation should be used to indicate when interfaces break, not give hints as to what release they pertain to. . .
christian -- christian.b...@qlogic.com (QLogic Host Solutions Group, formerly Pathscale) ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
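For readers who have not looked at the Libtool manual section linked in this thread, its current:revision:age update rules can be sketched as a small function. The struct and function names are made up for illustration; the rules themselves follow the "Updating version info" list in the Libtool manual.

```c
#include <stdbool.h>

/* Libtool -version-info triple: current:revision:age. */
typedef struct {
    unsigned current;
    unsigned revision;
    unsigned age;
} lt_version_t;

/*
 * Apply the Libtool update rules for one public release:
 *   - source changed at all            -> revision + 1
 *   - interfaces added/removed/changed -> current + 1, revision = 0
 *   - interfaces added                 -> age + 1
 *   - interfaces removed or changed    -> age = 0
 */
lt_version_t lt_version_update(lt_version_t v,
                               bool source_changed,
                               bool ifaces_added,
                               bool ifaces_removed_or_changed)
{
    if (source_changed) {
        v.revision++;
    }
    if (ifaces_added || ifaces_removed_or_changed) {
        v.current++;
        v.revision = 0;
    }
    if (ifaces_added) {
        v.age++;
    }
    if (ifaces_removed_or_changed) {
        v.age = 0;
    }
    return v;
}
```

Note that nothing in these rules looks at the package release number, which is the point being made above: a bug-fix release of a library with a stable interface only bumps the revision, while a release that breaks an internal interface resets age to 0 regardless of how small the release number change is.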
Re: [OMPI devel] RFC: versioning OMPI libraries
No! :) It would be good for everyone to read the Libtool documentation to see why versioning on the release number would be a really bad idea. Then comment. But my opinion would be that you should change based on interface changes, not based on release numbers. Brian On Oct 15, 2007, at 12:29 PM, Jeff Squyres wrote: WHAT: Add versioning to all OMPI libraries so that shared libraries use the real version number in the filename (vs. the current "*.so. 0.0.0") WHY: It's a Good Thing(tm) to do. WHERE: Minor changes in a few Makefile.am's; probably some small tweaking to top-level configure.ac and/or some support m4 files. WHEN: After timeout. TIMEOUT: COB, Tuesday Oct 23rd, 2007 - Currently, all OMPI shared libraries are created with the extension ".so.0.0.0". We have long discussed using Libtool properly to use a real/meaningful version number instead of "0.0.0" but no one has ever gotten a round tuit. I propose that v1.3 is [finally] the time to do this properly. I'm trolling through the configure/build system for a few other issues; I could pick this up along the way. My specific proposal is that all shared libraries be suffixed the numeric version number of Open MPI itself. For example, the first release that uses this functionality will have libmpi.so.1.3.0. Note that this still does not enable installing multiple versions of OMPI into the same prefix (for lots of other reasons not covered here), but at least it does allow multiple libraries in the same tree for backwards binary compatibility issues, and gives a visual reference of the library's version number in its filename. DSOs will remain un-suffixed (e.g., mca_btl_openib.so). -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] problem in runing MPI job through XGrid
On Oct 4, 2007, at 3:06 PM, Jinhui Qin wrote: sib:sharcnet$ mpirun -n 3 ~/openMPI_stuff/Hello Process 0.1.1 is unable to reach 0.1.2 for MPI communication. If you specified the use of a BTL component, you may have forgotten a component (such as "self") in the list of usable components. This is very odd -- it looks like two of the processes don't think they can talk to each other. Can you try running with: mpirun -n 3 -mca btl tcp,self If that fails, then the next piece of information that would be useful is the IP addresses and netmasks for all the nodes in your cluster. We have some logic in our TCP communication system that can cause some interesting results for some network topologies. Just to verify it's not an XGrid problem, you might want to try running with a hostfile -- I think you'll find that the results are the same, but it's always good to verify. Brian
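As background for why addresses and netmasks matter here, the following sketch shows a plain same-subnet test for two IPv4 addresses. It is not Open MPI's actual TCP reachability logic, which the message does not describe; the function name is hypothetical and the check is only the textbook comparison.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Return true if dotted-quad addresses a and b fall in the same subnet
 * under the given netmask. Returns false if any string fails to parse.
 */
bool same_ipv4_subnet(const char *a, const char *b, const char *netmask)
{
    struct in_addr ia, ib, im;

    if (inet_pton(AF_INET, a, &ia) != 1 ||
        inet_pton(AF_INET, b, &ib) != 1 ||
        inet_pton(AF_INET, netmask, &im) != 1) {
        return false;
    }
    /* All three values are in network byte order, so masking is consistent. */
    return (ia.s_addr & im.s_addr) == (ib.s_addr & im.s_addr);
}
```

Two nodes at 192.168.1.10 and 192.168.2.10 look unrelated under 255.255.255.0 but adjacent under 255.255.0.0, which is the kind of detail that makes the requested list of addresses and netmasks useful for diagnosis.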
Re: [OMPI devel] OpenIB component static compile failures
By the way, I filed a bug on this issue: https://svn.open-mpi.org/trac/ompi/ticket/1155 Brian On Oct 2, 2007, at 8:57 AM, Brian Barrett wrote: No, actually, my report isn't about that issue at all. I'm not talking about making an entirely statically built application. I'm talking about a statically compiled Open MPI with a dynamically linked application and OFED. Take a look at the output of mpicc - showme -- it's not adding *ANY* -l or -L options for InfiniBand. This is something wrong with Open MPI's configure, which has changed since v1.2. On v1.2, the same commands result in: [10:54] brbarret@odin:pts/27 v1.2> mpicc -showme gcc -I/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/ include -pthread -L/usr/local/ofed/lib64 -L/u/brbarret/Software/ x86_64-unknown-linux-gnu/ompi/devel/lib -lmpi -lopen-rte -lopen-pal -libverbs -lrt -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl [10:55] brbarret@odin:pts/27 examples> make ring_c mpicc -gring_c.c -o ring_c [10:55] brbarret@odin:pts/27 examples> Brian On Oct 1, 2007, at 10:21 PM, Jeff Squyres wrote: This is a known issue; no one had expressed any desire to have it fixed: http://www.open-mpi.org/faq/?category=mpi-apps#static-mpi-apps http://www.open-mpi.org/faq/?category=mpi-apps#static-ofa-mpi-apps Feel free to file a ticket and fix if you'd like... On Oct 1, 2007, at 11:56 PM, Brian Barrett wrote: Hi all - There's a problem with the OpenIB components when statically linking. For whatever reason, the configure logic is not adding the right -L and -l flags to the mpicc wrapper flags. 
[17:26] brbarret@odin:pts/8 examples> mpicc -showme gcc -I/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/ include -pthread -L/u/brbarret/Software/x86_64-unknown-linux-gnu/ ompi/ devel/lib -lmpi -lopen-rte -lopen-pal -lnuma -ldl -Wl,--export- dynamic -lnsl -lutil -lm -ldl [17:42] brbarret@odin:pts/8 examples> make hello_c mpicc -ghello_c.c -o hello_c /u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/lib/ libmpi.a (btl_openib_component.o)(.text+0x895): In function `openib_reg_mr': /u/brbarret/odin/ompi/trunk/ompi/mca/btl/openib/ btl_openib_component.c:304: undefined reference to `ibv_reg_mr' and many more, obviously. Good luck, Brian ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] OpenIB component static compile failures
No, actually, my report isn't about that issue at all. I'm not talking about making an entirely statically built application. I'm talking about a statically compiled Open MPI with a dynamically linked application and OFED. Take a look at the output of mpicc -showme -- it's not adding *ANY* -l or -L options for InfiniBand. This is something wrong with Open MPI's configure, which has changed since v1.2. On v1.2, the same commands result in:

[10:54] brbarret@odin:pts/27 v1.2> mpicc -showme
gcc -I/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/include -pthread -L/usr/local/ofed/lib64 -L/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/lib -lmpi -lopen-rte -lopen-pal -libverbs -lrt -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
[10:55] brbarret@odin:pts/27 examples> make ring_c
mpicc -g ring_c.c -o ring_c
[10:55] brbarret@odin:pts/27 examples>

Brian

On Oct 1, 2007, at 10:21 PM, Jeff Squyres wrote:

This is a known issue; no one had expressed any desire to have it fixed:

http://www.open-mpi.org/faq/?category=mpi-apps#static-mpi-apps
http://www.open-mpi.org/faq/?category=mpi-apps#static-ofa-mpi-apps

Feel free to file a ticket and fix if you'd like...

On Oct 1, 2007, at 11:56 PM, Brian Barrett wrote:

Hi all -

There's a problem with the OpenIB components when statically linking. For whatever reason, the configure logic is not adding the right -L and -l flags to the mpicc wrapper flags.

[17:26] brbarret@odin:pts/8 examples> mpicc -showme
gcc -I/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/include -pthread -L/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/lib -lmpi -lopen-rte -lopen-pal -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
[17:42] brbarret@odin:pts/8 examples> make hello_c
mpicc -g hello_c.c -o hello_c
/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/lib/libmpi.a(btl_openib_component.o)(.text+0x895): In function `openib_reg_mr':
/u/brbarret/odin/ompi/trunk/ompi/mca/btl/openib/btl_openib_component.c:304: undefined reference to `ibv_reg_mr'

and many more, obviously.

Good luck,

Brian

-- Jeff Squyres Cisco Systems
[OMPI devel] OpenIB component static compile failures
Hi all -

There's a problem with the OpenIB components when statically linking. For whatever reason, the configure logic is not adding the right -L and -l flags to the mpicc wrapper flags.

[17:26] brbarret@odin:pts/8 examples> mpicc -showme
gcc -I/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/include -pthread -L/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/lib -lmpi -lopen-rte -lopen-pal -lnuma -ldl -Wl,--export-dynamic -lnsl -lutil -lm -ldl
[17:42] brbarret@odin:pts/8 examples> make hello_c
mpicc -g hello_c.c -o hello_c
/u/brbarret/Software/x86_64-unknown-linux-gnu/ompi/devel/lib/libmpi.a(btl_openib_component.o)(.text+0x895): In function `openib_reg_mr':
/u/brbarret/odin/ompi/trunk/ompi/mca/btl/openib/btl_openib_component.c:304: undefined reference to `ibv_reg_mr'

and many more, obviously.

Good luck,

Brian
Re: [OMPI devel] Malloc segfaulting?
On Sep 20, 2007, at 7:02 AM, Tim Prins wrote:

In our nightly runs with the trunk I have started seeing cases where we appear to be segfaulting within/below malloc. Below is a typical output. Note that this appears to only happen on the trunk, when we use openib, and are in 32-bit mode. It seems to happen randomly at a very low frequency (59 out of about 60,000 32-bit openib runs). This could be a problem with our machine, and has shown up since I started testing 32-bit OFED 10 days ago. Anyways, just curious if anyone had any ideas.

As someone else said, this usually points to a duplicate free or the like in malloc. You might want to try compiling with --without-memory-manager, as the ptmalloc2 in glibc is frequently more verbose about where errors occurred than the one in Open MPI.

Brian
Re: [OMPI devel] FreeBSD Support?
On Sep 19, 2007, at 4:11 PM, Tim Prins wrote: Here is where it gets nasty. On FreeBSD, /usr/include/string.h includes strings.h in some cases. But there is a strings.h in the ompi/mpi/f77 directory, so that is getting included instead of the proper /usr/include/strings.h. I suppose we could rename our strings.h to f77_strings.h, or something similar. Does anyone have an opinion on this? I think this is the best path forward. Ugh. Brian
Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?
On Aug 28, 2007, at 9:05 AM, Li-Ta Lo wrote: On Mon, 2007-08-27 at 15:10 -0400, Rolf vandeVaart wrote: We are running into a problem when running on one of our larger SMPs using the latest Open MPI v1.2 branch. We are trying to run a job with np=128 within a single node. We are seeing the following error: "SM failed to send message due to shortage of shared memory." We then increased the allowable maximum size of the shared segment to 2Gigabytes-1 which is the maximum allowed on 32-bit application. We used the mca parameter to increase it as shown here. -mca mpool_sm_max_size 2147483647 This allowed the program to run to completion. Therefore, we would like to increase the default maximum from 512Mbytes to 2G-1 Gigabytes. Does anyone have an objection to this change? Soon we are going to have larger CPU counts and would like to increase the odds that things work "out of the box" on these large SMPs. There is a serious problem with the 1.2 branch, it does not allocate any SM area for each process at the beginning. SM areas are allocated on demand and if some of the processes are more aggressive than the others, it will cause starvation. This problem is fixed in the trunk by assign at least one SM area for each process. I think this is what you saw (starvation) and an increase of max size may not be necessary. Although I'm pretty sure this is fixed in the v1.2 branch already. I don't think we should raise that ceiling at this point. We create the file in /tmp, and if someone does -np 32 on a single, small node (not unheard of), it'll do really evil things. Personally, I don't think we need nearly as much shared memory as we're using. It's a bad design in terms of its unbounded memory usage. We should fix that, rather than making the file bigger. But I'm not going to fix it, so take my opinion with a grain of salt. Brian
Re: [OMPI devel] ompi_mpi_abort
On Aug 25, 2007, at 7:10 AM, Jeff Squyres wrote: 1. We have logic in ompi_mpi_abort to prevent recursive invocation (ompi_mpi_abort.c:60): /* Protection for recursive invocation */ if (have_been_invoked) { return OMPI_SUCCESS; } have_been_invoked = true; This, IMHO, is a wrong thing to do. The intent of ompi_mpi_abort() was that it never returned. But now it is? That seems wrong to me. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer
On Aug 24, 2007, at 9:08 AM, George Bosilca wrote: By heterogeneous RTE I was talking about what will happened once we have the RSL. Different back-end will support different features, so from the user perspective we will not provide a homogeneous execution environment in all situations. On the other hand, focusing our efforts in ORTE will guarantee this homogeneity in all cases. Is this a good thing? I think no, and we already don't have it. On Cray, we don't use mpirun but yod. Livermore wants us to use SLURM directly instead of our mpirun kludge. Those are heterogeneous from the user perspective. But are also what the user expects on those platforms. Brian
Re: [OMPI devel] Question on NX bit patch in Debian
You are correct -- we now add a .note.GNU-stack to the assembly file if the assembler supports it, so that patch should no longer be needed. Brian On Aug 18, 2007, at 9:03 AM, Manuel Prinz wrote: Hi everyone, in the Debian package of OpenMPI there has been a patch [1] for some time which I think is obsolete. I did some reading on that topic but I'm not very familiar with assembler, so I'm asking you here. As far as I can see, removing the patch doesn't change the binaries much. Neither scanelf nor readelf show something I'd consider as suspicious. I think that the .note.GNU-stack instruction is added to the assembler files by generate-asm.pl, so everything's set properly. But as I said, I'm not very familiar with the matter and it would be great to get a statement on that issue from you. (We could drop a rather large patch along with this one, if it's obsolete.) Thanks in advance! Best regards Manuel Footnote: 1. http://svn.debian.org/wsvn/pkg-openmpi/openmpi/trunk/debian/patches/10opal_noexecstack.dpatch?op=file=0=0
Re: [OMPI devel] [OMPI svn] svn:open-mpi r15903
Fixed. Sorry about the configure change mid-day, but it seemed like the right thing to do. Brian On Aug 17, 2007, at 10:37 AM, Brian Barrett wrote: Oh, crud. I forgot to fix that issue. Will fix asap. Brian On Aug 17, 2007, at 10:12 AM, George Bosilca wrote: This patch break the trunk. It looks like the LT_PACKAGE_VERSION wasn't defined before the 2.x version. The autogen fails with the following error: *** Running GNU tools [Running] autom4te --language=m4sh ompi_get_version.m4sh -o ompi_get_version.sh [Running] aclocal configure.ac:998: error: m4_defn: undefined macro: LT_PACKAGE_VERSION configure.ac:998: the top level autom4te: /usr/bin/m4 failed with exit status: 1 aclocal: autom4te failed with exit status: 1 george. On Aug 17, 2007, at 12:08 AM, brbar...@osl.iu.edu wrote: Author: brbarret Date: 2007-08-17 00:08:23 EDT (Fri, 17 Aug 2007) New Revision: 15903 URL: https://svn.open-mpi.org/trac/ompi/changeset/15903 Log: Support versions of the Libtool 2.1a snapshots after the lt_dladvise code was brought in. This supercedes the GLOBL patch that we had been using with Libtool 2.1a versions prior to the lt_dladvise code. Autogen tries to figure out which version you're on, so either will now work with the trunk. 
Text files modified: trunk/configure.ac |18 + +++-- trunk/opal/mca/base/mca_base_component_find.c | 8 + +++ trunk/opal/mca/base/mca_base_component_repository.c |24 + +++ 3 files changed, 48 insertions(+), 2 deletions(-) Modified: trunk/configure.ac = = --- trunk/configure.ac (original) +++ trunk/configure.ac 2007-08-17 00:08:23 EDT (Fri, 17 Aug 2007) @@ -995,10 +995,15 @@ ompi_show_subtitle "Libtool configuration" +m4_if(m4_version_compare(m4_defn([LT_PACKAGE_VERSION]), 2.0), -1, [ AC_LIBLTDL_CONVENIENCE(opal/libltdl) AC_LIBTOOL_DLOPEN AC_PROG_LIBTOOL - +], [ +LT_CONFIG_LTDL_DIR([opal/libltdl], [subproject]) +LTDL_CONVENIENCE +LT_INIT([dlopen win32-dll]) +]) ompi_show_subtitle "GNU libltdl setup" # AC_CONFIG_SUBDIRS appears to be broken for non-gcc compilers (i.e., @@ -1038,6 +1043,13 @@ if test "$HAPPY" = "1"; then LIBLTDL_SUBDIR=libltdl +CPPFLAGS_save="$CPPFLAGS" +CPPFLAGS="-I." +AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h], +[OPAL_HAVE_LTDL_ADVISE=1], +[OPAL_HAVE_LTDL_ADVISE=0]) +CPPFLAGS="$CPPFLAGS" + # Arrgh. This is gross. But I can't think of any other way to do # it. :-( @@ -1057,7 +1069,7 @@ AC_MSG_WARN([libltdl support disabled (by --disable-dlopen)]) LIBLTDL_SUBDIR= -LIBLTDL= +OPAL_HAVE_LTDL_ADVISE=0 # append instead of prepend, since LIBS are going to be system # type things needed by everyone. 
Normally, libltdl will push @@ -1073,6 +1085,8 @@ AC_DEFINE_UNQUOTED(OMPI_WANT_LIBLTDL, $OMPI_ENABLE_DLOPEN_SUPPORT, [Whether to include support for libltdl or not]) +AC_DEFINE_UNQUOTED(OPAL_HAVE_LTDL_ADVISE, $OPAL_HAVE_LTDL_ADVISE, +[Whether libltdl appears to have the lt_dladvise interface]) ## # visibility Modified: trunk/opal/mca/base/mca_base_component_find.c = = --- trunk/opal/mca/base/mca_base_component_find.c (original) +++ trunk/opal/mca/base/mca_base_component_find.c 2007-08-17 00:08:23 EDT (Fri, 17 Aug 2007) @@ -75,6 +75,10 @@ char name[MCA_BASE_MAX_COMPONENT_NAME_LEN]; }; typedef struct ltfn_data_holder_t ltfn_data_holder_t; + +#if OPAL_HAVE_LTDL_ADVISE +extern lt_dladvise opal_mca_dladvise; +#endif #endif /* OMPI_WANT_LIBLTDL */ @@ -387,7 +391,11 @@ /* Now try to load the component */ +#if OPAL_HAVE_LTDL_ADVISE + component_handle = lt_dlopenadvise(target_file->filename, opal_mca_dladvise); +#else component_handle = lt_dlopenext(target_file->filename); +#endif if (NULL == component_handle) { err = strdup(lt_dlerror()); if (0 != show_errors) { Modified: trunk/opal/mca/base/mca_base_component_repository.c = = --- trunk/opal/mca/base/mca_base_component_repository.c (original) +++ trunk/opal/mca/base/mca_base_component_repository.c 2007-08-17 00:08:23 EDT (Fri, 17 Aug 2007) @@ -85,6 +85,10 @@ static repository_item_t *find_component(const char *type, const char *name); static int link_items(repository_item_t *src, repository_item_t *depend); +#if OPAL_HAVE_LTDL_ADVISE +lt_dladvise opal_mca_dladvise; +#endif + #endif /* OMPI_WANT_LIBLTDL */ @@ -103,6 +107,20 @@ return OPAL_ERR_OUT_OF_RESOURCE; } +#if OPAL_HAVE_LTDL_ADVISE +if (lt_dladvise_init(_mca_dladvise)) { +retu
Re: [OMPI devel] [OMPI svn] svn:open-mpi r15903
Oh, crud. I forgot to fix that issue. Will fix asap. Brian On Aug 17, 2007, at 10:12 AM, George Bosilca wrote: This patch break the trunk. It looks like the LT_PACKAGE_VERSION wasn't defined before the 2.x version. The autogen fails with the following error: *** Running GNU tools [Running] autom4te --language=m4sh ompi_get_version.m4sh -o ompi_get_version.sh [Running] aclocal configure.ac:998: error: m4_defn: undefined macro: LT_PACKAGE_VERSION configure.ac:998: the top level autom4te: /usr/bin/m4 failed with exit status: 1 aclocal: autom4te failed with exit status: 1 george. On Aug 17, 2007, at 12:08 AM, brbar...@osl.iu.edu wrote: Author: brbarret Date: 2007-08-17 00:08:23 EDT (Fri, 17 Aug 2007) New Revision: 15903 URL: https://svn.open-mpi.org/trac/ompi/changeset/15903 Log: Support versions of the Libtool 2.1a snapshots after the lt_dladvise code was brought in. This supercedes the GLOBL patch that we had been using with Libtool 2.1a versions prior to the lt_dladvise code. Autogen tries to figure out which version you're on, so either will now work with the trunk. Text files modified: trunk/configure.ac |18 + +++-- trunk/opal/mca/base/mca_base_component_find.c | 8 + +++ trunk/opal/mca/base/mca_base_component_repository.c |24 + +++ 3 files changed, 48 insertions(+), 2 deletions(-) Modified: trunk/configure.ac = = --- trunk/configure.ac (original) +++ trunk/configure.ac 2007-08-17 00:08:23 EDT (Fri, 17 Aug 2007) @@ -995,10 +995,15 @@ ompi_show_subtitle "Libtool configuration" +m4_if(m4_version_compare(m4_defn([LT_PACKAGE_VERSION]), 2.0), -1, [ AC_LIBLTDL_CONVENIENCE(opal/libltdl) AC_LIBTOOL_DLOPEN AC_PROG_LIBTOOL - +], [ +LT_CONFIG_LTDL_DIR([opal/libltdl], [subproject]) +LTDL_CONVENIENCE +LT_INIT([dlopen win32-dll]) +]) ompi_show_subtitle "GNU libltdl setup" # AC_CONFIG_SUBDIRS appears to be broken for non-gcc compilers (i.e., @@ -1038,6 +1043,13 @@ if test "$HAPPY" = "1"; then LIBLTDL_SUBDIR=libltdl +CPPFLAGS_save="$CPPFLAGS" +CPPFLAGS="-I." 
+AC_EGREP_HEADER([lt_dladvise_init], [opal/libltdl/ltdl.h], +[OPAL_HAVE_LTDL_ADVISE=1], +[OPAL_HAVE_LTDL_ADVISE=0]) +CPPFLAGS="$CPPFLAGS" + # Arrgh. This is gross. But I can't think of any other way to do # it. :-( @@ -1057,7 +1069,7 @@ AC_MSG_WARN([libltdl support disabled (by --disable-dlopen)]) LIBLTDL_SUBDIR= -LIBLTDL= +OPAL_HAVE_LTDL_ADVISE=0 # append instead of prepend, since LIBS are going to be system # type things needed by everyone. Normally, libltdl will push @@ -1073,6 +1085,8 @@ AC_DEFINE_UNQUOTED(OMPI_WANT_LIBLTDL, $OMPI_ENABLE_DLOPEN_SUPPORT, [Whether to include support for libltdl or not]) +AC_DEFINE_UNQUOTED(OPAL_HAVE_LTDL_ADVISE, $OPAL_HAVE_LTDL_ADVISE, +[Whether libltdl appears to have the lt_dladvise interface]) ## # visibility Modified: trunk/opal/mca/base/mca_base_component_find.c = = --- trunk/opal/mca/base/mca_base_component_find.c (original) +++ trunk/opal/mca/base/mca_base_component_find.c 2007-08-17 00:08:23 EDT (Fri, 17 Aug 2007) @@ -75,6 +75,10 @@ char name[MCA_BASE_MAX_COMPONENT_NAME_LEN]; }; typedef struct ltfn_data_holder_t ltfn_data_holder_t; + +#if OPAL_HAVE_LTDL_ADVISE +extern lt_dladvise opal_mca_dladvise; +#endif #endif /* OMPI_WANT_LIBLTDL */ @@ -387,7 +391,11 @@ /* Now try to load the component */ +#if OPAL_HAVE_LTDL_ADVISE + component_handle = lt_dlopenadvise(target_file->filename, opal_mca_dladvise); +#else component_handle = lt_dlopenext(target_file->filename); +#endif if (NULL == component_handle) { err = strdup(lt_dlerror()); if (0 != show_errors) { Modified: trunk/opal/mca/base/mca_base_component_repository.c = = --- trunk/opal/mca/base/mca_base_component_repository.c (original) +++ trunk/opal/mca/base/mca_base_component_repository.c 2007-08-17 00:08:23 EDT (Fri, 17 Aug 2007) @@ -85,6 +85,10 @@ static repository_item_t *find_component(const char *type, const char *name); static int link_items(repository_item_t *src, repository_item_t *depend); +#if OPAL_HAVE_LTDL_ADVISE +lt_dladvise opal_mca_dladvise; +#endif + #endif /* 
OMPI_WANT_LIBLTDL */ @@ -103,6 +107,20 @@ return OPAL_ERR_OUT_OF_RESOURCE; } +#if OPAL_HAVE_LTDL_ADVISE +if (lt_dladvise_init(&opal_mca_dladvise)) { +return OPAL_ERR_OUT_OF_RESOURCE; +} + +if (lt_dladvise_ext(&opal_mca_dladvise)) { +return OPAL_ERROR; +} + +if (lt_dladvise_global(&opal_mca_dladvise)) { +return OPAL_ERROR; +} +#endif + OBJ_CONSTRUCT(,
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 9:33 AM, George Bosilca wrote: On Aug 13, 2007, at 11:28 AM, Pavel Shamis (Pasha) wrote: Jeff Squyres wrote: I guess reading the graph that Pasha sent is difficult; Pasha -- can you send the actual numbers? Ok, here are the numbers on my machines: 0 bytes: mvapich with header caching: 1.56; mvapich without header caching: 1.79; ompi 1.2: 1.59. So on zero bytes ompi is not so bad. Also we can see that header caching decreases the mvapich latency by 0.23. 1 byte: mvapich with header caching: 1.58; mvapich without header caching: 1.83; ompi 1.2: 1.73. And here ompi makes some latency jump. In mvapich the header caching decreases the header size from 56 bytes to 12 bytes. What is the header size (pml + btl) in ompi? The match header size is 16 bytes, so it looks like ours is already optimized ... Pasha -- Is your build of Open MPI built with --disable-heterogeneous? If not, our headers all grow slightly to support heterogeneous operations. For the heterogeneous case, a 1 byte message includes: 16 bytes for the match header, 4 bytes for the Open IB header, and 1 byte for the payload, for 21 bytes total. If you are using eager RDMA, there's an extra 4 bytes for the RDMA length in the footer. Without heterogeneous support, 2 bytes get knocked off the size of the match header, so the whole thing will be 19 bytes (+ 4 for the eager RDMA footer). There are also considerably more ifs in the code if heterogeneous is used, especially on x86 machines. Brian
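The byte counts in the message above can be checked with a few lines of C. This is only a sketch of the arithmetic: the constants come straight from the message, while the helper name wire_bytes and the enum names are made up for illustration and are not Open MPI identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sizes in bytes, as quoted in the message above. */
enum {
    MATCH_HDR_HETEROGENEOUS = 16,
    MATCH_HDR_HOMOGENEOUS   = 14, /* 2 bytes smaller without heterogeneous support */
    OPENIB_HDR              = 4,
    EAGER_RDMA_FOOTER       = 4   /* RDMA length carried in the footer */
};

/* Hypothetical helper: bytes on the wire for one small eager message. */
static size_t wire_bytes(size_t payload, bool heterogeneous, bool eager_rdma)
{
    size_t total = payload + OPENIB_HDR;

    total += heterogeneous ? MATCH_HDR_HETEROGENEOUS : MATCH_HDR_HOMOGENEOUS;
    if (eager_rdma) {
        total += EAGER_RDMA_FOOTER;
    }
    return total;
}
```

For a 1 byte payload this gives 21 bytes with heterogeneous support and 19 without, plus 4 in either case when eager RDMA is in use.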
[OMPI devel] Collectives interface change
Hi all - There was significant discussion this week at the collectives meeting about improving the selection logic for collective components. While we'd like the automated collectives selection logic laid out in the Collv2 document, it was decided that as a first step, we would allow more than one + basic components to be used for a given communicator. This mandated the change of a couple of things in the collectives interface, namely how collectives module data is found (passed into a function, rather than a static pointer on the component) and a bit of the initialization sequence. The revised interface and the rest of the code are available in an svn temp branch: https://svn.open-mpi.org/svn/ompi/tmp/bwb-coll-select Thus far, most of the components in common use have been updated. The notable exception is the tuned collectives routine, which Ollie is updating in the near future. If you have any comments on the changes, please let me know. If not, the changes will move to the trunk once Ollie has finished updating the tuned component. Brian
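To make the interface change a bit more concrete, here is a rough sketch of what "module data passed into a function" could look like when several components serve one communicator. None of these names come from the bwb-coll-select branch; coll_module_t, comm_coll_t, coll_install and the toy barrier/bcast functions are all hypothetical, and the real interface may well differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-communicator module state owned by one component. */
typedef struct coll_module {
    const char *name;
    int calls; /* stands in for whatever data the module keeps */
} coll_module_t;

/* Every collective receives its module explicitly instead of looking it
 * up through a static pointer on the component. */
typedef int (*barrier_fn_t)(coll_module_t *module);
typedef int (*bcast_fn_t)(int *buf, int count, coll_module_t *module);

/* Each slot pairs a function with the module that supplied it, so
 * different components can fill different slots on one communicator. */
typedef struct comm_coll {
    barrier_fn_t   barrier;
    coll_module_t *barrier_module;
    bcast_fn_t     bcast;
    coll_module_t *bcast_module;
} comm_coll_t;

static int basic_barrier(coll_module_t *m) { m->calls++; return 0; }

static int basic_bcast(int *buf, int count, coll_module_t *m)
{
    (void)buf; (void)count;
    m->calls++;
    return 0;
}

static int fancy_bcast(int *buf, int count, coll_module_t *m)
{
    (void)buf; (void)count;
    m->calls++;
    return 0;
}

/* Install the functions a module offers; NULL means "do not override". */
static void coll_install(comm_coll_t *c, coll_module_t *m,
                         barrier_fn_t barrier, bcast_fn_t bcast)
{
    if (barrier != NULL) { c->barrier = barrier; c->barrier_module = m; }
    if (bcast != NULL)   { c->bcast = bcast;     c->bcast_module = m; }
}
```

Installing the basic module first and a second module afterwards leaves barrier served by basic and bcast served by the second module, each called with its own data.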
Re: [OMPI devel] Startup failure on mixed IPv4/IPv6 environment (oob tcp bug?)
On Aug 5, 2007, at 3:05 PM, dispan...@sobel.ls.la wrote: I fixed the problem by setting the peer_state to MCA_OOB_TCP_CONNECTING after creating the socket, which works for me. I'm not sure if this is always correct, though. Can you try the attached patch? It's pretty close to what you've suggested, but should eliminate one corner case that you could, in theory, run into with your solution. You are using a nightly tarball from the trunk, correct? Thanks, Brian oob_ipv6.diff Description: Binary data
Re: [OMPI devel] MPI_Win_get_group
On Jul 28, 2007, at 6:27 AM, Jeff Squyres wrote: On Jul 27, 2007, at 8:27 PM, Lisandro Dalcin wrote: MPI_WIN_GET_GROUP returns a duplicate of the group of the communicator used to create the window associated with win. The group is returned in group. Well, it seems OMPI (v1.2 svn) is not returning a duplicate, comparing the handles with == C operator gives true. Can you confirm this? Should the word 'duplicate' be interpreted as 'a new reference to' ? I would tend to agree with this wording; I think we're doing the wrong thing. Brian -- what do you think? In my opinion, we conform to the standard. We reference count the group, the count is incremented on each call to MPI_WIN_GET_GROUP, and you can safely call MPI_GROUP_FREE on the group returned from MPI_WIN_GET_GROUP. Groups are essentially immutable, so there's no way I can think of that we violate the MPI standard. Others are, of course, free to disagree with me. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
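The reference-counting argument above can be illustrated without MPI at all. The following is a toy model, not Open MPI code: group_t, window_t, win_get_group and group_free are invented names, and the only point is that handing out the same handle with a bumped count behaves like a duplicate as long as the object is immutable.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy immutable group: nothing changes after creation except refcount. */
typedef struct group {
    int refcount;
    int size;
} group_t;

typedef struct window {
    group_t *group;
} window_t;

static group_t *group_create(int size)
{
    group_t *g = malloc(sizeof *g);
    if (g != NULL) {
        g->refcount = 1;
        g->size = size;
    }
    return g;
}

/* "Duplicate" by returning a new reference to the same object. */
static group_t *win_get_group(window_t *w)
{
    w->group->refcount++;
    return w->group;
}

/* Drop one reference and null the caller's handle; the storage goes
 * away only with the last reference. */
static void group_free(group_t **g)
{
    if (--(*g)->refcount == 0) {
        free(*g);
    }
    *g = NULL;
}
```

Two calls to win_get_group compare equal with ==, yet each returned handle can be freed independently, which is the behavior Lisandro observed.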
Re: [OMPI devel] [RFC] Sparse group implementation
On Jul 26, 2007, at 1:01 PM, Mohamad Chaarawi wrote: On Thu, July 26, 2007 1:18 pm, Brian Barrett wrote: On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote: 2) I think it would be better to always have the flags and macros available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even when sparse groups are disabled. They don't take up any space, and mean fewer #ifs in the general code base That's what I actually was proposing.. keep the flags (there are no macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters in the group structure, and this will mean, only 1 maybe 2 #ifs.. Why would this mean having the sparse parameters in the group structure? not sure if I understood your question right, but in the group struct we added 5 integers and 3 pointers.. if we want to compile these out, we would then need all the #ifs around the code where we use these parameters.. I don't follow why you would need all the sparse stuff in ompi_group_t when OMPI_GROUP_SPARSE is 0. The OMPI_GROUP_IS and OMPI_GROUP_SET macros only modify grp_flags, which is always present. Like the ompi_group_peer_lookup, much can be hidden inside the functions rather than exposed through the interface, if you're concerned about the other functionality currently #if'ed in the code. Brian
Re: [OMPI devel] [RFC] Sparse group implementation
On Jul 26, 2007, at 12:00 PM, Mohamad Chaarawi wrote: 2) I think it would be better to always have the flags and macros available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even when sparse groups are disabled. They don't take up any space, and mean fewer #ifs in the general code base That's what I actually was proposing.. keep the flags (there are no macros, just the GROUP_GET_PROC_POINTER) and the sparse parameters in the group structure, and this will mean, only 1 maybe 2 #ifs.. Why would this mean having the sparse parameters in the group structure? 3) Instead of the GROUP_GET_PROC_POINTER macro, why not just change the behavior of the ompi_group_peer_lookup() function, so that there is symmetry with how you get a proc from a communicator? static inline functions (especially short ones like that) are basically free. We'll still have to fix all the places in the code where the macro is used or people poke directly at the group structure, but I like static inline over macros whenever possible. So much easier to debug. Actually I never knew till this morning that this function was in the code.. I have an inline function ompi_group_lookup (which does the same thing), that actually checks if the group is dense or not and acts accordingly.. but to use the inline function instead of the macro, means again that we need to compile in all the sparse parameters/code, which I'm for.. No, it doesn't. Just have something like: static inline ompi_proc_t* ompi_group_lookup(ompi_group_t *group, int peer) { #if OMPI_GROUP_SPARSE /* big long lookup function for sparse groups here */ #else return group->grp_proc_pointers[peer]; #endif } Brian
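A slightly fuller, self-contained version of that idea might look like the sketch below. It is not the Open MPI code: group_t, proc_t, GROUP_SPARSE, the GROUP_IS_DENSE/GROUP_SET_DENSE macros and group_lookup_sparse are stand-in names chosen for illustration. The flag macros stay available in every build because they only touch the always-present flags word, while the sparse-only member and the slow-path call are compiled out when GROUP_SPARSE is 0.

```c
#include <assert.h>
#include <stddef.h>

#ifndef GROUP_SPARSE
#define GROUP_SPARSE 0 /* assumed default: dense-only build */
#endif

/* Flag macros exist in every build; they only touch the flags word. */
#define GROUP_FLAG_DENSE   0x1
#define GROUP_IS_DENSE(g)  (((g)->flags & GROUP_FLAG_DENSE) != 0)
#define GROUP_SET_DENSE(g) ((g)->flags |= GROUP_FLAG_DENSE)

typedef struct proc { int rank; } proc_t;

typedef struct group {
    int      flags;  /* always present */
    int      size;
    proc_t **procs;  /* dense representation */
#if GROUP_SPARSE
    struct group *parent; /* sparse-only member, compiled out otherwise */
#endif
} group_t;

#if GROUP_SPARSE
/* Slow path, assumed to be defined elsewhere in a sparse-enabled build. */
proc_t *group_lookup_sparse(group_t *group, int peer);
#endif

/* Callers always use this function; the #if lives in exactly one place. */
static inline proc_t *group_lookup(group_t *group, int peer)
{
#if GROUP_SPARSE
    if (!GROUP_IS_DENSE(group)) {
        return group_lookup_sparse(group, peer);
    }
#endif
    return group->procs[peer];
}
```

In the default build the function body reduces to a single array index, so the dense-only critical path carries no extra branch.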
Re: [OMPI devel] [RFC] Sparse group implementation
Mohamad - A couple of comments / questions: 1) Why do you need the #if OMPI_GROUP_SPARSE in communicator/comm.c? That seems like part of the API that should under no conditions change based on sparse/not sparse 2) I think it would be better to always have the flags and macros available (like OMPI_GROUP_SPORADIC and OMPI_GROUP_IS_SPORADIC) even when sparse groups are disabled. They dont' take up any space, and mean less #ifs in the general code base 3) Instead of the GROU_GET_PROC_POINTER macro, why not just change the behavior of the ompi_group_peer_lookup() function, so that there is symmetry with how you get a proc from a communicator? static inline functions (especially short ones like that) are basically free. We'll still have to fix all the places in the code where the macro is used or people poke directly at the group structure, but I like static inline over macros whenever possible. So much easier t debug. Other than that, I think you've got my concerns pretty much addressed. Brian On Jul 25, 2007, at 8:45 PM, Mohamad Chaarawi wrote: In the current code, almost all #ifs are due to the fact that we don't want to add the extra memory by the sparse parameters that are added to the group structure. The additional parameters are 5 pointers and 3 integers. If nobody objects, i would actually keep those extra parameters, even if sparse groups are disabled (in the default case on configure), because it would reduce the number of #ifs in the code to only 2 (as i recall that i had it before) .. Thank you, Mohamad On Wed, July 25, 2007 4:23 pm, Brian Barrett wrote: On Jul 25, 2007, at 3:14 PM, Jeff Squyres wrote: On Jul 25, 2007, at 5:07 PM, Brian Barrett wrote: It just adds a lot of #if's throughout the code. Other than that, there's no reason to remove it. I agree, lots of #ifs are bad. But I guess I don't see the problem. The only real important thing people were directly accessing in the ompi_group_t is the array of proc pointers. 
Indexing into them could be done with a static inline function that just has slightly different time complexity based on compile options. Static inline function that just does an index in the group proc pointer would have almost no added overhead (none if the compiler doesn't suck). Ya, that's what I proposed. :-) But I did also propose removing the extra #if's so that the sparse group code would be available and we'd add an extra "if" in the critical code path. But we can do it this way instead: Still use the MACRO to access proc_t's. In the --disable-sparse-groups scenario, have it map to comm.group.proc[i]. In the --enable-sparse-groups scenario, have it like I listed in the original proposal: static inline ompi_proc_t lookup_group(ompi_group_t *group, int index) { if (group_is_dense(group)) { return group->procs[index]; } else { return sparse_group_lookup(group, index); } } With: a) groups are always dense if --enable and an MCA parameter turns off sparse groups, or b) there's an added check in the inline function for whether the MCA parameter is on I'm personally in favor of a) because it means only one conditional in the critical path. I don't really care about the sparse groups turned on case. I just want minimal #ifs in the global code and to not have an if() { ... } in the critical path when sparse groups are disabled :). Brian -- Mohamad Chaarawi Instructional Assistant http://www.cs.uh.edu/~mschaara Department of Computer Science, University of Houston, 4800 Calhoun, PGH Room 526, Houston, TX 77204, USA
Re: [OMPI devel] [RFC] Sparse group implementation
On Jul 25, 2007, at 3:14 PM, Jeff Squyres wrote: On Jul 25, 2007, at 5:07 PM, Brian Barrett wrote: It just adds a lot of #if's throughout the code. Other than that, there's no reason to remove it. I agree, lots of #ifs are bad. But I guess I don't see the problem. The only real important thing people were directly accessing in the ompi_group_t is the array of proc pointers. Indexing into them could be done with a static inline function that just has slightly different time complexity based on compile options. Static inline function that just does an index in the group proc pointer would have almost no added overhead (none if the compiler doesn't suck). Ya, that's what I proposed. :-) But I did also propose removing the extra #if's so that the sparse group code would be available and we'd add an extra "if" in the critical code path. But we can do it this way instead: Still use the MACRO to access proc_t's. In the --disable-sparse-groups scenario, have it map to comm.group.proc[i]. In the --enable-sparse-groups scenario, have it like I listed in the original proposal: static inline ompi_proc_t lookup_group(ompi_group_t *group, int index) { if (group_is_dense(group)) { return group->procs[index]; } else { return sparse_group_lookup(group, index); } } With: a) groups are always dense if --enable and an MCA parameter turns off sparse groups, or b) there's an added check in the inline function for whether the MCA parameter is on I'm personally in favor of a) because it means only one conditional in the critical path. I don't really care about the sparse groups turned on case. I just want minimal #ifs in the global code and to not have an if() { ... } in the critical path when sparse groups are disabled :). Brian
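For readers who want to see the run-time variant end to end, here is a small sketch of the quoted lookup_group with one possible sparse_group_lookup behind it. The strided child representation (child index i maps to parent index first + i * stride) is an assumption made for this example, as are the struct layout and the dense flag; it returns a pointer, and it is not the U. Houston implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct proc { int rank; } proc_t;

/* Hypothetical group: either dense (owns a proc table) or a strided
 * view onto its parent. */
typedef struct group {
    bool          dense;
    proc_t      **procs;  /* valid only when dense */
    struct group *parent; /* valid only when sparse */
    int           first;  /* child index i maps to parent index */
    int           stride; /*   first + i * stride               */
} group_t;

/* Walk up through parents, translating the index at every level,
 * until a dense group is reached. Cost grows with derivation depth. */
static proc_t *sparse_group_lookup(group_t *group, int index)
{
    while (!group->dense) {
        index = group->first + index * group->stride;
        group = group->parent;
    }
    return group->procs[index];
}

/* The single "if" on the critical path discussed above. */
static inline proc_t *lookup_group(group_t *group, int index)
{
    if (group->dense) {
        return group->procs[index];
    }
    return sparse_group_lookup(group, index);
}
```

A dense group pays only for the one conditional; a group derived twice from a dense parent pays for two index translations before the final array access.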
Re: [OMPI devel] [RFC] Sparse group implementation
On Jul 25, 2007, at 2:56 PM, Jeff Squyres wrote: On Jul 25, 2007, at 10:39 AM, Brian Barrett wrote: I have an even bigger objection than Rich. It's near impossible to measure the latency impact of something like this, but it does have an additive effect. It doesn't make sense to have all that code in the critical path for systems where it's not needed. We should leave the compile time decision available, unless there's a very good reason (which I did not see in this e-mail) to remove it. It just adds a lot of #if's throughout the code. Other than that, there's no reason to remove it. I agree, lots of #ifs are bad. But I guess I don't see the problem. The only real important thing people were directly accessing in the ompi_group_t is the array of proc pointers. Indexing into them could be done with a static inline function that just has slightly different time complexity based on compile options. Static inline function that just does an index in the group proc pointer would have almost no added overhead (none if the compiler doesn't suck). Brian
Re: [OMPI devel] [RFC] Sparse group implementation
I have an even bigger objection than Rich. It's near impossible to measure the latency impact of something like this, but it does have an additive effect. It doesn't make sense to have all that code in the critical path for systems where it's not needed. We should leave the compile time decision available, unless there's a very good reason (which I did not see in this e-mail) to remove it. Brian On Jul 25, 2007, at 8:00 AM, Richard Graham wrote: This is good work, so I am happy to see it come over. My initial understanding was that there would be compile time protection for this. In the absence of this, I think we need to see performance data on a variety of communication substrates. It seems like a latency measurement is, perhaps, the most sensitive measurement, and should be sufficient to see the impact on the critical path. Rich On 7/25/07 9:04 AM, "Jeff Squyres"wrote: WHAT:Merge the sparse groups work to the trunk; get the community's opinion on one remaining issue. WHY: For large MPI jobs, it can be memory-prohibitive to fully represent dense groups; you can save a lot of space by having "sparse" representations of groups that are (for example) derived from MPI_COMM_WORLD. WHERE: Main changes are (might have missed a few in this analysis, but this is 99% of it): - Big changes in ompi/group - Moderate changes in ompi/comm - Trivial changes in ompi/mpi/c, ompi/mca/pml/[dr|ob1], ompi/mca/comm/sm WHEN:The code is ready now in /tmp/sparse-groups (it is passing all Intel and IBM tests; see below). TIMEOUT: We'll merge all the work to the trunk and enable the possibility of using sparse groups (dense will still be the default, of course) if no one objects by COB Tuesday, 31 Aug 2007. = === === The sparse groups work from U. Houston is ready to be brought into the trunk. It is built on the premise that for very large MPI jobs, you don't want to fully represent MPI groups in memory if you don't have to. 
Specifically, you can save memory for communicators/groups that are derived from MPI_COMM_WORLD by representing them in a sparse storage format. The sparse groups work introduces 3 new ompi_group_t storage formats: * dense (i.e., what it is today -- an array of ompi_proc_t pointers) * sparse, where the current group's contents are based on the group from which the child was derived: 1. range: a series of (offset,length) tuples 2. stride: a single (first,stride,last) tuple 3. bitmap: a bitmap Currently, all the sparse groups code must be enabled by configuring with --enable-sparse-groups. If sparse groups are enabled, each MPI group that is created will automatically use the storage format that takes the least amount of space. The Big Issue with the sparse groups is that getting a pointer to an ompi_proc_t may no longer be an O(1) operation -- you can't just access it via comm->group->procs[i]. Instead, you have to call a macro. If sparse groups are enabled, this will call a function to do the resolution and return the proc pointer. If sparse groups are not enabled, the macro currently resolves to group->procs[i]. When sparse groups are enabled, looking up a proc pointer is an iterative process; you have to traverse up through one or more parent groups until you reach a "dense" group to get the pointer. So the time to lookup the proc pointer (essentially) depends on the group and how many times it has been derived from a parent group (there are corner cases where the lookup time is shorter). Lookup times in MPI_COMM_WORLD are O(1) because it is dense, but it now requires an inline function call rather than directly accessing the data structure (see below). Note that the code in /tmp/sparse-groups is currently out-of-date with respect to the orte and opal trees due to SVN merge mistakes and problems. Testing has occurred by copying full orte/opal branches from a trunk checkout into the sparse group tree, so we're confident that it's compatible with the trunk. 
Full integration will occur before commiting to the trunk, of course. The proposal we have for the community is as follows: 1. Remove the --enable-sparse-groups configure option 2. Default to use only dense groups (i.e., same as today) 3. If the new MCA parameter "mpi_use_sparse_groups" is enabled, enable the use of sparse groups 4. Eliminate the current macro used for group proc lookups and instead use an inline function of the form: static inline ompi_proc_t lookup_group(ompi_group_t *group, int index) { if (group_is_dense(group)) { return group->procs[index]; } else { return sparse_group_lookup(group, index); } } *** NOTE: This design adds a single "if" in some
Re: [OMPI devel] Fwd: [Open MPI] #1101: MPI_ALLOC_MEM with 0 size must be valid
On Jul 24, 2007, at 8:28 AM, Gleb Natapov wrote:

On Tue, Jul 24, 2007 at 11:20:11AM -0300, Lisandro Dalcin wrote:

On 7/23/07, Jeff Squyres wrote:

Does anyone have any opinions on this? If not, I'll go implement option #1.

Sorry, Jeff... just reading this. I think your option #1 is the better one. However, I want to warn you about two issues:

* In my Linux FC6 box, malloc(0) returns different pointers for each call. In fact, I believe this is a requirement for malloc; in this case, man malloc tells me this: "If size was equal to 0, either NULL or a pointer suitable to be passed to free() is returned".

So maybe we should just return NULL and be done with it? Which is also what POSIX says: http://www.opengroup.org/onlinepubs/009695399/functions/malloc.html

I vote with Gleb -- return NULL, don't set errno, and be done with it. The way I read the advice to implementors, this would be a legal response to a 0 byte request.

Brian
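The behaviour agreed on above can be sketched in plain C. This is only an illustration under stated assumptions: alloc_mem_sketch and free_mem_sketch are hypothetical helpers, not Open MPI's MPI_ALLOC_MEM implementation, and the sketch assumes the underlying allocator is malloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not Open MPI code): treat a zero-byte request as
 * a success that yields NULL, without calling malloc() or touching errno.
 * Returns 0 on success and -1 if a non-zero request cannot be met. */
static int alloc_mem_sketch(size_t size, void **baseptr)
{
    assert(NULL != baseptr);
    if (0 == size) {
        *baseptr = NULL;   /* legal answer to a zero-byte request */
        return 0;
    }
    *baseptr = malloc(size);
    return (NULL == *baseptr) ? -1 : 0;
}

/* Matching release; free(NULL) is a no-op in standard C, so the
 * zero-byte case needs no special handling here. */
static void free_mem_sketch(void *base)
{
    free(base);
}
```

With this shape, a caller can pass the pointer from a zero-byte request straight back to the release function without checking the size again.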
Re: [OMPI devel] [OMPI svn] svn:open-mpi r15492
Sigh. Thanks. Should probably have tested that code ;). And the solaris code. and the windows code.

Brian

On Jul 19, 2007, at 7:37 AM, Jeff Squyres wrote:

Thanks! Ralph got it this morning in https://svn.open-mpi.org/trac/ompi/changeset/15501.

On Jul 19, 2007, at 5:34 AM, Bert Wesarg wrote:

Hello,

Author: brbarret
Date: 2007-07-18 16:23:45 EDT (Wed, 18 Jul 2007)
New Revision: 15492
URL: https://svn.open-mpi.org/trac/ompi/changeset/15492

Log:
add ability to have thread-specific data on windows, pthreads, solaris threads, and non-threaded builds

+int
+opal_tsd_key_create(opal_tsd_key_t *key,
+                    opal_tsd_destructor_t destructor)
+{
+    int i;
+
+    if (!atexit_registered) {
+        atexit_registered = true;
+        if (0 != atexit(run_destructors)) {
+            return OPAL_ERR_TEMP_OUT_OF_RESOURCE;
+        }
+    }
+
+    for (i = 0 ; i < TSD_ENTRIES ; ++i) {
+        if (entries[i].used == false) {
+            entries[i].used = true;
+            entries[i].value = NULL;
+            entries[i].destructor = destructor;
+            *key = i; break;
+        }
+    }
+    if (i == TSD_ENTRIES) return ENOMEM;
+
+    return OPAL_SUCCESS;
+}

Bert

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
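In a slot search like the one quoted, the loop has to stop at the first free entry, so that the index only reaches the table size when nothing was free. A minimal stand-alone sketch of that table-based scheme follows. The sketch_ names and the table size are assumptions made for illustration; this is not the code committed in r15492 or r15501.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_TSD_ENTRIES 4   /* assumed table size, for illustration only */

typedef void (*sketch_destructor_t)(void *);

struct sketch_entry {
    bool used;
    void *value;
    sketch_destructor_t destructor;
};

/* Static storage is zero-initialized, so every slot starts out unused. */
static struct sketch_entry sketch_entries[SKETCH_TSD_ENTRIES];

/* Claim the first free slot and hand its index back through *key.
 * Returns 0 on success and -1 when the table is full. */
static int sketch_key_create(int *key, sketch_destructor_t destructor)
{
    int i;

    assert(NULL != key);
    for (i = 0; i < SKETCH_TSD_ENTRIES; ++i) {
        if (!sketch_entries[i].used) {
            sketch_entries[i].used = true;
            sketch_entries[i].value = NULL;
            sketch_entries[i].destructor = destructor;
            *key = i;
            break;   /* stop at the first free slot */
        }
    }
    /* i equals the table size only if the loop never hit the break */
    return (SKETCH_TSD_ENTRIES == i) ? -1 : 0;
}
```

Calling sketch_key_create() repeatedly hands out keys 0, 1, 2, 3 in order and then reports that the table is full.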
[OMPI devel] RML/OOB change heads up
Hey all - Thought I would give you guys a heads up on some code that will be coming into the trunk in the not too distant future (hopefully tomorrow?). The changes revolve around the RML/OOB interface and include: * General TCP cleanup for OPAL / ORTE * Simplifying the OOB by moving much of the logic into the RML * Allowing the OOB RML component to do routing of messages * Adding a component framework for handling routing tables * Moving the xcast functionality from the OOB base to its own framework The IPv6 code did some things that I (and I know George) didn't like. Some functions had their interface change depending on whether IPv6 support was enabled (taking either an sockaddr_in or sockaddr_in6 instead of just a sockaddr) and we were inconsistent about storage sizes. I've added a bunch of compatibility code to opal_config_bottom.h so that we can always have sockaddr_storage and some of the required IPv6 defines, which drastically simplified the IPv6 code in the TCP OOB. Previously, the OOB and RML component interfaces were essentially equivalent. This isn't surprising, as the RML was added at the last minute as a wrapper around the OOB as a forward looking way of solving multi-cell architectures. The interface into the OOB was also strange, requiring the upper layer (the RML) to call base functions that did a bit of work, then called the component. With this change, all the base code has been moved into the RML, and the OOB interface has been simplified by removing all the blocking and dss buffer communication. The RML now handles the implementation of blocking sends and dss buffer communication. This not only greatly simplifies writing an OOB component, but removes the base code in the oob, which was causing problems as it implied that there was one and only one oob component active at a time, which some people are apparently trying to break (by having multiple OOB components alive). 
The OOB RML can now also route messages, using a new framework (the routed framework) for determining how a message should be routed. Currently, only direct routing is supported, although that will change in the near future. The not-so-long term goal is to allow MPI processes to talk to each other and to the HNP through their local daemon, rather than directly. This will drastically reduce the number of sockets open in the system, which can only help with the speed thing. Finally, we moved the xcast functionality out of the OOB base and into its own framework. It really didn't make sense to have it in the OOB base, as it didn't do anything OOB specific and just utilized the RML to move data around. By moving it to its own framework, we can more easily experiment with new xcast protocols (using the component infrastructure, rather than the games Ralph currently has to play using MCA parameters and if statements). It also makes a clearer distinction as to which components are responsible for which functionality. Anyway, that's where we're at. You can take a look at the code in the temporary branch bwb-oob-rml-cleanup, although it currently does not work for singletons due to some merge conflicts from last night. This will be resolved before the merge back into the trunk, obviously. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
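To make the routing idea concrete, here is a rough sketch of what "ask a routing component for the next hop" could look like. Everything in it is hypothetical: the sketch_name_t type, the function-pointer signature, and both policies are invented for illustration and are not the interface of the routed framework in the bwb-oob-rml-cleanup branch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical process name; the real ORTE name type is not shown here. */
typedef struct {
    int jobid;
    int vpid;
} sketch_name_t;

/* Hypothetical routing hook: given the local daemon and the final
 * destination, return the process the message should be sent to next. */
typedef sketch_name_t (*sketch_get_route_fn)(const sketch_name_t *local_daemon,
                                             const sketch_name_t *dest);

/* Direct routing: the next hop is always the final destination. */
static sketch_name_t sketch_route_direct(const sketch_name_t *local_daemon,
                                         const sketch_name_t *dest)
{
    (void) local_daemon;
    assert(NULL != dest);
    return *dest;
}

/* Daemon routing: every message is first handed to the local daemon,
 * so a process only needs one open connection. */
static sketch_name_t sketch_route_via_daemon(const sketch_name_t *local_daemon,
                                             const sketch_name_t *dest)
{
    (void) dest;
    assert(NULL != local_daemon);
    return *local_daemon;
}
```

Keeping the policy behind a function pointer is what lets the sender stay unchanged when a different routing scheme is selected.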
Re: [OMPI devel] Build failures of 1.2.3 on Debian hppa, mips, mipsel, s390, m68k
the availability of functionality is set by the header files for each platform, not by configure. So we'd have to play some games to get at the information, but it should be possible. Brian On Jul 14, 2007, at 12:41 PM, George Bosilca wrote: Brian, We should be able to use these defines in the configure.m4 files for each component right ? I think the asm section is detected before we go in the component configuration. So far we know about the following components that have to disable themselves if no atomic or memory barrier is detected: - MPOOL: sm - BTL: sm, openib (completely or partially?) Anybody knows about any other components with atomic requirements ? george. On Jul 14, 2007, at 1:59 PM, Brian Barrett wrote: On Jul 14, 2007, at 11:51 AM, Gleb Natapov wrote: On Sat, Jul 14, 2007 at 01:16:42PM -0400, George Bosilca wrote: Instead of failing at configure time, we might want to disable the threading features and the shared memory device if we detect that we don't have support for atomics on a specified platform. In a non threaded build, the shared memory device is the only place where we need support for memory barrier. I'll look in the code to see why we need support for compare-and-swap on a non threaded build. Proper memory barrier is also needed for openib BTL eager RDMA support. Removed all the platform lists, since they won't care about this part :). Ah, true. The eager RDMA code should check that the preprocessor symbol OPAL_HAVE_ATOMIC_MEM_BARRIER is 1 and disable itself if that isn't the case. All the "sections" of ASM support (memory barriers, locks, compare-and-swap, and atomic math) have preprocessor symbols indicating whether support exists or not in the current build. These should really be used :). Brian ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Build failures of 1.2.3 on Debian hppa, mips, mipsel, s390, m68k
On Jul 14, 2007, at 11:51 AM, Gleb Natapov wrote: On Sat, Jul 14, 2007 at 01:16:42PM -0400, George Bosilca wrote: Instead of failing at configure time, we might want to disable the threading features and the shared memory device if we detect that we don't have support for atomics on a specified platform. In a non threaded build, the shared memory device is the only place where we need support for memory barrier. I'll look in the code to see why we need support for compare-and-swap on a non threaded build. Proper memory barrier is also needed for openib BTL eager RDMA support. Removed all the platform lists, since they won't care about this part :). Ah, true. The eager RDMA code should check that the preprocessor symbol OPAL_HAVE_ATOMIC_MEM_BARRIER is 1 and disable itself if that isn't the case. All the "sections" of ASM support (memory barriers, locks, compare-and-swap, and atomic math) have preprocessor symbols indicating whether support exists or not in the current build. These should really be used :). Brian
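The check described above might look roughly like the following. OPAL_HAVE_ATOMIC_MEM_BARRIER is the symbol named in the message; the fallback definition and the sketch_use_eager_rdma helper are assumptions added so the fragment compiles on its own, and they are not taken from the openib BTL.

```c
#include <assert.h>

/* Stand-in so the sketch builds outside the Open MPI tree; in a real
 * build the symbol would come from the platform's atomic headers. */
#ifndef OPAL_HAVE_ATOMIC_MEM_BARRIER
#define OPAL_HAVE_ATOMIC_MEM_BARRIER 0
#endif

/* Hypothetical helper: decide whether eager RDMA may be used.
 * 'requested' is the configured setting (1 = enable, 0 = disable). */
static int sketch_use_eager_rdma(int requested)
{
    assert(0 == requested || 1 == requested);
#if OPAL_HAVE_ATOMIC_MEM_BARRIER
    return requested;   /* barriers exist, so honor the setting */
#else
    (void) requested;
    return 0;           /* no memory barrier support: disable the feature */
#endif
}
```

The same pattern would apply to the other "sections" of ASM support mentioned above, each guarded by its own symbol.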
Re: [OMPI devel] Build failures of 1.2.3 on Debian hppa, mips, mipsel, s390, m68k
On Jul 14, 2007, at 10:53 AM, Dirk Eddelbuettel wrote:

Methinks we need to fill in a few blanks here, or make do with non-asm solutions. I don't know the problem space that well (being a maintainer rather than upstream developer) and am looking for guidance.

Either way is an option. There are really only a couple of functions that have to be implemented:

* atomic word-size compare and swap
* memory barrier

We'll emulate atomic adds and spin-locks with compare and swap if not directly implemented. The memory barrier functions have to exist, even if they don't do anything. We require compare-and-swap for a couple of pieces of code, which is why we lost our Sparc v8 support a couple of releases ago.

For what it's worth, lam (7.1.2, currently) is available on all build architectures for Debian, but it may not push the (hardware) envelope as hard.

Correct, LAM only had very limited ASM requirements (basically, memory barrier on platforms that required it -- like PowerPC).

Brian
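As an illustration of how atomic add and a spin-lock can be layered on compare-and-swap alone, here is a small sketch. It uses C11 <stdatomic.h> as a portable stand-in for a platform CAS primitive; Open MPI at the time used its own per-platform assembly, and the sketch_ names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Atomic add built only from compare-and-swap; returns the new value. */
static int32_t sketch_add32(_Atomic int32_t *addr, int32_t delta)
{
    int32_t old = atomic_load(addr);

    /* If another thread changed *addr, the CAS fails, 'old' is refreshed
     * with the current value, and the add is retried. */
    while (!atomic_compare_exchange_weak(addr, &old, old + delta)) {
        /* retry */
    }
    return old + delta;
}

/* Spin-lock state: 0 = free, 1 = held. */
typedef struct {
    _Atomic int32_t held;
} sketch_lock_t;

/* One CAS attempt to move the lock from free to held; 1 on success. */
static int sketch_trylock(sketch_lock_t *lock)
{
    int32_t expected = 0;
    return atomic_compare_exchange_strong(&lock->held, &expected, 1);
}

static void sketch_lock(sketch_lock_t *lock)
{
    while (!sketch_trylock(lock)) {
        /* spin until the holder releases the lock */
    }
}

static void sketch_unlock(sketch_lock_t *lock)
{
    assert(1 == atomic_load(&lock->held));
    atomic_store(&lock->held, 0);
}
```

The default C11 operations are sequentially consistent, so the sketch does not need separate memory barrier calls; a hand-written CAS on a weakly ordered platform would.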
Re: [OMPI devel] Build failures of 1.2.3 on Debian hppa, mips, mipsel, s390, m68k
On Jul 14, 2007, at 8:26 AM, Dirk Eddelbuettel wrote:

Please let us (ie Debian's openmpi maintainers) know how else we can help. I am ccing the porters lists (for hppa, m68k, mips) too to invite them to help. I hope that doesn't get the spam filters going... I may contact the 'arm' porters once we have a failure; s390 and sparc activity are not as big these days.

Open MPI uses some assembly for things like atomic locks, atomic compare and swap, memory barriers, and the like. We currently have support for:

* x86 (32 bit)
* x86_64 / amd64 (32 or 64 bit)
* UltraSparc (v8plus and v9* targets)
* IA64
* PowerPC (32 or 64 bit)

We also have code for:

* Alpha
* MIPS (32 bit NEW ABI & 64 bit)

This support hasn't been well tested in a while and it sounds like it doesn't work for MIPS. At one time, we supported the sparc v8 target, but that

The other platforms (hppa, mipsel (how is this different than MIPS?), s390, m68k) aren't at all supported by Open MPI. If you can get the real error messages, I can help on the MIPS issue, although it'll have to be a low priority.

We don't currently have support for a non-assembly code path. We originally planned on having one, but the team went away from that route over time and there's no way to build Open MPI without assembly support right now.

Brian

--
Brian W. Barrett
Networking Team, CCS-1
Los Alamos National Laboratory
Re: [OMPI devel] Notes on building and running Open MPI on Red Storm
Do you have a Subversion account? If so, feel free to update the wiki ;). If not, we should probably get you an account. Then feel free to update the wiki ;). But thanks for the notes!

Brian

On Jul 11, 2007, at 4:47 PM, Glendenning, Lisa wrote:

Some supplementary information to the wiki at https://svn.open-mpi.org/trac/ompi/wiki/CrayXT3.

I. Accessing the Open MPI source:

* Subversion is installed on redstorm in /projects/unsupported/bin
* Reddish has subversion in the default path (you don't have to load a module)
* The proxy information in the wiki is accurate, and works on both redstorm and reddish

II. Building Open MPI on reddish:

* 'module load PrgEnv-pgi-xc' to cross compile for redstorm
* Reddish and redstorm do not have recent enough version of autotools, so you must build your own (currently available in /project/openmpi/rbbrigh/tools)
* Suggested configuration: 'configure CC=qk-gcc CXX=qk-pgCC F77=qk-pgf77 FC=qk-pgf90 --disable-mpi-profile --with-platform=redstorm --host=x86_64-cray-linux-gnu --build=x86_64-unknown-linux-gnu --disable-mpi-f90 --disable-mem-debug --disable-mem-profile --disable-debug build_alias=x86_64-unknown-linux-gnu host_alias=x86_64-cray-linux-gnu'

III. Building with Open MPI:

* No executables will be installed in $PREFIX/bin, so must compile without mpicc, e.g. 'qk-gcc -I$PREFIX/include *.c -L$PREFIX/lib -lmpi -lopen-rte -lopen-pal'
* When linking with libopen-pal, the following warning is normal: 'In function `checkpoint_response': warning: mkfifo is not implemented and will always fail'

IV. Running on Redstorm:

* scp your executable from reddish to redstorm
* Use 'qsub' to submit job and 'yod' to launch job (if you do an interactive session, you can bypass PBS)
* qsub expects project/task information - you can either provide this with -A option or set it in $XT_ACCOUNT environmental variable
* Sample job script for qsub with 8 nodes/processes and 10 minute runtime:

    #!/bin/sh
    #PBS -lsize=8
    #PBS -lwalltime=10:00
    cd $PBS_O_WORKDIR
    yod a.out

* Use 'xtshowmesh' and 'qstat' to query job status and cluster configuration
Re: [OMPI devel] Multi-environment builds
On Jul 10, 2007, at 7:09 AM, Tim Prins wrote:

Jeff Squyres wrote:

2. The "--enable-mca-no-build" option takes a comma-delimited list of components that will then not be built. Granted, this option isn't exactly intuitive, but it was the best that we could think of at the time to present a general solution for inhibiting the build of a selected list of components. Hence, "--enable-mca-no-build=pls-slurm,ras-slurm" would inhibit building the SLURM RAS and PLS components (note that the SLURM components currently do not require any additional libraries, so a) there is no corresponding --with[out]-slurm option, and b) they are usually always built).

Actually, there are --with-slurm/--without-slurm options. We default to building slurm support automatically on linux and aix, but not on other platforms.

On a mostly unrelated note... We should probably also now build the SLURM component for OS X, since SLURM is now available for OS X as well. And probably should also check for SLURM's srun and build if we find it even if we aren't on Linux, AIX, or OS X.

Brian
[OMPI devel] fake rdma flag again?
Hi all - I've finally committed a version of the rdma one-sided component that 1) works and 2) in certain situations actually does rdma. I'll make it the default when the BTLs are used as soon as one last bug is fixed in the DDT engine. However, there is still one outstanding issue. Some BTLs (like Portals or MX) advertise the ability to do a put but place restrictions on the put that only work for OB1. For example, both can only do an RDMA that starts where the prepare_dst() / prepare_src () call said the target buffer was. This isn't a problem for OB1, but kind of defeats the purpose of one-sided ;). There's also a reference count (I believe) in the Portals put/get code that would make life interesting if a descriptor was doing multiple RDMA ops at once. I was thinking that the easy way to solve this was to add a flag (FAKE_RDMA was the current running favorite, since we've used it before for different meaning :) ) to the components that have behaviors that work for OB1, but not a generalized rdma interface. I was wondering what people thought of this idea and if they had any preference for naming the flag. Brian
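For discussion purposes, the proposed flag could be consumed along these lines. The flag names and bit values below are placeholders invented for this sketch (only the name FAKE_RDMA comes from the message above); the real BTL flag definitions may look different.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define SKETCH_BTL_FLAG_PUT        0x1u
#define SKETCH_BTL_FLAG_GET        0x2u
#define SKETCH_BTL_FLAG_FAKE_RDMA  0x4u  /* put/get only work the way OB1 drives them */

#define SKETCH_BTL_FLAG_ALL \
    (SKETCH_BTL_FLAG_PUT | SKETCH_BTL_FLAG_GET | SKETCH_BTL_FLAG_FAKE_RDMA)

/* Returns 1 if a BTL advertising 'flags' could back a generalized
 * one-sided put, 0 if the component should fall back to send/receive. */
static int sketch_can_do_onesided_put(uint32_t flags)
{
    assert(0 == (flags & ~SKETCH_BTL_FLAG_ALL));
    if (0 == (flags & SKETCH_BTL_FLAG_PUT)) {
        return 0;   /* no put at all */
    }
    if (0 != (flags & SKETCH_BTL_FLAG_FAKE_RDMA)) {
        return 0;   /* put exists, but only with OB1-style restrictions */
    }
    return 1;
}
```

The point of a separate bit is that OB1 can keep ignoring it, while the one-sided component checks it once at BTL selection time.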
Re: [OMPI devel] One-sided operations with Portals
On Jul 5, 2007, at 4:16 PM, Glendenning, Lisa wrote: Ron Brightwell at SNL has asked me to look into optimizing Open MPI's one-sided operations over Portals. Does anyone have any guidance or thoughts for this? Hi Lisa - There are currently two implementations of the one-sided interface for Open MPI: pt2pt and rdma. The pt2pt component is implemented entirely over the interfaces used to implement the MPI-1 point-to-point interface. So it ends up doing lots of copies and is entirely two-sided. It could support async progress with threads, but that doesn't help the XT platform all that much. It was the first one-sided component implemented, mostly because we needed to support protocols like MX and PSM that don't really expose one-sided semantics, and I only wanted to support one new component per release. The rdma component is implemented over our BTL (byte transport layer -- the device driver our communication is written over), and can either use call-back based send/receive or true rdma. The true rdma is only for put/get for contiguous datatypes. The performance on OpenIB is ok, but not great (I'll send you some more details off list). I'd assume that the performance on Portals would be similar. However, the btl_put and btl_get implementation for the Portals BTL was implemented assuming it would only be used the way the PML (the MPI-1 point-to-point implementation) used it. It won't work with the rdma one-sided component at this time. I can go into more details if you decide that fixing the Portals BTL to support the rdma component is a path you want to look at. Then, of course, there's the option of writing a Portals-specific one- sided component. The component interface is pretty straight-forward -- it's the MPI-2 one-sided chapter interface functions, plus an initialization function. This is the path towards best performance, but also means the most code to write. 
The existing code in Open MPI handles the attribute management, but that's about it if you go this route. Of course, you can always copy freely from the rdma and pt2pt components. There used to be a document somewhere describing how to add a new component, but I think it is horribly out of date. I'll see if I can find it and send it your way. Of course, the first starting point is to get a checkout of the code and get it built. There are instructions for getting an SVN checkout of Open MPI (and how to get it built from there) available on the web page: http://www.open-mpi.org/svn/ Building on the XT platform (if you're going that route) is slightly more complicated, and you probably want to take a look at the horribly out of date wiki page on the subject here: https://svn.open-mpi.org/trac/ompi/wiki/CrayXT3 Hopefully, that's enough to get you started. If you have any questions, ask away. Brian -- Brian W. Barrett Networking Team, CCS-1 Los Alamos National Laboratory
Re: [OMPI devel] Modex
The function name changes are pretty obvious (s/mca_pml_base/ompi/), and I thought I'd try something new and actually document the interface in the header file :). So we should be good on that front.

Brian

On Jun 27, 2007, at 6:38 AM, Terry D. Dontje wrote:

I am ok with the following as long as we can have some sort of documentation describing what changed like which old functions are replaced with newer functions and any description of changed assumptions.

--td

Brian Barrett wrote:

On Jun 26, 2007, at 6:08 PM, Tim Prins wrote:

Some time ago you were working on moving the modex out of the pml and cleaning it up a bit. Is this work still ongoing? The reason I ask is that I am currently working on integrating the RSL, and would rather build on the new code rather than the old...

Tim Prins brings up a point I keep meaning to ask the group about. A long time ago in a galaxy far, far away (aka, last fall), Galen and I started working on the BTL / PML redesign that morphed into some smaller changes, including some interesting IB work. Anyway, I rewrote large chunks of the modex, which did a couple of things:

* Moved the modex out of the pml base and into the general OMPI code (renaming the functions in the process)
* Fixed the hang if a btl doesn't publish contact information (we wait until we receive a key pushed into the modex at the end of MPI_INIT)
* Tried to reduce the number of required memory copies in the interface

It's a fairly big change, in that all the BTLs have to be updated due to the function name differences. It's fairly well tested, and would be really nice for dealing with platforms where there are different networks on different machines. If no one has any objections, I'll probably do this next week...

Brian
Re: [OMPI devel] Modex
On Jun 26, 2007, at 6:08 PM, Tim Prins wrote: Some time ago you were working on moving the modex out of the pml and cleaning it up a bit. Is this work still ongoing? The reason I ask is that I am currently working on integrating the RSL, and would rather build on the new code rather than the old... Tim Prins brings up a point I keep meaning to ask the group about. A long time ago in a galaxy far, far away (aka, last fall), Galen and I started working on the BTL / PML redesign that morphed into some smaller changes, including some interesting IB work. Anyway, I rewrote large chunks of the modex, which did a couple of things: * Moved the modex out of the pml base and into the general OMPI code (renaming the functions in the process) * Fixed the hang if a btl doesn't publish contact information (we wait until we receive a key pushed into the modex at the end of MPI_INIT) * Tried to reduce the number of required memory copies in the interface It's a fairly big change, in that all the BTLs have to be updated due to the function name differences. It's fairly well tested, and would be really nice for dealing with platforms where there are different networks on different machines. If no one has any objections, I'll probably do this next week... Brian
Re: [OMPI devel] Patch to fix cross-compile failure
Argonne used AC_TRY_RUN instead of AC_TRY_COMPILE (I believe) because there are some places where aio functions behaved badly (returned ENOTIMPL or something). I was going to make it call AC_TRY_RUN if we weren't cross-compiling and AC_TRY_COMPILE if we were. I'll commit something this evening. Brian On Jun 11, 2007, at 6:07 AM, Jeff Squyres wrote: Paul -- Excellent; many thanks! Brian: this patch looks good to me, but I defer to the unOfficial OMPI ROMIO Maintainer (uOORM)... On Jun 8, 2007, at 3:33 PM, Paul Henning wrote: I've attached a patch relative to the revision 14962 version of ompi/mca/io/romio/romio/configure.in that fixes configuration errors when doing a cross-compile. It simply changes some tests for the number of parameters to aio_suspend and aio_write from AC_TRY_RUN to AC_TRY_COMPILE. Paul ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14768
I'm available this afternoon... Brian On Jun 7, 2007, at 12:39 PM, George Bosilca wrote: I'm available this afternoon. george. On Jun 7, 2007, at 2:35 PM, Galen Shipman wrote: Are people available today to discuss this over the phone? - Galen On Jun 7, 2007, at 11:28 AM, Gleb Natapov wrote: On Thu, Jun 07, 2007 at 11:11:12AM -0400, George Bosilca wrote: ) I expect you to revise the patch in order to propose a generic solution or I'll trigger a vote against the patch. I vote to be backed out of the trunk as it export way to much knowledge from the Open IB BTL into the PML layer. The patch solves real problem. If we want to back it out we need to find another solution. I also didn't like this change too much, but I thought about other solutions and haven't found something better that what Galen did. If you have something in mind lets discuss it. As a general comment this kind of discussion is why I prefer to send significant changes as a patch to the list for discussion before committing. george. PS: With Gleb changes the problem is the same. The following snippet reflect exactly the same behavior as the original patch. I didn't try to change the semantic. Just make the code to match the semantic that Galen described. -- Gleb. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] [OMPI svn] svn:open-mpi r14897
Yup, thanks. Brian On Jun 6, 2007, at 2:27 AM, Bert Wesarg wrote: +#ifdef HAVE_REGEXEC +args_count = opal_argv_count(options_data[i].compiler_args); +for (j = 0 ; j < args_count ; ++j) { +if (0 != regcomp(, options_data[i].compiler_args [j], REG_NOSUB)) { +return -1; +} + +if (0 == regexec(, arg, (size_t) 0, NULL, 0)) { missing regfree();? +return i; +} + +regfree(); +} +#else regards Bert ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
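The point about the missing regfree() can be sketched with the POSIX regex calls alone: a pattern compiled successfully with regcomp() should be released on every path, including the early return on a match. The function below is a hypothetical stand-alone version written for illustration, not the wrapper compiler code from r14897.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the first pattern that
 * matches 'arg', -1 if none matches, -2 if a pattern fails to compile. */
static int sketch_find_match(const char *const *patterns, int count,
                             const char *arg)
{
    regex_t re;
    int j;
    int matched;

    assert(NULL != patterns && NULL != arg);
    for (j = 0; j < count; ++j) {
        if (0 != regcomp(&re, patterns[j], REG_NOSUB)) {
            return -2;   /* compile failed, so there is nothing to free */
        }
        matched = (0 == regexec(&re, arg, (size_t) 0, NULL, 0));
        regfree(&re);    /* released on both the match and no-match paths */
        if (matched) {
            return j;
        }
    }
    return -1;
}
```

Recording the result first and freeing before the return keeps a single regfree() call per compiled pattern, instead of one per exit path.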
Re: [OMPI devel] undefined environ symbol on Darwin
On May 29, 2007, at 7:35 AM, Jack Howarth wrote: and if you see environ undefined, identify which library it is in and which object file it came from. I would also note that my patch reveals that several instances of the environ variable being declared that are missing the Windows wrappers. So if anything, adding the Darwin patch will increase the probability that both targets are properly maintained. Yes, there are significant portions of the code base that are "Unix- only" and not built on Windows. There are regular builds of Open MPI on Windows to ensure that problems are quickly resolved when they creep into the code base. The places where the Windows environ fixes are missing are likely that way because they are in parts of the code that doesn't build on Windows. As I've said, I'd be happy to commit a Mac OS X-specific fix for the environ problem if we can find a test case where it fails without the fix. I'm not going to commit portability fixes to Open MPI for a problem that we can't replicate. Based on what Peter said on the apple list, there is no problem with having an undefined symbol in a shared library (other than the fact that *that* shared library must be built with a flat namespace). I'm working with someone here to get ParaView built on my Mac so I can trace down the problem and figure out if Open MPI is responsible for the missing symbol. Brian
Re: [OMPI devel] undefined environ symbol on Darwin
On May 28, 2007, at 4:57 PM, Jack Howarth wrote:

I have been told that Paraview is one package that exhibits this problem with undefined environ symbols. This will occur in any package which creates its own shared libraries that link in any openmpi shared library that contains the undefined environ symbol. I think it is unreasonably restrictive to force all the application developers who use openmpi to avoid creating shared libs that use openmpi shared libraries. Again from the response on the Darwin mailing list this is expected behavior on Darwin. I will send two patches shortly that address this without needing to touch configure.

Have you tried it? I ask because I have. I created a shared library that called MPI_COMM_SPAWN (to make sure that it called a function that needed environ). Then created an application that called the function in that shared library. Both the new shared library and the application were able to link without problems.

Both the Fink page and the Apple list post indicate that there's a problem *creating* a shared library with undefined symbols. There appears to be no evidence to date that there's a problem creating a shared library that itself does not have undefined symbols but links to an application that does. Given that I was unable to make it fail, I question whether this is a problem.

I'm hesitant to make this change because these types of things are hard to maintain. Since we don't have a test case that fails, it's impossible to properly test. And since it's obscure and works in the common case, it's unlikely to be properly maintained over the long run. If an example where this fails is presented, I'm happy to make the changes. Until then, it just doesn't make sense. I'm not trying to be unreasonable, but I don't want to add unmaintainable code without at least a direct example of failure.

Brian
Re: [OMPI devel] [RFC] Send data from the end of a buffer during pipeline proto
On the other hand, since the MPI standard explicitly says you're not allowed to call fork() or system() during the MPI application and since the network should really cope with this in some way, if it further complicates the code *at all*, I'm strongly against it. Especially since it won't really solve the problem. For example, with one-sided, I'm not going to go out of my way to send the first and last bit of the buffer so the user can touch those pages while calling fork.

Also, if I understand the leave_pinned protocol, this still won't really solve anything for the general case -- leave pinned won't send any data eagerly if the buffer is already pinned, so there are still going to be situations where the user can cause problems. Now we have a situation where sometimes it works and sometimes it doesn't and we pretend to support fork()/system() in certain cases. Seems like actually fixing the problem the "right way" would be the right path forward...

Brian

On May 17, 2007, at 10:10 AM, Jeff Squyres wrote:

Moving to devel; this question seems worthwhile to push out to the general development community. I've been coming across an increasing number of customers and other random OMPI users who use system(). So if there's zero impact on performance and it doesn't make the code [more] incredibly horrible [than it already is], I'm in favor of this change.

On May 17, 2007, at 7:00 AM, Gleb Natapov wrote:

Hi, I thought about changing pipeline protocol to send data from the end of the message instead of the middle like it does now. The rationale behind this is better fork() support. When application forks, child doesn't inherit registered memory, so IB providers educate users to not touch buffers that were owned by the MPI before fork in a child process. The problem is that granularity of registration is HW page (4K), so last page of the buffer may contain also other application's data and user may be unaware of this and be very surprised by SIGSEGV.
If pipeline protocol will send data from the end of a buffer then the last page of the buffer will not be registered (and first page is never registered because we send beginning of the buffer eagerly with rendezvous packet) so this situation will be avoided. It should have zero impact on performance. What do you think? How common for MPI applications to fork()? -- Gleb. ___ devel-core mailing list devel-c...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel-core -- Jeff Squyres Cisco Systems ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
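The page-granularity concern in the quoted proposal comes down to simple address arithmetic. The sketch below computes how many bytes outside the buffer share its first and last 4K page; a non-zero result means unrelated data could sit on a registered page. The helper names are hypothetical and the 4096-byte page size is the figure assumed in the message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u   /* the 4K hardware page mentioned above */

/* Bytes on the buffer's first page that lie before the buffer starts. */
static size_t sketch_head_slack(uintptr_t start)
{
    return (size_t) (start % SKETCH_PAGE_SIZE);
}

/* Bytes on the buffer's last page that lie after the buffer ends. */
static size_t sketch_tail_slack(uintptr_t start, size_t len)
{
    uintptr_t end;
    size_t used;

    assert(len > 0);
    end = start + len;                          /* one past the last byte */
    used = (size_t) (end % SKETCH_PAGE_SIZE);   /* bytes used on the last page */
    return (0 == used) ? 0 : SKETCH_PAGE_SIZE - used;
}
```

For example, a 100-byte buffer starting on a page boundary leaves 3996 bytes of its page to whatever else the allocator places there, which is the situation the proposal is trying to keep unregistered.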
[OMPI devel] Autotools Upgrade Time
Hi all -

As was discussed on the telecon a couple of weeks ago, to try to lower the maintenance cost of the build system, starting this Saturday Autoconf 2.60 and Automake 1.10 will be required to successfully run autogen.sh on the trunk. As I mentioned in a previous e-mail, the required versions of the autotools will be:

            Autoconf     Automake      Libtool
    v1.1    2.57-2.59    1.9.6         1.5.22
    v1.2    2.57-new     1.9.6-new     1.5.22-new
    trunk   2.60-new     1.10.0-new    1.5.22-new

This means that there's no set of autotools that will be able to build all three versions of Open MPI, but since very few people currently spend time on v1.1, this should not present a major problem.

Brian
Re: [OMPI devel] Fancy ORTE/MPIRUN bugs
On Apr 19, 2007, at 8:38 AM, Aurelien Bouteiller wrote:

Hi, I am experiencing several fancy bugs with ORTE. All bugs occur on Intel 32 bits architecture under Mac OS X using gcc 4.2. The tested version is today's trunk (it has also occurred for at least three weeks).

The first occurs when compiling in "optimized" mode (aka configure --disable-debug --with-platform=optimized) and does not occur in debug mode.

Fixed as of r14440. Was caused by a faulty compiler hint that allowed the compiler to optimize out some much needed checks on the input.

The other one occurs when running an MPI program without mpirun (I know this is pretty useless but still ;) ). This bug does not require specific compilation options to occur. Running mpirun -np 1 mympiprogram is fine, but running mympiprogram fails with segfault in MPI_Finalize:

    ~/ompi$ mpirun -np 1 mpiself
    ~/ompi$ gdb mpiself
    (gdb) r
    Program received signal EXC_BAD_ACCESS, Could not access memory.
    Reason: KERN_INVALID_ADDRESS at address: 0x77767578
    0x90002e46 in szone_malloc ()

As of r14440, I'm unable to replicate, but it could have been one of those getting lucky issues. Can you see if the problem is still occurring?

Brian
Re: [OMPI devel] SOS!! Run-time error
Wow, it appears everything aborts when opal_event_loop() is called. Did you make any changes to the event library code in opal/event/? If not, that might indicate a mismatch between the binaries and libraries (ie, binaries from one build vs. libraries from another). This will cause random segfaults, possibly like this. If that's no help, can you run ompi_info under gdb and generate a detailed stack trace? Thanks, Brian On Apr 15, 2007, at 11:40 AM, chaitali dherange wrote: I have downloaded the developer version of source code by downloading a nightly Subversion snapshot tarball.And have installed the openmpi. Using ./configure --prefix=/net/hc293/chaitali/openmpi_dev (lots of output... without errors) make all install. (lots of output... without errors) then I have tried to run the example provided in this version of source code... the ring_c.c file... I first copied it to my home directory... /net/hc293/chaitali now when inside my home directory... i did set path=($path /net.hc293/chaitali/openmpi_dev/bin) set $LD_LIBRARY_PATH = ( /net/hc293/chaitali/dev_openmpi/lib ) mpicc -o chaitali_test ring_c.c (This gave no errors at all) mpirun --prefix /net/hc293/chaitali/openmpi_dev -np 3 --hostfile / net/hc293/chaitali/machinefile ./test_chaitali (This gave foll errors..) 
[oolong:09783] *** Process received signal ***
[oolong:09783] Signal: Segmentation fault (11)
[oolong:09783] Signal code: (128)
[oolong:09783] Failing at address: (nil)
[oolong:09783] [ 0] /lib64/tls/libpthread.so.0 [0x2a95e01430]
[oolong:09783] [ 1] /net/hc293/chaitali/openmpi_dev/lib/libopen-pal.so.0(opal_event_init+0x166) [0x2a957d9e16]
[oolong:09783] [ 2] /net/hc293/chaitali/openmpi_dev/lib/libopen-rte.so.0(orte_init_stage1+0x168) [0x2a95680638]
[oolong:09783] [ 3] /net/hc293/chaitali/openmpi_dev/lib/libopen-rte.so.0(orte_system_init+0xa) [0x2a9568375a]
[oolong:09783] [ 4] /net/hc293/chaitali/openmpi_dev/lib/libopen-rte.so.0(orte_init+0x49) [0x2a95680329]
[oolong:09783] [ 5] mpirun(orterun+0x155) [0x4029fd]
[oolong:09783] [ 6] mpirun(main+0x1b) [0x4028a3]
[oolong:09783] [ 7] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a95f273fb]
[oolong:09783] [ 8] mpirun [0x4027fa]
[oolong:09783] *** End of error message ***
Segmentation fault

I understand that [5] and [6] are the actual errors, but I don't understand why, or how to overcome this error.

Please find attached the following files:
- 'ring_c.c' file which I am trying to run.
- 'config.log' file from the openmpi-1.2.1a0r14362 folder
- 'ompi_info --all.txt' which is the output of ompi_info --all... This contains the above mentioned errors.

Thanks and Regards,
Chaitali
Re: [OMPI devel] Is it possible to get BTL transport work directly with MPI level
On Apr 2, 2007, at 10:23 AM, Jeff Squyres wrote:
> On Apr 1, 2007, at 3:12 PM, Ralph Castain wrote:
>
> I can't help you with the BTL question. On the others:
>
> 2. Go through the BML instead -- the BTL Management Layer. This is essentially a multiplexor for all the BTLs that have been instantiated. I'm guessing that this is what you want to do (remember that OMPI has true multi-device support; using the BML and multiple BTLs is one of the ways that we do this). Have a look at ompi/mca/bml/bml.h for the interface.
>
> There is also currently no mechanism to get the BML and BTL pointers that were instantiated by the PML. However, if you're just doing proof-of-concept code, you can extract these directly from the MPI layer's global variables to see how this stuff works.
>
> To have full interoperability of the underlying BTLs and between multiple upper-layer communication libraries (e.g., between OMPI and something else) is something that we have talked about a little, but have not done much work on.
>
> To see the BTL interface (just for completeness), see ompi/mca/btl/btl.h.

Jumping in late to the conversation, and on an unimportant point for what Pooja really wants to do, but...

The BTL really can't be used directly at this point -- you have to use the BML interface to get data pointers and the like. There's never any need to grab anything from the PML or global structures. The BML information is contained on a pointer on the ompi_proc_t structure associated with each peer. The list of peers can be accessed with the ompi_proc_world() call.

Hope this helps,

Brian
Re: [OMPI devel] comment on wiki/PrintfCodes
On Feb 26, 2007, at 1:54 PM, Bert Wesarg wrote:

> I can only speak for a 3 year old linux system but I read evenly the wiki page https://svn.open-mpi.org/trac/ompi/wiki/PrintfCodes and I wonder if someone tried this code. On my system the PRId32 is defined as "d" for example. so to use this you need to write something like this:
>
>   printf("foo: %" PRIu32 ", bar: %ld\n", foo, bar);
>                ^ note this extra '%'.
>
> on the other hand printf have an extra length specifier for size_t, its 'z', so a minimal size_t printf conversion is "%zu".

Thanks, I've fixed the PRI usage case. Unfortunately, %zu isn't recognized by some versions of printf, so we can't use it in Open MPI.

> BTW: are there any plans to provide mpi datatypes for these stdint.h types like {,u}int{8,16,32,64,max,ptr}_t?

Not at this time.

Brian
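For readers following the PRI macro discussion above, here is a minimal sketch of both points: the explicit '%' in front of PRIu32, and one way to print a size_t without %zu. The helper name format_counts and its parameters are made up for illustration; the cast to unsigned long is an assumption that size_t is no wider than unsigned long, which does not hold on every platform (64-bit Windows is the usual exception).

```c
#include <inttypes.h>   /* PRIu32 and the other C99 format macros */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: formats a 32-bit counter and a size_t length into buf.
 * PRIu32 expands to only the conversion letters (for example "u"), so the
 * '%' has to be written explicitly in front of it. */
int format_counts(char *buf, size_t buflen, uint32_t foo, size_t len)
{
    /* No "%zu" here: the size_t value is cast to unsigned long and printed
     * with "%lu", which older printf implementations also understand. */
    return snprintf(buf, buflen, "foo: %" PRIu32 ", len: %lu",
                    foo, (unsigned long) len);
}
```

The cast trades exactness for portability: on a platform where size_t is wider than unsigned long, very large values would be truncated.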
Re: [OMPI devel] [PATCH] ompi_get_libtool_linker_flags.m4: fix $extra_ldflags detection
Very true, thanks. I'll fix this evening.

Brian

On Feb 25, 2007, at 4:51 AM, Bert Wesarg wrote:

> Hello,
>
> ok the sed should be even more portable. but the problem with a CC like "gcc -m32" isn't solved, so you should add this line and use the $tmpCC in the sed expression, to get "gcc -m32" removed:
>
>   tmpCC=`echo $CC`
>
> Bert
>
> Brian W. Barrett wrote:
>> Thanks for the bug report and the patch. Unfortunately, the remove smallest prefix pattern syntax doesn't work with Solaris /bin/sh (standards would be better if everyone followed them...), but I committed something to our development trunk that handles the issue. It should be released as part of v1.2.1 (we're too far in testing to make it part of v1.2).
>>
>> Thanks,
>>
>> Brian
>>
>> On Feb 15, 2007, at 9:12 AM, Bert Wesarg wrote:
>>> Hello,
>>>
>>> when using a multi token CC variable (like "gcc -m32"), the logic to extract $extra_ldflags from libtool doesn't work. So here is a little hack to remove the $CC prefix from the libtool-link cmd.
>>>
>>> Bert Wesarg
>>>
>>> diff -ur openmpi-1.1.4/config/ompi_get_libtool_linker_flags.m4 openmpi-1.1.4-extra_ldflags-fix/config/ompi_get_libtool_linker_flags.m4
>>> --- openmpi-1.1.4/config/ompi_get_libtool_linker_flags.m4 2006-04-12 18:12:28.0 +0200
>>> +++ openmpi-1.1.4-extra_ldflags-fix/config/ompi_get_libtool_linker_flags.m4 2007-02-15 15:11:28.285844893 +0100
>>> @@ -76,11 +76,15 @@
>>>  cmd="$libtool --dry-run --mode=link --tag=CC $CC bar.lo libfoo.la -o bar $extra_flags"
>>>  ompi_check_linker_flags_work yes
>>> +# use array initializer to remove multiple spaces in $CC
>>> +tempCC=($CC)
>>> +tempCC="${tempCC[@]}"
>>> +output="${output#$tempCC}"
>>> +unset tempCC
>>>  eval "set $output"
>>>  extra_ldflags=
>>>  while test -n "[$]1"; do
>>>  case "[$]1" in
>>> -$CC) ;;
>>>  *.libs/bar*) ;;
>>>  bar*) ;;
>>>  -I*) ;;
Re: [OMPI devel] installed wrappers
On Feb 15, 2007, at 2:54 AM, Bert Wesarg wrote: why are the mpiCC, mpif77, and mpif90 wrappers installed, when i specify --disable-mpi-cxx, --disable-mpi-f77, and --disable-mpi-f90 for the ./configure? The Fortran 77 and Fortran 90 compilers will be disabled and return an error if those language bindings are disabled. This seemed to be easier for users to deal with than sometimes not having the wrapper compilers available. And also made it more clear to users when they were using a build of Open MPI without those bindings, which removed support cost from us. The C++ wrapper is a slightly more complicated issue. Many users want to compile C++ code, but still use the C bindings. So they expect mpiCC/mpic++ to work even when the C++ bindings aren't installed (just without linking in the C++ bindings). Brian
Re: [OMPI devel] build problem with 1.1.4
On Feb 15, 2007, at 3:07 AM, Bert Wesarg wrote:

> I encounter a build problem with openmpi 1.1.4, which doesn't show up with version 1.1.2. After a simple ./configure, the variable OPAL_DATADIR in opal/include/opal/install_dirs.h shows this:
>
> $ grep '^#define OPAL_DATADIR' openmpi-1.1.2/opal/include/opal/install_dirs.h
> #define OPAL_DATADIR "/usr/local/share"
> $ grep '^#define OPAL_DATADIR' openmpi-1.1.4/opal/include/opal/install_dirs.h
> #define OPAL_DATADIR "${prefix}/share"
>
> this results in the problem, that the opal_wrapper can't find the wrapper data files in /share/openmpi/.

Is this with an SVN checkout or the release tarball? The issue you are seeing is a known issue if you use Autoconf 2.60 or higher to create the build system for Open MPI 1.1.x. The release tarball is built with Autoconf 2.59, and I just checked to verify that 1.1.4 was in fact using AC 2.59 and not creating the bad datadir defines. You might want to make sure that some part of your build was not rerunning autoconf in the release source code.

Brian
[OMPI devel] (no subject)
Hi all -

I have four changes I'd like to make to the wrapper compilers (well, more the build system). I figured I'd make them available for public comment before I did any of them, as they would change how things got installed and what the user sees:

1) Only install opal{cc,c++} and orte{cc,c++} if configured with --with-devel-headers. Right now, they are always installed, but there are no header files installed for either project, so there's really not much way for a user to actually compile an OPAL / ORTE application.

2) Drop support for opalCC and orteCC. It's a pain to set up all the symlinks (indeed, they are currently done wrong for opalCC) and there's no history like there is for mpiCC. This isn't a big deal, but would make two Makefiles easier to deal with. And since about every 3 months I have to fix the Makefiles after they get borked up a little bit, it makes my life easier.

3) Change what is currently opalcc.1 (name it something generic, probably opal_wrapper.1) and add some macros that get sed'ed so that the man pages appear to be customized for the given command. Josh and I had talked about this long ago, but time prevented us from actually doing anything.

4) Install the wrapper data files even if we compiled with --disable-binaries. This is for the use case of doing multi-lib builds, where one word size will only have the library built, but we need both sets of wrapper data files to piece together to activate the multi-lib support in the wrapper compilers.

Comments?

Brian

-- Brian Barrett Open MPI developer http://www.open-mpi.org/
[OMPI devel] configure changes (ooops!)
Hi all - At the last minute last night I wanted to change one small detail in the wrapper compiler code. Then, as is typical with me, I got distracted. As some of you noticed, none of the configure changes made it into the trunk last night. Should happen this weekend. Sorry about that! Brian
Re: [OMPI devel] Buffer Overflow Error
What facilities are you using to detect the buffer overflow? We've seen no such issues in our testing and I'd be surprised if there was an issue in that code path. Valgrind and friends don't show any issues on our test machines, so without more detail, I'm afraid we really can't fix the issue you are seeing.

Brian

On Thu, 2006-08-24 at 13:53 -0400, Dave Rogers wrote:
> I just compiled the latest version on my machine and ran a dumb test - mpirun without any arguments.
> This generated a buffer overflow error!
>
> Error message (reproducible with different mem. addr.s):
> [ /home/dave/rpmbuild ] $ mpirun
> *** buffer overflow detected ***: mpirun terminated
> === Backtrace: =
> /lib64/libc.so.6(__chk_fail+0x2f)[0x31669dee3f]
> /lib64/libc.so.6[0x31669de69b]
> /lib64/libc.so.6(__snprintf_chk+0x7b)[0x31669de56b]
> /usr/lib64/libopal.so.0(opal_cmd_line_get_usage_msg+0x20a)[0x2ac1088a]
> mpirun[0x403c53]
> mpirun(orterun+0xa0)[0x402798]
> mpirun(main+0x1b)[0x4026f3]
> /lib64/libc.so.6(__libc_start_main+0xf4)[0x316691d084]
> mpirun[0x402649]
> === Memory map:
> 0040-00408000 r-xp 09:01 2697992 /usr/bin/orterun
> ...
> 7fff20e92000-7fff20ea8000 rw-p 7fff20e92000 00:00 0 [stack]
> ff60-ffe0 ---p 00:00 0 [vdso]
> Aborted
>
> Installation details: System: FC5 AMD Opteron x86_64
> downloaded SRPM version 1.1.1
>
> rpm -ivh /usr/local/src/dist/libs/openmpi-1.1-1.src.rpm
> rpmbuild -ba SPECS/openmpi-1.1.spec --target x86_64
> - generates an error from check-rpaths stating that the /usr/lib64 prefix is unnecessary and may cause problems
> QA_RPATHS=$[ 0x0001|0x0010 ] rpmbuild -ba SPECS/openmpi-1.1.spec --target x86_64
> - suggested workaround - ignores as warnings
> rpm -ivh ~dave/rpmbuild/RPMS/x86_64/openmpi-1.1-1.x86_64.rpm
> - generates a package conflict -- file /usr/lib64/libopal.so from install of openmpi-1.1-1 conflicts with file from package opal-2.2.1-1
> - apparently, this comes from opal, the open phone abstraction library... so I uninstalled opal
> rpm -ivh ~dave/rpmbuild/RPMS/x86_64/openmpi-1.1-1.x86_64.rpm
> - worked!
>
> The strange thing is that mpirun with normal arguments works as expected without any sorts of mem. errors.
> mpirun with flags -h or --help also buffer overflows, but not mpirun with an unrecognized argument, to which it spits out a "you must specify how many processes to launch, via the -np argument." error.
>
> I hope this gets fixed soon, buffer overflows are potential security vulnerabilities.
>
> ~ David Rogers
Re: [OMPI devel] Stack trace printing
Yes. It's always the trampoline, the signal handler, and the stack trace printer.

Brian

On Wed, 2006-08-30 at 17:37 -0400, Jeff Squyres wrote:
> As long as it's always 3 function calls -- do we know that it will be?
>
> On 8/30/06 5:32 PM, "Brian Barrett" <brbar...@open-mpi.org> wrote:
>
> > Hi all-
> >
> > A question about stack tracing. Currently, we have it setup so that, say, a segfault results in:
> >
> > [0] func:/u/jjhursey/local/odin/ompi/devel/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a959166ab]
> > [1] func:/u/jjhursey/local/odin/ompi/devel/lib/libopal.so.0 [0x2a959150bb]
> > [2] func:/lib64/tls/libpthread.so.0 [0x345cc0c420]
> > [3] func:/san/homedirs/jjhursey/local/odin//ompi/devel/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x480) [0x2a95fd6354]
> > [4] func:/u/jjhursey/local/odin/ompi/devel/lib/liborte.so.0(mca_oob_recv_packed+0x46) [0x2a957a96a3]
> > [5] func:/u/jjhursey/local/odin/ompi/devel/lib/libmpi.so.0(ompi_comm_connect_accept+0x1d8) [0x2a955a29dc]
> > [6] func:/u/jjhursey/local/odin/ompi/devel/lib/libmpi.so.0(ompi_comm_dyn_init+0x110) [0x2a955a49e0]
> >
> > This seems to result in confusion from some users (not josh, I was just reading his latest bug when I thought of this) that the error must be in OMPI because that's where it segfaulted. It would be fairly trivial (at least, on Linux and OS X) to not print the last 3 lines such that the error looked like:
> >
> > [0] func:/san/homedirs/jjhursey/local/odin//ompi/devel/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x480) [0x2a95fd6354]
> > [1] func:/u/jjhursey/local/odin/ompi/devel/lib/liborte.so.0(mca_oob_recv_packed+0x46) [0x2a957a96a3]
> > [2] func:/u/jjhursey/local/odin/ompi/devel/lib/libmpi.so.0(ompi_comm_connect_accept+0x1d8) [0x2a955a29dc]
> > [3] func:/u/jjhursey/local/odin/ompi/devel/lib/libmpi.so.0(ompi_comm_dyn_init+0x110) [0x2a
> >
> > Would anyone object to such a change?
> >
> > Brian
[OMPI devel] Stack trace printing
Hi all-

A question about stack tracing. Currently, we have it setup so that, say, a segfault results in:

[0] func:/u/jjhursey/local/odin/ompi/devel/lib/libopal.so.0(opal_backtrace_print+0x2b) [0x2a959166ab]
[1] func:/u/jjhursey/local/odin/ompi/devel/lib/libopal.so.0 [0x2a959150bb]
[2] func:/lib64/tls/libpthread.so.0 [0x345cc0c420]
[3] func:/san/homedirs/jjhursey/local/odin//ompi/devel/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x480) [0x2a95fd6354]
[4] func:/u/jjhursey/local/odin/ompi/devel/lib/liborte.so.0(mca_oob_recv_packed+0x46) [0x2a957a96a3]
[5] func:/u/jjhursey/local/odin/ompi/devel/lib/libmpi.so.0(ompi_comm_connect_accept+0x1d8) [0x2a955a29dc]
[6] func:/u/jjhursey/local/odin/ompi/devel/lib/libmpi.so.0(ompi_comm_dyn_init+0x110) [0x2a955a49e0]

This seems to result in confusion from some users (not josh, I was just reading his latest bug when I thought of this) that the error must be in OMPI because that's where it segfaulted. It would be fairly trivial (at least, on Linux and OS X) to not print the last 3 lines such that the error looked like:

[0] func:/san/homedirs/jjhursey/local/odin//ompi/devel/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_recv+0x480) [0x2a95fd6354]
[1] func:/u/jjhursey/local/odin/ompi/devel/lib/liborte.so.0(mca_oob_recv_packed+0x46) [0x2a957a96a3]
[2] func:/u/jjhursey/local/odin/ompi/devel/lib/libmpi.so.0(ompi_comm_connect_accept+0x1d8) [0x2a955a29dc]
[3] func:/u/jjhursey/local/odin/ompi/devel/lib/libmpi.so.0(ompi_comm_dyn_init+0x110) [0x2a

Would anyone object to such a change?

Brian
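The trimming proposed above could look roughly like the following. This is only a sketch built on the backtrace() interface from execinfo.h (available with glibc and on OS X); it is not the code in opal/util/stacktrace.c, and the function names and MAX_FRAMES limit are invented for illustration. When called from a signal handler, the three innermost frames would be the printer, the handler, and the trampoline, so a skip count of 3 drops exactly those.

```c
#include <execinfo.h>   /* backtrace(), backtrace_symbols_fd() */

#define MAX_FRAMES 64   /* arbitrary cap chosen for this sketch */

/* Index of the first frame worth showing: skip the innermost `skip`
 * frames, but never run past the end of a short trace. */
int first_shown_frame(int total, int skip)
{
    if (skip < 0) {
        skip = 0;
    }
    return (skip < total) ? skip : total;
}

/* Hypothetical printer: writes the current call stack to file descriptor
 * fd, leaving out the innermost `skip` frames.  Returns the number of
 * frames actually written. */
int print_trimmed_backtrace(int fd, int skip)
{
    void *frames[MAX_FRAMES];
    int total = backtrace(frames, MAX_FRAMES);
    int start = first_shown_frame(total, skip);

    /* Writes one line per frame directly to fd instead of returning
     * allocated strings. */
    backtrace_symbols_fd(frames + start, total - start, fd);
    return total - start;
}
```

A signal handler would call something like print_trimmed_backtrace(2, 3) to write to stderr. The fixed skip count is only safe if the number of internal frames really is constant, which is the question raised in the follow-up.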
Re: [OMPI devel] OpenRTE and Threads
In general, I think making the Public interface to OpenRTE not thread safe is a reasonable thing to do. However, I have some concern over how this would work with the event library. When the project is compiled with progress threads, the event library runs in its own thread. More important to this discussion, all callbacks from the event library are triggered in the callback thread (not the thread that registered the event), meaning that it's very likely the GPR could get a callback from a non-blocking OOB receive in a thread that is other than the main thread of the application and that it could happen while the main thread of the application is already in the GPR. Not sure what the best way to handle this would be, but I don't think you could do it from the event level without making adjustments that would prohibit concurrency at the MPI layer, which is obviously sub-optimal. Of course, we could modify the code so that non-OMPI applications didn't start the event progress thread, but that wouldn't solve the MPI-layer issues. Brian On Fri, 2006-08-25 at 14:14 -0600, Ralph Castain wrote: > There has been ongoing discussion for some time about the thread safety of > OpenRTE. This note proposes a solution to that problem that has been kicked > around for the last several months, and that Jeff and I feel makes a certain > degree of sense. > > Short description > - > We propose to make OpenRTE appear "single-threaded" to outside users. By > that we do not mean that OpenRTE may not have some internal threads in > operation. Instead, we mean that thread locking would be the responsibility > of anyone calling an OpenRTE function - as opposed to built into the OpenRTE > system itself. > > Explanation > - > Most of the logic inside of OpenRTE is serial in nature and therefore > resistant to the use of threads. Accordingly, we find ourselves putting > giant thread locks around virtually every function in the code base. 
This > wastes our time, complicates the code (we all keep forgetting to unlock when > exiting due to errors), and basically eliminates any benefits from threading > anyway. > > Those few places where threading is possible are actually involved in > OpenRTE-internal operations. For example, we now use a thread to accept > out-of-band communication socket connections. These operations, however, are > transparent to the user level (i.e., any code that calls OpenRTE). > > It seems, therefore, that the simplest solution is to place the > responsibility for thread locking onto the calling programs. Unthreaded > programs need do nothing. Programs utilizing threads, however, would need to > thread lock prior to calling OpenRTE functions. > > Any comments on this idea? If not, or if there is general consensus on this > approach, then we would gradually remove the current thread locks as code is > revised - this isn't a high priority issue requiring an immediate scrub of > the code.
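To make the proposal above concrete, here is a small sketch of what "locking is the caller's responsibility" means for a threaded program. Nothing in it is OpenRTE code: rte_registry_add is a stand-in for any library entry point that is not thread safe, and app_registry_add is the wrapper an application would write around it.

```c
#include <pthread.h>

/* Stand-in for a library entry point that is NOT thread safe: it updates
 * shared state with no locking of its own. */
static long registry_counter = 0;

long rte_registry_add(long delta)
{
    registry_counter += delta;
    return registry_counter;
}

/* One application-owned lock serializes every call into the library. */
static pthread_mutex_t rte_lock = PTHREAD_MUTEX_INITIALIZER;

/* What a threaded caller would use instead of calling the library directly. */
long app_registry_add(long delta)
{
    long value;

    pthread_mutex_lock(&rte_lock);
    value = rte_registry_add(delta);
    pthread_mutex_unlock(&rte_lock);
    return value;
}
```

An unthreaded program would keep calling rte_registry_add() directly and pay nothing. As the reply above points out, the scheme only holds if callbacks delivered on the event library's thread go through the same lock, which is where the difficulty lies.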
[OMPI devel] LANL ORTE todo / milestones
Hi all -

LANL had an internal meeting yesterday trying to classify a number of issues we're having with the run-time environment for Open MPI and how to best prioritize team resources. We thought it would be good to both share the list (with priorities) with the group and to ask the group if there were other issues that need to be addressed (either short or long term). We've categorized the issues as performance related, robustness, and feature / platform support. The numbers are the current priority on our list, and items within a category are sorted by priority.

PERFORMANCE:

5) 50% scale factor in process startup

   Start-up of non-MPI jobs has a strange bend in the timing curve when the number of processes we are trying to start is greater than or equal to 50% of the current allocation. It appears that starting a 16 process (1 ppn) job takes longer if there are 32 nodes in the allocation than if there are 64 nodes in the allocation.

   Assigned to: Galen

6) MPI_INIT startup timings

   In addition to seeming to suffer from the same 50% issue as the previous issue, there also appears to be a number of places in MPI_INIT where we spend a considerable amount of time when at scale, leading to startup times much worse than LA-MPI or MPIEXEC/MVAPICH.

   Assigned to: Galen

ROBUSTNESS:

1) MPI process aborting issue

   This is the orted spin, MPI processes don't die, etc. issue that occurs when some process dies unexpectedly. Ralph has already sent a detailed e-mail to devel about this issue.

   Assigned to: Ralph

1.5) MPI_ABORT rework

   The MPI process aborting issue is going to require a rework of MPI_ABORT so that it uses the error manager instead of calling terminate_proc/terminate_job.

   Assigned to: Brian

2) ORTE hangs when start-up fails

   If an orted fails to start or fails to connect back to the HNP, the system hangs waiting for the callback. If an orted process fails to start entirely, we sometimes catch this. But we need a better mechanism for handling the general failure case.

   Assigned to: Ralph

3) Hardened cleanup of session directory

   While #1 should greatly help in ensuring that the session directory is cleaned up every time, there are still a number of race conditions that need to be sorted out. The goal is to develop a plan that ensures files that need to be removed are removed automatically a high percentage of the time, that there is a way to allow a tool like orte_clean to clean up everything it should clean up, and that there is a way to make sure files that should not be automatically removed aren't automatically removed.

   Assigned to: Brian

3.5) Process not found hangs

   See https://svn.open-mpi.org/trac/ompi/ticket/245

   Assigned to: Ralph

7) Node death failures / hangs

   With the exception of BProc, if a node fails, we don't detect the failure. Even if we did detect the failure, we have no general mechanism for dealing with that failure. The bulk of this project is going to be adding a general SOH/SMR component that uses the OOB for timeout pings.

   Assigned to: Brian

15) More friendly error messages

   There are situations where we give something south of a useful error message when an error is found. We should play nicer with users.

   Assigned to:

16) Consistent error checking

   We've had a number of recent instances of errors occurring, but not being propagated / returned to the user simply because no one ever checked the return code. We need to audit most of ORTE to always check return codes.

   Assigned to:

FEATURE / PLATFORM SUPPORT:

4) TM error handling

   TM, while used on a number of large systems LANL needs to support, is not exactly friendly to usage at scale. It seems that it likes to go away and cry to mamma for a couple seconds, returning system error messages, only to come back and be ok a second later. This means that every TM call needs to be handled as if it's going to fail, and we need to be prepared to re-initialize the system (if possible) when failures occur. In testing on t-bird, launching was usually pretty stable, but the calls to get the node allocations tended to result in the strange behavior. These should definitely be re-startable type errors.

   Assigned to: Brian

8) Heterogeneous Issues

   Assigned to:

9) External connections

   This covers issues like those the Eclipse team is experiencing. If, for example, a TCP connection to the seed is severed, it causes Open RTE to call abort, which means Eclipse just aborted. That's not so good. There are other naming / status issues that also need to be handled here.

   Assigned to:

9.5) Fix/Complete orte-ps and friends

   orte-ps / orte-clean / etc. all depend on being able to make a connection to the orte universe that doesn't result in bad things happening. We should finish these things for
Re: [OMPI devel] exit declaration in configure tests
On Mon, 2006-08-21 at 09:38 +0200, Ralf Wildenhues wrote: > Revision 11268 makes me curious: > > |M /trunk/config/ompi_setup_cxx.m4 > | > | Reorder the C++ compiler discovery stages. Check first the compiler vendor > | before checking if we are able to compile the test program. This is required > | for windows as the C++ conftest.c file generated by configure cannot be > | compiled with the Microsoft cl.exe compiler (because of the exit function > | prototype). So if we detect a vendor equal to microsoft we will assume > | that the compiler is correctly installed (which is true on Windows most > | of the time anyway). > > I believe to have killed all problematic exit cases from the OpenMPI > configury some time ago. Did I miss any, or did the remaining ones > come from some other package (so we can fix that one)? Which Autoconf > version was used (2.60 should not use any exit declarations itself any > more)? For one, I think I forgot to commit the patch you sent (shame on me!). But I know George wasn't using AC 2.60 at the time. He was going to try that and see if it helped. Brian
Re: [OMPI devel] one-sided communication implementation
On Thu, 2006-07-20 at 11:56 +1000, gh rory wrote: > In the process of trying to create a wrapper for open mpi to another > language. Specifically, I am trying to understand how the remote > memory access/one-sided communication works in open mpi 1.1, and I am > having some trouble. > > I have begun by trying to trace the steps in a simple MPI_Get call. > It seems that ompi_osc_pt2pt_replyreq_recv in > ompi/mca/osc/pt2pt/osc_pt2pt_data_move.c is the function that receives > the data for the requesting process, however I have not been able to > find the part of the code that receives the request at the other end. > It looks like ompi_osc_pt2pt_component_fragment_cb in > osc_pt2pt_component.c sends the data back to the requesting process, > but I can't see where the data is actually copied. > > Can someone please point me in the right direction? Is there any > documentation on the one-sided communication implementation that I > should be reading? The one-sided component is layered on top of our BTL transport layer, which uses an active message callback on message arrival. The ompi_osc_pt2pt_component_fragment_cb() call is called whenever a new message has arrived. The function then dispatches based on message type. If you look at the case for OMPI_OSC_PT2PT_HDR_PUT, you see a call to ompi_osc_pt2pt_sendreq_recv_put(), which either uses the convertor (our datatype engine) to unpack the data in the ompi_convertor_unpack() call or posts a long message to receive the data. Hope this helps, Brian
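The callback-and-dispatch structure described above can be illustrated with a much simplified stand-alone sketch. None of the names below are Open MPI's: struct frag_hdr, the HDR_* values, and on_fragment are invented here, and the memcpy stands in for the datatype-engine unpack step. The real component also handles the other message types and long messages, which this sketch omits.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical message types carried in the fragment header. */
enum { HDR_PUT = 1, HDR_GET = 2 };

/* Hypothetical header placed in front of each incoming fragment. */
struct frag_hdr {
    unsigned char type;    /* HDR_PUT, HDR_GET, ... */
    size_t        offset;  /* byte offset into the target window */
    size_t        len;     /* payload length in bytes */
};

/* Called once per arriving fragment, the way an active-message callback
 * would be.  Dispatches on the header type; for a put, the payload is
 * copied into the exposed window.  Returns 0 on success, -1 otherwise. */
int on_fragment(unsigned char *window, size_t window_len,
                const struct frag_hdr *hdr, const void *payload)
{
    switch (hdr->type) {
    case HDR_PUT:
        /* Reject anything that would write outside the window. */
        if (hdr->offset > window_len || hdr->len > window_len - hdr->offset) {
            return -1;
        }
        memcpy(window + hdr->offset, payload, hdr->len);
        return 0;
    default:
        /* Other message types are not handled in this sketch. */
        return -1;
    }
}
```

The point of the structure is that the target never posts a matching receive for a put: the data movement happens entirely inside the callback that runs when the fragment arrives.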
[OMPI devel] trunk changes: F90 shared libraries / New one-sided component
Hi all -

Two large changes to the SVN trunk just occurred which require an autogen.sh on your part.

First, we now (mostly) support building the Fortran 90 MPI bindings library as a shared library. This has been something Dan and I have been working on since the Burlington meeting, and it's ready for wider testing. There are some things to pay attention to with this change:

1) If your Fortran 77 and Fortran 90 compilers have different names, you *MUST* update to Libtool 2.0 or disable F90 support.

2) If your Fortran 77 and Fortran 90 compilers have the same name, you can continue using Libtool 1.5.22.

3) On all platforms other than OS X, the f90 support library is built as a shared library by default (following the way the other libraries are built). OS X always builds a static library due to common block issues.

Configure will detect whether you are using an older version of Libtool with which your Fortran compilers will cause a problem. Libtool 2.0 isn't at a stable release yet, but we need to provide a shared library for the bindings as part of the 1.2 release, so we'll have to deal with the pre-releases of Libtool. The nightly tarballs of the SVN trunk have been created using a pre-release of LT for about the last 2 weeks, so we don't anticipate any problems with this.

Second, there are now two one-sided communication components. The one previously known as "pt2pt" has been renamed "rdma" and there is now a new component "pt2pt". The new "pt2pt" component is entirely (and somewhat inefficiently) implemented over the PML (two-sided) interface and was added to support the use of the CM PML / MTLs, which will be part of the 1.2 release. The "rdma" component will be preferred over the "pt2pt" component, but will only allow itself to be activated when a PML using the BML/BTL infrastructure is being used. While the "rdma" component doesn't use any of the BTL rdma interface at the moment, this is something I will be changing in the near future. So eventually, the name will be more fitting than it is right now.

Both of these changes will require a full autogen.sh ; configure ; make cycle when you next SVN update.

Brian
Re: [OMPI devel] progress thread check
On Thu, 2006-07-27 at 07:49 -0400, Graham E Fagg wrote: > Hi all > is there a single function call that components can use to check that the > progress thread is up and running ? Not really. But if the define OMPI_ENABLE_PROGRESS_THREADS is 1 and opal_using_threads() returns true, then it can be assumed the event progress thread is running. brian
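The two-part check described above can be written down as a tiny helper. This is a sketch that compiles outside the Open MPI tree, so both inputs are stand-ins: the macro is given a default value here (in a real build it comes from configure), and the result of opal_using_threads() is passed in as a parameter rather than called directly. The helper name is made up.

```c
#include <stdbool.h>

/* Stand-in default so the sketch compiles on its own; in the Open MPI
 * tree this value is set by the build system. */
#ifndef OMPI_ENABLE_PROGRESS_THREADS
#define OMPI_ENABLE_PROGRESS_THREADS 1
#endif

/* Hypothetical helper.  `using_threads` is whatever opal_using_threads()
 * returned at the call site.  The answer is an assumption, not a guarantee:
 * there is no direct query for the state of the progress thread. */
bool progress_thread_assumed_running(bool using_threads)
{
#if OMPI_ENABLE_PROGRESS_THREADS
    return using_threads;
#else
    (void) using_threads;   /* progress threads compiled out entirely */
    return false;
#endif
}
```

Inside a component the call would read progress_thread_assumed_running(opal_using_threads()).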
Re: [OMPI devel] universal / "fat" binary support?
On Thu, 2006-07-27 at 15:21 -0700, Ben Byer wrote:
> I'd like to be able to build OpenMPI "fat" -- for multiple architectures in one pass, so I can create a universal binary for OSX. I see that it was mentioned last year, at http://www.open-mpi.org/community/lists/users/2005/06/0087.php as something that was "a ways off".
>
> Has any progress been made on that front, or do you still plan to support this?

We currently can't build a universal binary in "one pass". We actually had a long discussion with some Apple engineers about the issue and what it came down to was that supporting a "one pass" build of Open MPI as a universal binary would take lots of development effort in making Autoconf, Automake, and Libtool smarter. We don't have the resources to do that work on the autotools and it doesn't sound like there is enough demand on the autotools authors for them to do the work, so it's unlikely we'll progress on that front for some time.

We do provide a script in contrib/dist/macosx/ that will take a tarball and build a universal binary .pkg file. It ends up running the configure / compile sequence three times (PPC, PPC64, and x86), but it works quite well. Mostly because it works so well, it is very difficult to make further work on our build system to support a "one pass" build of a Universal Binary a high priority.

Brian
[OMPI devel] SVN breakage / new event library committed
Hi all -

I just finished committing the event library into the trunk. Unfortunately, because the event library was not imported using a vendor import 2 years ago, I had to do some things that made SVN a little unhappy. The good news is that the next libevent update will not require these changes. The bad news is that you have to follow some special instructions to properly update your SVN checkout.

In particular, you need to completely delete the opal/event directory, then run svn up. If you svn up'ed before reading this e-mail, just rm -rf opal/event and svn up again. All should be good. After updating, you *MUST* re-run autogen.sh and configure (sorry!).

Because I was already making everyone re-run autogen.sh, I also committed some code to opal that turns the backtrace-printing code from a set of #ifs in opal/util/stacktrace.c into a full-blown framework. Terry added support for Solaris the other day, and I figured out how to support OS X. This made three possible setups, and OS X required a bunch of files, so it seemed that a framework was needed.

Two notes about the OS X stacktrace support. First, it doesn't print a useful stack for 64-bit binaries yet, but I'm working on it. Second, there are some warnings about C++ comments in the code. PLEASE DO NOT FIX THESE. I will be fixing them shortly, but need to find a way that doesn't make future updates impossibly difficult.

Brian
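The update steps described above can be sketched as a small script to run from the top of the SVN checkout. `DRY_RUN`, `run`, and `update_event_checkout` are hypothetical names introduced here; by default the script only prints the commands, so nothing is deleted until `DRY_RUN=0` is set.

```shell
#!/bin/sh
# Sketch of the recovery steps described above; run from the top of the
# SVN checkout. DRY_RUN is a hypothetical switch: it defaults to 1 so the
# commands are only printed. Set DRY_RUN=0 to actually execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

update_event_checkout() {
    # Completely delete the old event directory.
    run rm -rf opal/event || return 1
    # Fetch the newly committed event library.
    run svn up || return 1
    # Both of these are mandatory after this update.
    run ./autogen.sh || return 1
    # Add your usual configure arguments here.
    run ./configure || return 1
}

update_event_checkout
```

The same sequence works whether or not you already ran svn up once, since the directory is removed first either way.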
Re: [OMPI devel] Problem compiling openmpi 1.1
On Mon, 2006-07-10 at 17:44 +0200, Pierre wrote:
> rtsig.c:365: error: `EV_SIGNAL' undeclared (first use in this function)
> rtsig.c:392: error: dereferencing pointer to incomplete type
> rtsig.c:392: error: `EV_PERSIST' undeclared (first use in this function)
> make[3]: *** [rtsig.lo] Error 1
> make[3]: Leaving directory `/tmp/openmpi-1.1/opal/event'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory `/tmp/openmpi-1.1/opal/event'
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory `/tmp/openmpi-1.1/opal'
> make: *** [all-recursive] Error 1

That's a bit unexpected. Can you please send us the information requested in our "Getting Help" section of the web page:

http://www.open-mpi.org/community/help/

It will help immensely in determining what went wrong.

Thanks,

Brian