[OMPI devel] BML problem?
I'm not exactly sure where the fix for this should be, but I think I've found a problem. Consider, for illustration, launching a multi-process job on a single node. The function mca_bml_r2_add_procs() calls mca_btl_sm_add_procs(). Each process could conceivably return a different value -- OMPI_SUCCESS or otherwise. E.g., if there isn't enough room for every process to allocate all the shared memory it needs, early processes might succeed in their allocations while laggards won't.

The fact that some processes fail doesn't bother the BML. It just loops over other BTLs and, quite possibly, finds another BTL to make the needed connections. Is this a problem? It seems to me to be, but I haven't yet figured out what the BML does next. I'm guessing it ends up with a hodgepodge of BTLs -- e.g., A talks to B via sm, but B talks to A via tcp. And, still guessing, this produces badness (like hangs). Comments?
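To make the failure mode concrete, here is a simplified sketch of the selection loop described above. This is not the actual mca_bml_r2_add_procs() source -- the function-pointer type, the reachability bitmap, and the "first taker wins" policy are illustrative assumptions:

#include <stdbool.h>
#include <stddef.h>

typedef struct { const char *name; /* "sm", "tcp", ... */ } btl_t;

/* Hypothetical per-BTL hook: returns true on success and marks which
 * peers this BTL can reach in 'reachable'. */
typedef bool (*btl_add_procs_fn)(btl_t *btl, size_t nprocs, bool *reachable);

/* Sketch of the loop in question: a BTL that fails (e.g. the sm BTL
 * running out of shared memory) is silently skipped, and a later BTL
 * may claim the peer instead.  Nothing forces two processes running
 * this same loop to pick the same winner for a given pair. */
static void bml_add_procs_sketch(btl_t **btls, btl_add_procs_fn *fns,
                                 size_t nbtls, size_t nprocs,
                                 const btl_t **chosen /* one per peer */)
{
    for (size_t p = 0; p < nprocs; ++p)
        chosen[p] = NULL;

    for (size_t b = 0; b < nbtls; ++b) {
        bool reachable[64] = { false };       /* assumes nprocs <= 64 */
        if (!fns[b](btls[b], nprocs, reachable))
            continue;  /* failure ignored; the next BTL may still win */
        for (size_t p = 0; p < nprocs; ++p)
            if (reachable[p] && chosen[p] == NULL)
                chosen[p] = btls[b];          /* first taker wins */
    }
}

If mca_btl_sm_add_procs() fails on only some ranks, each rank's 'chosen' array is locally consistent but globally mismatched -- exactly the hodgepodge described above.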
Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
Fixed in r20120.

george.

On Dec 11, 2008, at 19:14, Brian Barrett wrote: [quoted text elided]
Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
I think that's a reasonable solution. However, the words "not it" come to mind. Sorry, but I have way too much on my plate this month.

By the way, in case no one noticed, I had e-mailed my findings to devel. Someone might want to reply to Dorian's e-mail on users.

Brian

On Dec 11, 2008, at 2:31 PM, George Bosilca wrote: [quoted text elided]
Re: [OMPI devel] Fwd: [OMPI users] Onesided + derived datatypes
Brian,

You're right, the datatype is being too cautious with the boundaries when detecting the overlap. There is no good solution to detect the overlap except parsing the whole memory layout to check the status of every predefined type. As one can imagine, this is a very expensive operation. This is the reason I preferred to use the true extent and the size of the data to try to detect the overlap. This approach is a lot faster, but has poor accuracy.

The best solution I can think of in the short term is to remove the overlap check completely. This will have absolutely no impact on the way we pack the data, but can lead to unexpected results when we unpack and the data overlaps. But I guess this can be considered a user error, as the MPI standard clearly states that the result of such an operation is ... unexpected.

george.

On Dec 10, 2008, at 22:20, Brian Barrett wrote:

Hi all -

I looked into this, and it appears to be datatype related. If the displacements are set to 3, 2, 1, 0, the datatype will fail the type checks for one-sided because is_overlapped() returns 1 for the datatype. My reading of the standard seems to indicate this should not be. I haven't looked into the problems with displacements set to 0, 1, 2, 3, but I'm guessing it has something to do with the reverse problem. This looks like a datatype issue, so it's out of my realm of expertise. Can someone else take a look?

Brian

Begin forwarded message:

From: doriankrause
Date: December 10, 2008 4:07:55 PM MST
To: us...@open-mpi.org
Subject: [OMPI users] Onesided + derived datatypes
Reply-To: Open MPI Users

Hi List,

I have an MPI program which uses one-sided communication with derived datatypes (MPI_Type_create_indexed_block). I developed the code with MPICH2 and unfortunately didn't think about trying it out with OpenMPI. Now that I'm "porting" the application to OpenMPI, I'm facing some problems. On most machines I get a SIGSEGV in MPI_Win_fence; sometimes an invalid datatype shows up. I ran the program in Valgrind and didn't get anything valuable. Since I can't see a reason for this problem (at least if I understand the standard correctly), I wrote the attached test program. Here are my experiences:

* If I compile without ONESIDED defined, everything works, and V1 and V2 give the same results.
* If I compile with ONESIDED and V2 defined (MPI_Type_contiguous), it works.
* ONESIDED + V1 + O2: No errors, but obviously nothing is sent? (Am I right in assuming that V1+O2 and V2 should be equivalent?)
* ONESIDED + V1 + O1:
[m02:03115] *** An error occurred in MPI_Put
[m02:03115] *** on win
[m02:03115] *** MPI_ERR_TYPE: invalid datatype
[m02:03115] *** MPI_ERRORS_ARE_FATAL (goodbye)

I didn't get a segfault as in the "real life" example, but if ompitest.cc is correct, it means that OpenMPI is buggy when it comes to one-sided communication and (some) derived datatypes, so the problem is probably not in my code. I'm using OpenMPI-1.2.8 with the newest gcc 4.3.2, but the same behaviour can be seen with gcc-3.3.1 and Intel 10.1. Please correct me if ompitest.cc contains errors. Otherwise I would be glad to hear how I should report these problems to the developers (if they don't read this).

Thanks + best regards
Dorian
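George's fast-but-coarse check can be sketched with the two queries he names. This is illustrative only -- the heuristic shown (packed size versus true extent) is an assumption, and the real ompi/datatype code is more involved:

#include <mpi.h>

/* If a datatype packs more bytes than fit in its true extent, some
 * bytes must overlap.  The converse does not hold, which is the poor
 * accuracy George mentions: a permuted indexed type can look suspect
 * to a conservative boundary check even though no bytes overlap. */
static int overlap_suspected(MPI_Datatype type)
{
    int size;
    MPI_Aint true_lb, true_extent;
    MPI_Type_size(type, &size);
    MPI_Type_get_true_extent(type, &true_lb, &true_extent);
    return (MPI_Aint)size > true_extent;
}

And a minimal standalone reproducer of the pattern Dorian describes. This is not his attached ompitest.cc (the attachment is not in the archive); the counts, the buffer layout, and the reversed displacements from Brian's note are invented for illustration. Run with at least two processes:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    if (nprocs < 2) { MPI_Finalize(); return 1; }

    double buf[4] = { 0, 0, 0, 0 };
    MPI_Win win;
    MPI_Win_create(buf, sizeof(buf), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Reversed displacements -- the case Brian says wrongly trips
     * is_overlapped(): four blocks of length 1 at offsets 3,2,1,0. */
    int displs[4] = { 3, 2, 1, 0 };
    MPI_Datatype itype;
    MPI_Type_create_indexed_block(4, 1, displs, MPI_DOUBLE, &itype);
    MPI_Type_commit(&itype);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double src[4] = { 1, 2, 3, 4 };
        MPI_Put(src, 4, MPI_DOUBLE, 1 /* target rank */, 0, 1, itype, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1)
        printf("buf = %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);

    MPI_Type_free(&itype);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

With the overly cautious check in place, the MPI_Put here would fail with MPI_ERR_TYPE even though the target datatype has no overlapping bytes.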
Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
I'm quite sure that the CM CPC stuff (both IBCM -- which doesn't fully work anyway -- and RDMA CM) will time out and Bad Things will happen if you interrupt it in the middle of some network transactions. The (kernel-imposed) timeout for RDMA CM is pretty short -- on the order of a minute or two.

On Dec 11, 2008, at 3:19 PM, Josh Hursey wrote: [quoted text elided]

-- Jeff Squyres
Cisco Systems
Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
I would expect that you will hit problems with timeouts throughout the codebase, as Jeff mentioned, particularly with network connections. Having a 'prepare to suspend' signal followed by a 'suspend now' signal might work, since it should provide enough of a window to ready the application for the suspension.

I think the first step is to try it, being sure to let the process stay suspended for a considerable amount of time (15 min to an hour at least) and see what effects this has. I would expect a series of errors, but I haven't tried it so I can't say for sure. If there are errors, then looking at the internal notification stuff that the C/R mechanism uses would be a good place to start, since it handles these types of issues for a checkpoint operation.

-- Josh

On Dec 11, 2008, at 3:08 PM, Jeff Squyres wrote: [quoted text elided]
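As a concrete sketch of the two-signal idea: the process only records the 'prepare' signal in the handler, quiesces timeout-sensitive state from its main loop, and then stops itself. The quiesce/re-arm hooks are hypothetical -- Open MPI has no calls by these names:

#include <signal.h>
#include <unistd.h>

static volatile sig_atomic_t prepare_requested = 0;

static void on_prepare(int sig) { (void)sig; prepare_requested = 1; }

int main(void)
{
    signal(SIGUSR1, on_prepare);  /* e.g. what SGE can send early */

    for (;;) {
        /* ... normal progress loop ... */
        if (prepare_requested) {
            prepare_requested = 0;
            /* Hypothetical: drain outstanding network operations and
             * tear down timeout-sensitive connections (RDMA CM etc.)
             * before stopping.
             * quiesce_network(); */
            raise(SIGSTOP);       /* 'suspend now'; SIGCONT resumes */
            /* Hypothetical: re-establish connections after resume.
             * rearm_network(); */
        }
        sleep(1);
    }
}

Doing the real work outside the handler keeps the handler async-signal-safe; it does nothing but set a flag.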
Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
On Dec 11, 2008, at 2:55 PM, Terry Dontje wrote:

Well, SGE allows you to have it send mpirun SIGUSR1 so many minutes before sending the suspend signal.

My point is that the right approach might be to work in the context of Josh's CR stuff -- he's already got hooks for "do this right before pausing for checkpoint" / "do this right after resuming", etc. Sure, we're not checkpointing, but several of the characteristics of this action are pretty similar to what is required for checkpointing/restarting. So it might be good to use that framework for it...?

-- Jeff Squyres
Cisco Systems
Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
Jeff Squyres wrote: [quoted text elided]

Well, SGE allows you to have it send mpirun SIGUSR1 so many minutes before sending the suspend signal.

--td
Re: [OMPI devel] Forwarding SIGTSTP and SIGCONT
On Dec 8, 2008, at 11:11 AM, Ralph Castain wrote:

It sounds reasonable to me. I agree with Ralf W about having mpirun send a STOP to itself - that would seem to solve the problem of stopping everything. It would seem, however, that you cannot similarly STOP the daemons, or else you won't be able to CONT the job. I'm not sure how big a deal that is - I can't think of any issue it creates offhand.

Is there any issue in the MPI comm layers if you abruptly STOP a process while it's communicating? Especially since the STOP is going to be asynchronous. Do you need to quiet networks like IB first?

It might be better to allow the MPI procs to do "something" before actually stopping. This might prevent timeout-sensitive stuff from failing (although I don't know if Josh's CR code even handles these kinds of things...?). The obvious case that I can think of is if the MPI process is stopped in the middle of an openib CM action. None of the openib CPCs can currently handle a timeout nicely.

-- Jeff Squyres
Cisco Systems
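A minimal sketch of the scheme Ralph describes: on SIGTSTP, mpirun stops the application processes and then itself, leaving the daemons running so the job can still be CONTinued later. The child-tracking here is invented for illustration; the real orted plumbing is far more involved:

#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_CHILDREN 2
static pid_t children[NUM_CHILDREN]; /* launched MPI procs, not daemons */

static void on_tstp(int sig)
{
    (void)sig;
    for (int i = 0; i < NUM_CHILDREN; ++i)
        if (children[i] > 0)
            kill(children[i], SIGSTOP); /* stop the application procs... */
    raise(SIGSTOP);                     /* ...then stop mpirun itself */
}

static void on_cont(int sig)
{
    (void)sig;
    for (int i = 0; i < NUM_CHILDREN; ++i)
        if (children[i] > 0)
            kill(children[i], SIGCONT); /* resume the job on SIGCONT */
}

int main(void)
{
    signal(SIGTSTP, on_tstp);
    signal(SIGCONT, on_cont);
    /* ... fork/exec the children, fill in children[], then wait ... */
    for (;;)
        pause();
}

Note that this stops the procs abruptly -- exactly the case Ralph and Jeff worry about -- so in practice it would need to be combined with a 'prepare' phase like the one sketched earlier in the thread.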
Re: [OMPI devel] BTL move - the notion
(chiming in a bit after the fact)

In general, I agree with most of what has been stated.

1. The BTLs should remain "owned" by Open MPI. There are OMPI member organizations in multiple projects that want to use the BTLs, but the BTLs are primarily for the Open MPI project.

2. An incremental patch approach would likely be best; my definition of that would be "small branch and merge". I strongly endorse hg or git for this; they are *VERY* good at exactly this kind of thing. Much, much larger code bases than OMPI pervasively use hg/git for the branch/patch/merge model with very good success. If you "grew up" on CVS/SVN (and earlier), this may seem counter-intuitive -- but please realize that tools have evolved significantly since then.

3. Moving the BTL code to different parts of the source tree shouldn't matter much in terms of performance and (mostly) abstractions. But we should check, just to make sure we didn't muck something up. This is a complex code base, after all.

4. Adding new functionality to the BTL (e.g., bootstrapping) is subject to #1.

5. Ralph outlined the case for tighter integration between the RTE and the BTLs. I think it's reasonable, and I agree with his case. We can add abstractions to ensure that nothing is ORTE-specific, and to ensure that we can safely handle the case where some other underlying RTE doesn't have the same capabilities (none of this stuff is likely to be in the performance-critical code path, so it's not too much of an issue). But allowing other RTEs under the OMPI MPI layer shouldn't restrict what we want to / can do with our own OMPI-specific RTE.

Just my $0.02.

On Dec 5, 2008, at 11:10 AM, Richard Graham wrote:

> I think we all agree that STCI and OMPI have different objectives and requirements. OMPI is facing the need to launch and operate at extreme scales by next summer, has received a lot of interest in having it report errors into various systems, etc. We don't have all the answers as to what will be necessary to meet these requirements, but indications so far are that tighter integration, not deeper abstraction, between the various layers will be needed. By that, I don't mean we will violate abstraction layers, but rather that the various layers need to work more as a tightly tuned instrument, with each layer operating based on a clear knowledge of how the other layers are functioning.

OMPI and STCI are two different things; I have a vested interest in both and no desire to see either go south. You have a set of requirements at LANL which are important, and we also have a set of requirements at ORNL, and as such we need to compromise on these in the code base. We have MPI-level goals, which will be accomplished in the OMPI code base, and tools and other related goals that will be accomplished in other code bases. We both need to function well at the high end, so we have the same set of goals there.

> For example, for modex-less operations, the MPI/BTLs have to know that the RTE/OS will be providing certain information. This means that they don't have to go out and discover it themselves every time. Yes, we will leave that as the default behavior so that small and/or unmanaged clusters can operate, but we have to also introduce logic that can detect when we are utilizing this alternative capability and exploit it. While we are trying our best to avoid introducing RTE-like calls into the code, the fact is that we may well have to do so (we have already identified one btl that will definitely need to). It is simply too early to make the decision to cut that off now - we don't know what the long-term impacts of such a decision will be.

This is where discussions will need to go both ways. Your changes can also impact us, and we need to agree to those changes, just as much as you need to agree with the changes we are proposing. This is not a code base focused on a single institution's requirements, and we all do our best (and I believe tend to succeed) at helping meet all of our needs.

> Finally, although I don't do much on the MPI layer, I am concerned about performance. I would tend to oppose any additional abstraction until we can measure the performance impact. Thus, I would like to see the BTL move done on a tmp branch (technology to branch up to the implementer - I don't care) so we can verify that it isn't hurting us in some unforeseeable manner.

Agreed - at least for the last phase of what we are suggesting, but we can talk about this. I am a bit confused about how the location of the source code has anything to do with how it performs at run-time. At this stage we have said nothing about changing the way the btl works, just cosmetic things. When it comes to enabling the use of stci with ompi, these issues will come up and need to be addressed very carefully. To be honest, since we
Re: [OMPI devel] 1.3 staging area?
I think that this sounds reasonable. It's actually not too much of a change from the existing CMR process:

- if your commit is applicable to the trunk, do so
*** if you intend your commit to go to the v1.3 branch, also commit it there (potentially adjusting the patch to commit cleanly in the v1.3 staging area)
- file a CMR for the r number in the v1.3 staging area
- the release tech will merge the v1.3 staging commit to the v1.3 tree

*** is the only new step.

On Dec 10, 2008, at 5:55 PM, Ralph Castain wrote:

Hi all

I'm a tad concerned about our ability to test proposed CMRs for the 1.3 branch. Given the long delays in getting 1.3 out, and the rapidly looming 1.4 milestones that many of us have in our individual projects, it is clear that the trunk is going to quickly diverge significantly from what is in the 1.3 branch. In addition, we are going to see quite a few commits occurring within a restricted time period. Thus, the fact that some proposed change does or does not pass MTT tests on the trunk at some given point in time is no longer a reliable indicator of its behavior in 1.3. Likewise, it will be difficult to isolate that "this commit is okay" when MTT can really only tell us the state of the aggregated code base.

Let me hasten to point out that this has been a recurring problem with every major release. We have discussed the problem on several occasions, but failed to reach consensus on a solution.

I would like to propose that we create a 1.3 staging branch. This branch would be opened to one individual at a time to commit proposed CMRs for the 1.3 branch. We would ask that people please include the staging branch in their MTT testing on occasions when a change has been made. Once the proposed change has been validated, it can be brought over as a single (and easy) merge to the 1.3 release branch.

I realize this may slow the passage of bug fixes somewhat, and obviously we should apply this on a case-by-case basis (e.g., a simple removal of an unused variable would hardly merit such a step). However, I believe that something like the IOF patch that needs to eventually move to 1.3, and the Windows upgrade, are examples that probably do merit this step.

Just a suggestion - hope it helps.
Ralph

-- Jeff Squyres
Cisco Systems