[OMPI devel] Which tests for larger cluster testing
I am curious which tests are being used when running tests on larger clusters. By larger clusters, I mean anything with np > 128. (I realize that is not very large, but it is bigger than most of the clusters I assume tests are being run on.) I ask because I had planned on using some of the Intel tests, but they clearly have limitations starting at np=64. To avoid mailing list clutter, feel free to just email me and I will summarize. Rolf
Re: [OMPI devel] [devel-core] [RFC] Exit without finalize
Sounds great to me. Aurelien

On Sep 11, 2007, at 13:03, Jeff Squyres wrote:

If you genericize the concept, I think it's compatible with FT: 1. during MPI_INIT, one of the MPI processes can request a "notify" exit pattern for the job: a process must notify the RTE before it actually exits (i.e., some ORTE notification during MPI_FINALIZE). If a process exits before notifying the RTE, it's an error. 1a. The default action upon error can be to kill the entire job. 1b. If you want plug-in-able error actions (e.g., not killing the entire job), I'm *assuming* that our plugin frameworks can handle that...? 2. for an FT MPI job, I assume that the MPI processes would either not perform step 1 (i.e., the default action upon process exit is nothing -- just like if you had run "mpirun -np 4 hostname"), or you would select a specific error action/plugin for what to do when a process exits without first notifying the RTE. Howzat? -- Jeff Squyres Cisco Systems
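[Editor's note: for concreteness, here is a minimal C sketch of the notify-exit pattern Jeff describes. All orte_* names below are hypothetical stand-ins, not functions that exist in the tree; the sketch only illustrates the proposed control flow.]

/* Hypothetical sketch of the proposed "notify" exit pattern.
 * None of these orte_* names exist in Open MPI; they illustrate
 * only the control flow described above. */
#include <stdbool.h>

typedef enum { EXIT_ACTION_ABORT_JOB, EXIT_ACTION_PLUGIN } exit_action_t;

static bool notify_requested = false;          /* step 1: job opted in */
static bool notified = false;                  /* process said goodbye */
static exit_action_t error_action = EXIT_ACTION_ABORT_JOB;

/* Called during MPI_INIT when a process requests the pattern (step 1).
 * 1a/1b: record whether an un-notified exit kills the job or is
 * handed to an error-handling plugin. */
void orte_request_notify_exit(exit_action_t on_error)
{
    notify_requested = true;
    error_action = on_error;
}

/* Called during MPI_FINALIZE: tell the RTE we are exiting cleanly. */
void orte_notify_exit(void)
{
    notified = true;
}

/* RTE-side reaction when a process disappears. */
void orte_on_process_exit(void)
{
    if (notify_requested && !notified) {
        if (error_action == EXIT_ACTION_ABORT_JOB) {
            /* 1a: default -- kill the entire job */
        } else {
            /* 1b: hand off to an error-handling plugin */
        }
    }
    /* An FT job (step 2) simply never requests the pattern, so a
     * process exit is not treated as an error. */
}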
Re: [OMPI devel] [devel-core] [RFC] Exit without finalize
On Sep 8, 2007, at 2:33 PM, Aurelien Bouteiller wrote:

I agree (b) is not a good idea. However, I am not very pleased by (a) either. It totally prevents any process fault-tolerance mechanism if we go that way. If we plan to add failure detection and failure management to the RTE (to avoid Finalize hanging), we should add the ability to plug in FT-specific error handlers. The default error handler should do exactly what Ralph proposes, but nowhere else (other than in this handler) should the RTE code assume that the application is aborting when a failure occurs. An FT application might just not abort, and recover instead. (b) sounds fine to me.

If you genericize the concept, I think it's compatible with FT: 1. during MPI_INIT, one of the MPI processes can request a "notify" exit pattern for the job: a process must notify the RTE before it actually exits (i.e., some ORTE notification during MPI_FINALIZE). If a process exits before notifying the RTE, it's an error. 1a. The default action upon error can be to kill the entire job. 1b. If you want plug-in-able error actions (e.g., not killing the entire job), I'm *assuming* that our plugin frameworks can handle that...? 2. for an FT MPI job, I assume that the MPI processes would either not perform step 1 (i.e., the default action upon process exit is nothing -- just like if you had run "mpirun -np 4 hostname"), or you would select a specific error action/plugin for what to do when a process exits without first notifying the RTE. Howzat? -- Jeff Squyres Cisco Systems
Re: [OMPI devel] UD BTL alltoall hangs
First off, I've managed to reproduce this with nbcbench using only 16 procs (two per node) and setting btl_ofud_sd_num to 12 -- eases debugging with fewer procs to look at. ompi_coll_tuned_alltoall_intra_basic_linear is the alltoall routine that is being called. What I'm seeing from TotalView is that some random number of procs (1-5 usually, varies from run to run) are sitting with a send and a recv outstanding to every other proc. The other procs, however, have moved on to the next collective. This is hard to see with the default nbcbench code since it calls only alltoall repeatedly -- adding a barrier after the MPI_Alltoall() call makes it easier to see, as the barrier has a different tag number and communication pattern. So what I see is a few procs stuck in alltoall, while the rest are waiting in the following barrier. I've also verified with TotalView that there are no outstanding send WQEs at the UD BTL, and all procs are polling progress. The procs in the alltoall are polling in the opal_condition_wait() called from ompi_request_wait_all(). Not sure what to ask or where to look further, other than: what should I look at to see which requests are outstanding in the PML? Andrew

George Bosilca wrote: The first step will be to figure out which version of the alltoall you're using. I suppose you use the default parameters, and then the decision function in the tuned component says it is using the linear all-to-all. As the name states, this means that every node will post one receive from every other node and then will start sending the respective fragment to every other node. This will lead to a lot of outstanding sends and receives. I doubt that the receives can cause a problem, so I expect the problem is coming from the send side. Do you have TotalView installed on odin? If yes, there is a simple way to see how many sends are pending and where... That might pinpoint [at least] the process where you should look to see what's wrong. george.

On Aug 29, 2007, at 12:37 AM, Andrew Friedley wrote: I'm having a problem with the UD BTL and hoping someone might have some input to help solve it. What I'm seeing is hangs when running alltoall benchmarks with nbcbench or an LLNL program called mpiBench -- both hang exactly the same way. With the code on the trunk, running nbcbench on IU's odin using 32 nodes and a command line like this: mpirun -np 128 -mca btl ofud,self ./nbcbench -t MPI_Alltoall -p 128-128 -s 1-262144 hangs consistently when testing 256-byte messages. There are two things I can do to make the hang go away until running at larger scale. The first is to increase the 'btl_ofud_sd_num' MCA param from its default value of 128. This allows you to run with more procs/nodes before hitting the hang, but AFAICT doesn't fix the actual problem. What this parameter does is control the maximum number of outstanding send WQEs posted at the IB level -- when the limit is reached, frags are queued on an opal_list_t and later sent by progress as IB sends complete. The other way I've found is to play games with calling mca_btl_ud_component_progress() in mca_btl_ud_endpoint_post_send(). In fact I replaced the CHECK_FRAG_QUEUES() macro used around btl_ofud_endpoint.c:77 with a version that loops on progress until a send WQE slot is available (as opposed to queueing). Same result -- I can run at larger scale, but still hit the hang eventually.

It appears that when the job hangs, progress is being polled very quickly, and after spinning for a while there are no outstanding send WQEs or queued sends in the BTL. I'm not sure where further up things are spinning/blocking, as I can't produce the hang at less than 32 nodes / 128 procs and don't have a good way of debugging that (suggestions appreciated). Furthermore, both the ob1 and dr PMLs result in the same behavior, except that DR eventually trips a watchdog timeout, fails the BTL, and terminates the job. Other collectives such as allreduce and allgather do not hang -- only alltoall. I can also reproduce the hang on LLNL's Atlas machine. Can anyone else reproduce this (Torsten might have to make a copy of nbcbench available)? Anyone have any ideas as to what's wrong? Andrew
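[Editor's note: for readers unfamiliar with the btl_ofud internals under discussion, here is a minimal, self-contained sketch of the send-WQE flow control Andrew describes. All names are illustrative stand-ins; the real code lives in btl_ofud_endpoint.c and uses opal_list_t and ibv_post_send(). This only shows the queue-or-send shape, assuming a single-threaded progress loop.]

/* Illustrative stand-in for the btl_ofud send-side flow control:
 * at most SD_NUM send WQEs may be outstanding; excess frags are
 * queued and drained by progress as completions arrive. */
#include <stddef.h>

#define SD_NUM 128                   /* btl_ofud_sd_num default */

typedef struct frag { struct frag *next; } frag_t;

static int     sd_wqe = SD_NUM;      /* free send-WQE slots */
static frag_t *pending_head = NULL;  /* frags waiting for a slot */
static frag_t *pending_tail = NULL;

static void hw_post_send(frag_t *f)  /* stands in for ibv_post_send() */
{
    (void)f;
}

/* Post a send, or queue the frag when no WQE slot is free. */
void ud_post_send(frag_t *frag)
{
    if (sd_wqe > 0) {
        sd_wqe--;
        hw_post_send(frag);          /* hand the frag to the HCA */
        return;
    }
    /* No slot free: queue the frag. Andrew's experiment replaces
     * this branch with a loop on progress until a slot frees up;
     * either way the job still hangs eventually. */
    frag->next = NULL;
    if (pending_tail != NULL) pending_tail->next = frag;
    else                      pending_head = frag;
    pending_tail = frag;
}

/* Called from component progress on each send completion:
 * release the slot and retry one queued frag. */
void ud_send_complete(void)
{
    sd_wqe++;
    if (pending_head != NULL) {
        frag_t *f = pending_head;
        pending_head = f->next;
        if (pending_head == NULL) pending_tail = NULL;
        ud_post_send(f);
    }
}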
Re: [OMPI devel] Adding a new component
Hi Aurelien, Thank you for the pointers. I was able to plug in a component to an existing framework. Thanks again, Sajjad

On 09/08/07 01:34 PM, Aurelien Bouteiller wrote: Hi Sajjad, First, it will depend on whether you are writing a new component in an existing framework (say, a new BTL for a new type of interconnect) or a totally new framework (you want a family of components that manage a totally new functionality in Open MPI). Each framework has a "base" which takes care of the component selection process. If you are just adding a component, you only need to provide a mca_mycomponent_init(bool enable_progress_threads, bool enable_mpi_threads), as described in the mca_component_t structure. The mca_framework_base_select will then take care of everything for you. If you want to add a new framework, you'll have to create a selection function yourself (along with a full bunch of other functions to populate the base of the framework). I'll give you more details on this if it is relevant for you; just ask. Aurelien

On Sep 7, 2007, at 17:21, Sajjad Tabib wrote: Hi, I am a complete newbie to Open MPI internals and just began browsing the code and reading up on slides and papers. From what I have read, I learned that I have to create a new component. What I do not know is how to make MPI aware of it, or should I say, make MPI open and select my component. I found a set of slides that briefly went over adding components. For example, they briefly described that I must add PARAM_INIT_FILE and PARAM_CONFIG_FILES options in configure.params, but I'm not sure what these mean. Does anybody know of any tutorials/documents that could help me with this? Any help is greatly appreciated. S Tabib
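[Editor's note: as a rough illustration of Aurelien's point, here is what the component-side contract looks like, sketched in C with simplified, hypothetical type names. The real declarations live in opal/mca/mca.h and the per-framework headers, and carry version and open/close fields omitted here.]

/* Simplified sketch of what a component hands to its framework's
 * base for selection. Field names are illustrative only. */
#include <stdbool.h>
#include <stddef.h>

typedef struct my_module my_module_t;   /* the component's runtime object */

typedef struct {
    const char *name;                   /* e.g. "mycomponent" */
    /* Called by the framework's base selection logic during startup;
     * returns the module this component can provide, or NULL to bow out. */
    my_module_t *(*init)(bool enable_progress_threads,
                         bool enable_mpi_threads);
} my_component_t;

static my_module_t *mca_mycomponent_init(bool enable_progress_threads,
                                         bool enable_mpi_threads)
{
    (void)enable_progress_threads;
    (void)enable_mpi_threads;
    return NULL;  /* return a module pointer when the hardware is usable */
}

/* The one symbol the MCA loader looks for in the component's DSO. */
my_component_t mca_framework_mycomponent_component = {
    "mycomponent",
    mca_mycomponent_init,
};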
[OMPI devel] Coverity
David fixed a problem this morning: Coverity wasn't running right because the directory where OMPI lived was changing every night. So a few of the old runs were pruned. -- Jeff Squyres Cisco Systems
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16088
Gleb Natapov wrote: On Tue, Sep 11, 2007 at 10:00:07AM -0500, Edgar Gabriel wrote: Gleb, in the scenario which you describe in the comment to the patch, what should happen is that the communicator with the cid which already started the allreduce will basically 'hang' until the other processes 'allow' the lower cids to continue. It should basically be blocked in the allreduce. Why? Two threads are allowed to run allreduce simultaneously for different communicators.

Are they? They are, but they might never agree on the cid. This is simply how the algorithm was designed originally -- which does not mean that it has to remain this way; this is just to explain its behavior and the intent. See the design doc for that in ompi-docs in the January 2004 repository. Let's assume that we have n procs with 2 threads each, and both threads do a comm_create at the same time, with input cid 1 and cid 2. N-1 processes let cid 1 start because that's the lower number. However, one process lets cid 2 start because the other thread was late. What would happen in the algorithm is that nobody responds to cid 2, so it would hang. As soon as the other thread with cid 1 enters the comm_create, it would be allowed to run and this operation would finish. The other threads would then allow cid 2 to enter, and the 'hanging' process would be released.

However, here is something where we might have problems with the Sun thread tests (and we discussed this with Terry already): the cid allocation algorithm as implemented in Open MPI assumes (this was/is my/our understanding of the standard) that communicator creation is a collective operation. This means you can not have a comm_create and another allreduce on the same communicator running in different threads, because these allreduces will mix up and produce nonsense results. We fixed the case where all collective operations are comm_creates, but if some of the threads are in a comm_create and some are in an allreduce on the same communicator, it won't work. Correct, but this is not what happens with the mt_coll test. mt_coll calls commdup on the same communicator in different threads concurrently, but we handle this case inside ompi_comm_nextcid().

Gleb Natapov wrote: On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote: Gleb, This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file, in the function ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid I saw this code and the comment. The problem is not with the same communicator but with different communicators. is called, we know that each communicator only handles one "communicator creation" function at a time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed. The code I removed was doing it wrongly, i.e., the algorithm sometimes is executed for different communicators simultaneously by different threads. Think about the case where the function is running for cid 1 and then another thread runs it for cid 0. cid 0 will proceed although the function is executing on another CPU. And this is not something theoretical; it is happening with Sun's thread test suite mpi_coll test case. Without the condition in the ompi_comm_register_cid (each communicator only gets registered once) your comment makes sense.

However, with the condition, your patch allows a dead-end situation where 2 processes try to create communicators in multiple threads and will never succeed, simply because they will not order the creation based on the com_id. If the algorithm is really prone to deadlock when it is concurrently executed for several different communicators (I haven't checked this), then we may want to fix the original code to really prevent two threads from entering the function, but then I don't see the reason for all those complications with ompi_comm_register_cid()/ompi_comm_unregister_cid(). The algorithm described here: http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm=en=clnk=2 in section 5.3 works without it, and we can do something similar. george.

On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote: Author: gleb Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007) New Revision: 16088 URL: https://svn.open-mpi.org/trac/ompi/changeset/16088 Log: The code tries to prevent itself from running for more than one communicator simultaneously, but is doing it incorrectly. If the function is already running for one communicator and it is called from another thread for another communicator with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid() will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevents it from running simultaneously for different communicators as far as
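[Editor's note: to make the ordering rule concrete, here is a small self-contained C toy -- not the OMPI source, and all names are made up -- of the "lowest registered cid goes first" gate the patch removed. Because every process gates on the same globally agreed quantity, concurrent creations run in the same order everywhere, which keeps the collective cid allreduces matched up.]

/* Toy of the "lowest cid goes first" rule from this thread. Two
 * threads each try to allocate a context id for their own
 * communicator; only the thread holding the lowest registered cid
 * may run the (collective) allocation step. Busy-waiting is kept
 * for brevity; the real code sleeps on a condition variable. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t cid_lock = PTHREAD_MUTEX_INITIALIZER;
static int registered[2] = {1, 2};   /* cids of creations in flight */

static int lowest_cid(void) {
    int low = 1 << 30;
    for (int i = 0; i < 2; i++)
        if (registered[i] >= 0 && registered[i] < low) low = registered[i];
    return low;
}

static void *create_comm(void *arg) {
    int my_cid = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&cid_lock);
        if (my_cid != lowest_cid()) {       /* not my turn yet */
            pthread_mutex_unlock(&cid_lock);
            continue;
        }
        /* The "collective" allocation would run here; when it is
         * done we unregister so the next-lowest cid can proceed. */
        printf("cid %d allocated its new context id\n", my_cid);
        for (int i = 0; i < 2; i++)
            if (registered[i] == my_cid) registered[i] = -1;
        pthread_mutex_unlock(&cid_lock);
        return NULL;
    }
}

int main(void) {
    int a = 1, b = 2;
    pthread_t t1, t2;
    pthread_create(&t1, NULL, create_comm, &a);
    pthread_create(&t2, NULL, create_comm, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}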
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16088
On Tue, Sep 11, 2007 at 11:30:53AM -0400, George Bosilca wrote:
> On Sep 11, 2007, at 11:05 AM, Gleb Natapov wrote:
>> On Tue, Sep 11, 2007 at 10:54:25AM -0400, George Bosilca wrote:
>>> We don't want to prevent two threads from entering the code at the same time. The algorithm you cited supports this case. There is only one moment that is
>> Are you sure it supports this case? There is a global var mask_in_use that prevents multiple access.
> I'm unable to find the mask_in_use global variable. Where is it?

I thought that by "the algorithm you cited" you meant the algorithm described in the link I provided. There is a mask_in_use global var there that IMO ensures the algorithm is executed for only one communicator at a time.

> george.
>>> critical: the local selection of the next available cid. And this is what we try to protect there. If after the first run the collective call does not manage to figure out the correct next_cid, then we will execute the while loop again. And then this condition makes sense, as only the thread running on the smallest communicator cid will continue. This insures that it will pick up the smallest next available cid, and then its reduce operation will succeed. The other threads will wait until the selection of the next available cid is unlocked.
>>>
>>> Without the code you removed we face a deadlock situation. Multiple threads will pick different next_cids on each process and they will never succeed with the reduce operation. And this is what we're trying to avoid with the test.
>> OK. I think now I get the idea behind this test. I'll restore it and leave the ompi_comm_unregister_cid() fix in place. Is this OK?
>>> george.
>>> On Sep 11, 2007, at 10:34 AM, Gleb Natapov wrote:
>>>> On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote:
>>>>> Gleb, This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file, in the function ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid
>>>> I saw this code and the comment. The problem is not with the same communicator but with different communicators.
>>>>> is called, we know that each communicator only handles one "communicator creation" function at a time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed.
>>>> The code I removed was doing it wrongly, i.e., the algorithm sometimes is executed for different communicators simultaneously by different threads. Think about the case where the function is running for cid 1 and then another thread runs it for cid 0. cid 0 will proceed although the function is executing on another CPU. And this is not something theoretical; it is happening with Sun's thread test suite mpi_coll test case.
>>>>> Without the condition in the ompi_comm_register_cid (each communicator only gets registered once) your comment makes sense. However, with the condition, your patch allows a dead-end situation where 2 processes try to create communicators in multiple threads, and they will never succeed, simply because they will not order the creation based on the com_id.
>>>> If the algorithm is really prone to deadlock when it is concurrently executed for several different communicators (I haven't checked this), then we may want to fix the original code to really prevent two threads from entering the function, but then I don't see the reason for all those complications with ompi_comm_register_cid()/ompi_comm_unregister_cid(). The algorithm described here: http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm=en=clnk=2 in section 5.3 works without it, and we can do something similar.
>>>>> george.
>>>>> On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote:
>>>>>> Author: gleb
>>>>>> Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
>>>>>> New Revision: 16088
>>>>>> URL: https://svn.open-mpi.org/trac/ompi/changeset/16088
>>>>>> Log: The code tries to prevent itself from running for more than one communicator simultaneously, but is doing it incorrectly. If the function is already running for one communicator and it is called from another thread for another communicator with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid() will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevents it from running simultaneously for different communicators as far as I can see,
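[Editor's note: for reference, here is the gist of the mask_in_use idea as described in this thread, rendered as a loose, hypothetical C sketch. This is not the paper's full algorithm, which additionally orders waiting threads by lowest existing context id.]

/* Loose sketch of the "mask_in_use" gating Gleb refers to: a shared
 * bitmask of free context ids, plus a flag ensuring only one
 * communicator creation at a time feeds the mask into its collective
 * allreduce. Hypothetical rendering only. */
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

#define MASK_WORDS 32                 /* 32*32 = 1024 context ids */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned mask[MASK_WORDS];     /* bit set = context id free */
static bool mask_in_use = false;      /* one allocation at a time */

/* Snapshot the mask for this thread's allreduce. Returns false if
 * another creation owns the mask; the caller backs off and retries. */
bool try_acquire_mask(unsigned snapshot[MASK_WORDS])
{
    pthread_mutex_lock(&lock);
    if (mask_in_use) {
        pthread_mutex_unlock(&lock);
        return false;
    }
    mask_in_use = true;
    memcpy(snapshot, mask, sizeof mask);
    pthread_mutex_unlock(&lock);
    return true;
}

/* After the allreduce agrees on a common free bit (or fails), release. */
void release_mask(void)
{
    pthread_mutex_lock(&lock);
    mask_in_use = false;
    pthread_mutex_unlock(&lock);
}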
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16088
On Sep 11, 2007, at 11:05 AM, Gleb Natapov wrote: On Tue, Sep 11, 2007 at 10:54:25AM -0400, George Bosilca wrote: We don't want to prevent two threads from entering the code at the same time. The algorithm you cited supports this case. There is only one moment that is Are you sure it supports this case? There is a global var mask_in_use that prevents multiple access.

I'm unable to find the mask_in_use global variable. Where is it? george.

critical: the local selection of the next available cid. And this is what we try to protect there. If after the first run the collective call does not manage to figure out the correct next_cid, then we will execute the while loop again. And then this condition makes sense, as only the thread running on the smallest communicator cid will continue. This insures that it will pick up the smallest next available cid, and then its reduce operation will succeed. The other threads will wait until the selection of the next available cid is unlocked. Without the code you removed we face a deadlock situation. Multiple threads will pick different next_cids on each process and they will never succeed with the reduce operation. And this is what we're trying to avoid with the test. OK. I think now I get the idea behind this test. I'll restore it and leave the ompi_comm_unregister_cid() fix in place. Is this OK? george.

On Sep 11, 2007, at 10:34 AM, Gleb Natapov wrote: On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote: Gleb, This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file, in the function ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid I saw this code and the comment. The problem is not with the same communicator but with different communicators. is called, we know that each communicator only handles one "communicator creation" function at a time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed. The code I removed was doing it wrongly, i.e., the algorithm sometimes is executed for different communicators simultaneously by different threads. Think about the case where the function is running for cid 1 and then another thread runs it for cid 0. cid 0 will proceed although the function is executing on another CPU. And this is not something theoretical; it is happening with Sun's thread test suite mpi_coll test case. Without the condition in the ompi_comm_register_cid (each communicator only gets registered once) your comment makes sense. However, with the condition, your patch allows a dead-end situation where 2 processes try to create communicators in multiple threads, and they will never succeed, simply because they will not order the creation based on the com_id. If the algorithm is really prone to deadlock when it is concurrently executed for several different communicators (I haven't checked this), then we may want to fix the original code to really prevent two threads from entering the function, but then I don't see the reason for all those complications with ompi_comm_register_cid()/ompi_comm_unregister_cid(). The algorithm described here: http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm=en=clnk=2 in section 5.3 works without it, and we can do something similar. george.

On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote: Author: gleb Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007) New Revision: 16088 URL: https://svn.open-mpi.org/trac/ompi/changeset/16088 Log: The code tries to prevent itself from running for more than one communicator simultaneously, but is doing it incorrectly. If the function is already running for one communicator and it is called from another thread for another communicator with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid() will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevents it from running simultaneously for different communicators as far as I can see, but ompi_comm_unregister_cid() assumes that it is always called for the communicator with the lowest cid, and this is not always the case. This patch removes the bogus lowest-cid check and fixes ompi_comm_unregister_cid() to properly remove the cid from the list.

Text files modified:
   trunk/ompi/communicator/comm_cid.c | 24 ++++++++++++------------
   1 files changed, 12 insertions(+), 12 deletions(-)

Modified: trunk/ompi/communicator/comm_cid.c
==============================================================================
--- trunk/ompi/communicator/comm_cid.c (original)
+++ trunk/ompi/communicator/comm_cid.c 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
@@ -11,6 +11,7 @@
  * All rights reserved.
  *
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16088
On Tue, Sep 11, 2007 at 10:00:07AM -0500, Edgar Gabriel wrote:
> Gleb,
> in the scenario which you describe in the comment to the patch, what should happen is that the communicator with the cid which already started the allreduce will basically 'hang' until the other processes 'allow' the lower cids to continue. It should basically be blocked in the allreduce.

Why? Two threads are allowed to run allreduce simultaneously for different communicators. Are they?

> However, here is something where we might have problems with the Sun thread tests (and we discussed this with Terry already): the cid allocation algorithm as implemented in Open MPI assumes (this was/is my/our understanding of the standard) that communicator creation is a collective operation. This means you can not have a comm_create and another allreduce on the same communicator running in different threads, because these allreduces will mix up and produce nonsense results. We fixed the case where all collective operations are comm_creates, but if some of the threads are in a comm_create and some are in an allreduce on the same communicator, it won't work.

Correct, but this is not what happens with the mt_coll test. mt_coll calls commdup on the same communicator in different threads concurrently, but we handle this case inside ompi_comm_nextcid().

> Gleb Natapov wrote:
> > On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote:
> >> Gleb, This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file, in the function ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid
> > I saw this code and the comment. The problem is not with the same communicator but with different communicators.
> >> is called, we know that each communicator only handles one "communicator creation" function at a time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed.
> > The code I removed was doing it wrongly, i.e., the algorithm sometimes is executed for different communicators simultaneously by different threads. Think about the case where the function is running for cid 1 and then another thread runs it for cid 0. cid 0 will proceed although the function is executing on another CPU. And this is not something theoretical; it is happening with Sun's thread test suite mpi_coll test case.
> >> Without the condition in the ompi_comm_register_cid (each communicator only gets registered once) your comment makes sense. However, with the condition, your patch allows a dead-end situation where 2 processes try to create communicators in multiple threads, and they will never succeed, simply because they will not order the creation based on the com_id.
> > If the algorithm is really prone to deadlock when it is concurrently executed for several different communicators (I haven't checked this), then we may want to fix the original code to really prevent two threads from entering the function, but then I don't see the reason for all those complications with ompi_comm_register_cid()/ompi_comm_unregister_cid(). The algorithm described here: http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm=en=clnk=2 in section 5.3 works without it, and we can do something similar.
> >> george.
> >> On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote:
> >>> Author: gleb
> >>> Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
> >>> New Revision: 16088
> >>> URL: https://svn.open-mpi.org/trac/ompi/changeset/16088
> >>> Log: The code tries to prevent itself from running for more than one communicator simultaneously, but is doing it incorrectly. If the function is already running for one communicator and it is called from another thread for another communicator with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid() will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevents it from running simultaneously for different communicators as far as I can see, but ompi_comm_unregister_cid() assumes that it is always called for the communicator with the lowest cid, and this is not always the case. This patch removes the bogus lowest-cid check and fixes ompi_comm_unregister_cid() to properly remove the cid from the list.
> >>> Text files modified:
> >>>    trunk/ompi/communicator/comm_cid.c | 24 ++++++++++++------------
> >>>    1 files changed, 12 insertions(+), 12 deletions(-)
> >>> Modified:
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16088
Gleb, in the scenario which you describe in the comment to the patch, what should happen is that the communicator with the cid which already started the allreduce will basically 'hang' until the other processes 'allow' the lower cids to continue. It should basically be blocked in the allreduce.

However, here is something where we might have problems with the Sun thread tests (and we discussed this with Terry already): the cid allocation algorithm as implemented in Open MPI assumes (this was/is my/our understanding of the standard) that communicator creation is a collective operation. This means you can not have a comm_create and another allreduce on the same communicator running in different threads, because these allreduces will mix up and produce nonsense results. We fixed the case where all collective operations are comm_creates, but if some of the threads are in a comm_create and some are in an allreduce on the same communicator, it won't work.

Gleb Natapov wrote: On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote: Gleb, This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file, in the function ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid I saw this code and the comment. The problem is not with the same communicator but with different communicators. is called, we know that each communicator only handles one "communicator creation" function at a time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed. The code I removed was doing it wrongly, i.e., the algorithm sometimes is executed for different communicators simultaneously by different threads. Think about the case where the function is running for cid 1 and then another thread runs it for cid 0. cid 0 will proceed although the function is executing on another CPU. And this is not something theoretical; it is happening with Sun's thread test suite mpi_coll test case. Without the condition in the ompi_comm_register_cid (each communicator only gets registered once) your comment makes sense. However, with the condition, your patch allows a dead-end situation where 2 processes try to create communicators in multiple threads, and they will never succeed, simply because they will not order the creation based on the com_id. If the algorithm is really prone to deadlock when it is concurrently executed for several different communicators (I haven't checked this), then we may want to fix the original code to really prevent two threads from entering the function, but then I don't see the reason for all those complications with ompi_comm_register_cid()/ompi_comm_unregister_cid(). The algorithm described here: http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm=en=clnk=2 in section 5.3 works without it, and we can do something similar. george.

On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote: Author: gleb Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007) New Revision: 16088 URL: https://svn.open-mpi.org/trac/ompi/changeset/16088 Log: The code tries to prevent itself from running for more than one communicator simultaneously, but is doing it incorrectly. If the function is already running for one communicator and it is called from another thread for another communicator with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid() will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevents it from running simultaneously for different communicators as far as I can see, but ompi_comm_unregister_cid() assumes that it is always called for the communicator with the lowest cid, and this is not always the case. This patch removes the bogus lowest-cid check and fixes ompi_comm_unregister_cid() to properly remove the cid from the list.

Text files modified:
   trunk/ompi/communicator/comm_cid.c | 24 ++++++++++++------------
   1 files changed, 12 insertions(+), 12 deletions(-)

Modified: trunk/ompi/communicator/comm_cid.c
==============================================================================
--- trunk/ompi/communicator/comm_cid.c (original)
+++ trunk/ompi/communicator/comm_cid.c 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
@@ -11,6 +11,7 @@
  * All rights reserved.
  * Copyright (c) 2006-2007 University of Houston. All rights reserved.
  * Copyright (c) 2007 Cisco, Inc. All rights reserved.
+ * Copyright (c) 2007 Voltaire All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -170,15 +171,6 @@
          * This is the real algorithm described in the doc
          */
-        OPAL_THREAD_LOCK(&ompi_cid_lock);
-        if (comm->c_contextid != ompi_comm_lowest_cid() ) {
-            /* if not
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16088
We don't want to prevent two threads from entering the code at the same time. The algorithm you cited supports this case. There is only one moment that is critical: the local selection of the next available cid. And this is what we try to protect there. If after the first run the collective call does not manage to figure out the correct next_cid, then we will execute the while loop again. And then this condition makes sense, as only the thread running on the smallest communicator cid will continue. This insures that it will pick up the smallest next available cid, and then its reduce operation will succeed. The other threads will wait until the selection of the next available cid is unlocked.

Without the code you removed we face a deadlock situation. Multiple threads will pick different next_cids on each process and they will never succeed with the reduce operation. And this is what we're trying to avoid with the test. george.

On Sep 11, 2007, at 10:34 AM, Gleb Natapov wrote: On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote: Gleb, This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file, in the function ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid I saw this code and the comment. The problem is not with the same communicator but with different communicators. is called, we know that each communicator only handles one "communicator creation" function at a time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed. The code I removed was doing it wrongly, i.e., the algorithm sometimes is executed for different communicators simultaneously by different threads. Think about the case where the function is running for cid 1 and then another thread runs it for cid 0. cid 0 will proceed although the function is executing on another CPU. And this is not something theoretical; it is happening with Sun's thread test suite mpi_coll test case. Without the condition in the ompi_comm_register_cid (each communicator only gets registered once) your comment makes sense. However, with the condition, your patch allows a dead-end situation where 2 processes try to create communicators in multiple threads, and they will never succeed, simply because they will not order the creation based on the com_id. If the algorithm is really prone to deadlock when it is concurrently executed for several different communicators (I haven't checked this), then we may want to fix the original code to really prevent two threads from entering the function, but then I don't see the reason for all those complications with ompi_comm_register_cid()/ompi_comm_unregister_cid(). The algorithm described here: http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm=en=clnk=2 in section 5.3 works without it, and we can do something similar. george.

On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote: Author: gleb Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007) New Revision: 16088 URL: https://svn.open-mpi.org/trac/ompi/changeset/16088 Log: The code tries to prevent itself from running for more than one communicator simultaneously, but is doing it incorrectly. If the function is already running for one communicator and it is called from another thread for another communicator with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid() will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevents it from running simultaneously for different communicators as far as I can see, but ompi_comm_unregister_cid() assumes that it is always called for the communicator with the lowest cid, and this is not always the case. This patch removes the bogus lowest-cid check and fixes ompi_comm_unregister_cid() to properly remove the cid from the list.

Text files modified:
   trunk/ompi/communicator/comm_cid.c | 24 ++++++++++++------------
   1 files changed, 12 insertions(+), 12 deletions(-)

Modified: trunk/ompi/communicator/comm_cid.c
==============================================================================
--- trunk/ompi/communicator/comm_cid.c (original)
+++ trunk/ompi/communicator/comm_cid.c 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
@@ -11,6 +11,7 @@
  * All rights reserved.
  * Copyright (c) 2006-2007 University of Houston. All rights reserved.
  * Copyright (c) 2007 Cisco, Inc. All rights reserved.
+ * Copyright (c) 2007 Voltaire All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -170,15 +171,6 @@
          * This is the real algorithm described in the doc
          */
-        OPAL_THREAD_LOCK(&ompi_cid_lock);
-        if (comm->c_contextid !=
Re: [MTT devel] [MTT svn] svn:mtt-svn r998
Ethan -- Could you show the use case that motivated this change? Thanks.

On Sep 7, 2007, at 11:52 AM, emall...@osl.iu.edu wrote: Author: emallove Date: 2007-09-07 11:52:04 EDT (Fri, 07 Sep 2007) New Revision: 998 URL: https://svn.open-mpi.org/trac/mtt/changeset/998 Log: Escape the Perl regular expression quantifiers in `::OMPI::find_network` (for test names such as `mpic++`).

Text files modified:
   tmp/jms-new-parser/lib/MTT/Values/Functions/MPI/OMPI.pm | 3 +++
   1 files changed, 3 insertions(+), 0 deletions(-)

Modified: tmp/jms-new-parser/lib/MTT/Values/Functions/MPI/OMPI.pm
==============================================================================
--- tmp/jms-new-parser/lib/MTT/Values/Functions/MPI/OMPI.pm (original)
+++ tmp/jms-new-parser/lib/MTT/Values/Functions/MPI/OMPI.pm 2007-09-07 11:52:04 EDT (Fri, 07 Sep 2007)
@@ -98,6 +98,9 @@
     # Ignore argv[0]
     $str =~ s/^\s*\S+\s*(.+)$/\1/;
 
+    # Escape the quantifiers (for test names such as "mpi2c++")
+    $final =~ s/(\?|\*|\+|\{|\})/\\$1/g;
+
     # Ignore everything beyond $final
     $str =~ s/^(.+)\s*$final.+$/\1/;
     Debug("Examining: $str\n");

-- Jeff Squyres Cisco Systems
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16088
On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote:
> Gleb,
> This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file, in the function ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid

I saw this code and the comment. The problem is not with the same communicator but with different communicators.

> is called, we know that each communicator only handles one "communicator creation" function at a time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed.

The code I removed was doing it wrongly, i.e., the algorithm sometimes is executed for different communicators simultaneously by different threads. Think about the case where the function is running for cid 1 and then another thread runs it for cid 0. cid 0 will proceed although the function is executing on another CPU. And this is not something theoretical; it is happening with Sun's thread test suite mpi_coll test case.

> Without the condition in the ompi_comm_register_cid (each communicator only gets registered once) your comment makes sense. However, with the condition, your patch allows a dead-end situation where 2 processes try to create communicators in multiple threads, and they will never succeed, simply because they will not order the creation based on the com_id.

If the algorithm is really prone to deadlock when it is concurrently executed for several different communicators (I haven't checked this), then we may want to fix the original code to really prevent two threads from entering the function, but then I don't see the reason for all those complications with ompi_comm_register_cid()/ompi_comm_unregister_cid(). The algorithm described here: http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm=en=clnk=2 in section 5.3 works without it, and we can do something similar.

> george.
> On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote:
>> Author: gleb
>> Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
>> New Revision: 16088
>> URL: https://svn.open-mpi.org/trac/ompi/changeset/16088
>> Log: The code tries to prevent itself from running for more than one communicator simultaneously, but is doing it incorrectly. If the function is already running for one communicator and it is called from another thread for another communicator with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid() will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevents it from running simultaneously for different communicators as far as I can see, but ompi_comm_unregister_cid() assumes that it is always called for the communicator with the lowest cid, and this is not always the case. This patch removes the bogus lowest-cid check and fixes ompi_comm_unregister_cid() to properly remove the cid from the list.
>>
>> Text files modified:
>>    trunk/ompi/communicator/comm_cid.c | 24 ++++++++++++------------
>>    1 files changed, 12 insertions(+), 12 deletions(-)
>>
>> Modified: trunk/ompi/communicator/comm_cid.c
>> ==============================================================================
>> --- trunk/ompi/communicator/comm_cid.c (original)
>> +++ trunk/ompi/communicator/comm_cid.c 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
>> @@ -11,6 +11,7 @@
>>  * All rights reserved.
>>  * Copyright (c) 2006-2007 University of Houston. All rights reserved.
>>  * Copyright (c) 2007 Cisco, Inc. All rights reserved.
>> + * Copyright (c) 2007 Voltaire All rights reserved.
>>  * $COPYRIGHT$
>>  *
>>  * Additional copyrights may follow
>> @@ -170,15 +171,6 @@
>>      * This is the real algorithm described in the doc
>>      */
>>
>> -    OPAL_THREAD_LOCK(&ompi_cid_lock);
>> -    if (comm->c_contextid != ompi_comm_lowest_cid() ) {
>> -        /* if not lowest cid, we do not continue, but sleep and try again */
>> -        OPAL_THREAD_UNLOCK(&ompi_cid_lock);
>> -        continue;
>> -    }
>> -    OPAL_THREAD_UNLOCK(&ompi_cid_lock);
>> -
>> -
>>     for (i=start; i < mca_pml.pml_max_contextid ; i++) {
>>         flag=ompi_pointer_array_test_and_set_item(&ompi_mpi_communicators, i, comm);
>> @@ -365,10 +357,18 @@
>>
>> static int ompi_comm_unregister_cid (uint32_t cid)
>> {
>> -    ompi_comm_reg_t *regcom=NULL;
>> -    opal_list_item_t *item=opal_list_remove_first(&ompi_registered_comms);
>> +    ompi_comm_reg_t *regcom;
>> +    opal_list_item_t *item;
>>
>> -    regcom = (ompi_comm_reg_t *) item;
>> +    for (item = opal_list_get_first(&ompi_registered_comms);
Re: [OMPI devel] [OMPI svn-full] svn:open-mpi r16088
Gleb,

This patch is not correct. The code preventing the registration of the same communicator twice is later in the code (same file, in the function ompi_comm_register_cid, line 326). Once the function ompi_comm_register_cid is called, we know that each communicator only handles one "communicator creation" function at a time. Therefore, we want to give priority to the smallest com_id, which is what happens in the code you removed.

Without the condition in the ompi_comm_register_cid (each communicator only gets registered once) your comment makes sense. However, with the condition, your patch allows a dead-end situation where 2 processes try to create communicators in multiple threads, and they will never succeed, simply because they will not order the creation based on the com_id.

george.

On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote:

Author: gleb
Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
New Revision: 16088
URL: https://svn.open-mpi.org/trac/ompi/changeset/16088

Log: The code tries to prevent itself from running for more than one communicator simultaneously, but is doing it incorrectly. If the function is already running for one communicator and it is called from another thread for another communicator with a lower cid, the check comm->c_contextid != ompi_comm_lowest_cid() will fail and the function will be executed for two different communicators by two threads simultaneously. There is nothing in the algorithm that prevents it from running simultaneously for different communicators as far as I can see, but ompi_comm_unregister_cid() assumes that it is always called for the communicator with the lowest cid, and this is not always the case. This patch removes the bogus lowest-cid check and fixes ompi_comm_unregister_cid() to properly remove the cid from the list.

Text files modified:
   trunk/ompi/communicator/comm_cid.c | 24 ++++++++++++------------
   1 files changed, 12 insertions(+), 12 deletions(-)

Modified: trunk/ompi/communicator/comm_cid.c
==============================================================================
--- trunk/ompi/communicator/comm_cid.c (original)
+++ trunk/ompi/communicator/comm_cid.c 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
@@ -11,6 +11,7 @@
  * All rights reserved.
  * Copyright (c) 2006-2007 University of Houston. All rights reserved.
  * Copyright (c) 2007 Cisco, Inc. All rights reserved.
+ * Copyright (c) 2007 Voltaire All rights reserved.
  * $COPYRIGHT$
  *
  * Additional copyrights may follow
@@ -170,15 +171,6 @@
          * This is the real algorithm described in the doc
          */
 
-        OPAL_THREAD_LOCK(&ompi_cid_lock);
-        if (comm->c_contextid != ompi_comm_lowest_cid() ) {
-            /* if not lowest cid, we do not continue, but sleep and try again */
-            OPAL_THREAD_UNLOCK(&ompi_cid_lock);
-            continue;
-        }
-        OPAL_THREAD_UNLOCK(&ompi_cid_lock);
-
-
         for (i=start; i < mca_pml.pml_max_contextid ; i++) {
             flag=ompi_pointer_array_test_and_set_item(&ompi_mpi_communicators, i, comm);
 
@@ -365,10 +357,18 @@
 
 static int ompi_comm_unregister_cid (uint32_t cid)
 {
-    ompi_comm_reg_t *regcom=NULL;
-    opal_list_item_t *item=opal_list_remove_first(&ompi_registered_comms);
+    ompi_comm_reg_t *regcom;
+    opal_list_item_t *item;
 
-    regcom = (ompi_comm_reg_t *) item;
+    for (item = opal_list_get_first(&ompi_registered_comms);
+         item != opal_list_get_end(&ompi_registered_comms);
+         item = opal_list_get_next(item)) {
+        regcom = (ompi_comm_reg_t *)item;
+        if(regcom->cid == cid) {
+            opal_list_remove_item(&ompi_registered_comms, item);
+            break;
+        }
+    }
     OBJ_RELEASE(regcom);
     return OMPI_SUCCESS;
 }