Re: [OMPI devel] devel Digest, Vol 801, Issue 1
I believe that the only problem with that procedure is that it automatically connects the new application with *all* pre-existing applications. There is no discrimination possible, as your client doesn't know the server's jobid, nor is there any way for it to "discover" that information. So this is fine IF you want that mode of operation (i.e., all applications running with a persistent daemon that call connect are to be fully interconnected with all predecessors). Perhaps that is adequate, but it isn't what was described to me as the desired functionality.

Of course, there may be something in the MPI code that corrects this behavior. What I'm describing is solely what is happening at the RTE level... which means, of course, that the contact info for all those procs is probably being exchanged even if the MPI layer is ignoring some of it. ;-)

On 7/17/07 7:13 AM, "Rolf vandeVaart" wrote:

> Ralph Castain wrote:
>> On 7/17/07 5:37 AM, "Jeff Squyres" wrote:
>>> On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:
>>>>> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/MPI_COMM_CONNECT models. We do support this in Open MPI, but the restrictions (in terms of ORTE) may not be sufficient for you.
>>>>
>>>> perhaps i'll experiment -- any clues as to what the orte restrictions might be?
>>>
>>> The main constraint is that you have to run a "persistent" orted that will span all your MPI_COMM_WORLDs. We have only lightly tested this scenario -- Ralph, can you comment more here?
>>
>> Actually, I'm not convinced Open MPI really supports either of those two MPI semantics. It is true that we have something in our code repository, but I'm not convinced it actually does what people think.
>>
>> There are two use-cases one must consider:
>>
>> 1. An application code spawns another job and then at some later point wants to connect to it. Our current implementation of comm_spawn does this automatically via the accept/connect procedure, so we have this covered. However, it relies upon the notion that (a) the parent job *knows* the jobid of the child, and (b) the parent sends a message to the child telling it where and how to rendezvous with it. You don't need the persistent daemon here.
>>
>> 2. A user starts one application, and then starts another (which would have to be in a separate window or batch job, as we do not support running mpirun in the background) that connects to the first. The problem here is that neither application knows the jobid of the other, has any info on how to communicate with the other, or knows a common rendezvous point. You would definitely need a persistent daemon for this use-case.
>>
>> I would have to review the code to see, but my best guess from what I remember is that we don't actually support the second use-case at this time. It would be possible to do so, albeit complicated - but I'm almost certain nobody ever implemented it. I had talked at one time about providing the necessary glue, either at the command line or (better) via some internal "magic", but never got much interest - and so never did anything about it... and I don't recall seeing anyone else make the necessary changes.
>
> FWIW, these are the instructions that we documented for OMPI v1.2 for client/server (MPI_COMM_ACCEPT and MPI_COMM_CONNECT) from different jobs.
>
> --------
>
> USING MPI CLIENT/SERVER APPLICATIONS
>
> The instructions in this section explain how to get best results when starting Open MPI client/server applications.
>
> To Start the Persistent Daemon
>
> Note - The persistent daemon needs to run on the node where mpirun is started.
>
> 1. Use the cd command to move to the directory that contains the Sun HPC ClusterTools 7 binaries.
>
>    % cd /opt/SUNWhpc/HPC7.0/bin
>
> 2. To start the persistent daemon, issue the following command, substituting the name of your MPI job's universe for univ1:
>
>    % orted --persistent --seed --scope public --universe univ1 --debug
>
> The --persistent flag to orted (the ORTE daemon) starts the persistent daemon. You also need to set the --seed and --scope public options on the same command line, as shown in the example. The optional --debug flag prints out debugging messages.
>
> TO LAUNCH THE CLIENT/SERVER JOB
>
> Note - Make sure you launch all MPI client/server jobs from the same node on which you started the persistent daemon.
>
> 1. Type the following command to launch the server application. Substitute the name of your MPI job's universe for univ1:
>
>    % ./mpirun -np 1 --universe univ1 t_accept
>
> 2. Type the following command to launch the client application, substituting the name of your MPI job's universe for univ1:
>
>    % ./mpirun -np 4 --universe univ1 t_connect
>
> If the client and server jobs span more than one node, the first job (that is, the server job) must specify on the mpirun command line all the nodes that will be used. Specifying the node names allocates the specified hosts from the entire universe of server and client jobs.
Re: [OMPI devel] devel Digest, Vol 801, Issue 1
Ralph Castain wrote:
> On 7/17/07 5:37 AM, "Jeff Squyres" wrote:
>> On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:
>>>> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/MPI_COMM_CONNECT models. We do support this in Open MPI, but the restrictions (in terms of ORTE) may not be sufficient for you.
>>>
>>> perhaps i'll experiment -- any clues as to what the orte restrictions might be?
>>
>> The main constraint is that you have to run a "persistent" orted that will span all your MPI_COMM_WORLDs. We have only lightly tested this scenario -- Ralph, can you comment more here?
>
> Actually, I'm not convinced Open MPI really supports either of those two MPI semantics. It is true that we have something in our code repository, but I'm not convinced it actually does what people think.
>
> There are two use-cases one must consider:
>
> 1. An application code spawns another job and then at some later point wants to connect to it. Our current implementation of comm_spawn does this automatically via the accept/connect procedure, so we have this covered. However, it relies upon the notion that (a) the parent job *knows* the jobid of the child, and (b) the parent sends a message to the child telling it where and how to rendezvous with it. You don't need the persistent daemon here.
>
> 2. A user starts one application, and then starts another (which would have to be in a separate window or batch job, as we do not support running mpirun in the background) that connects to the first. The problem here is that neither application knows the jobid of the other, has any info on how to communicate with the other, or knows a common rendezvous point. You would definitely need a persistent daemon for this use-case.
>
> I would have to review the code to see, but my best guess from what I remember is that we don't actually support the second use-case at this time. It would be possible to do so, albeit complicated - but I'm almost certain nobody ever implemented it. I had talked at one time about providing the necessary glue, either at the command line or (better) via some internal "magic", but never got much interest - and so never did anything about it... and I don't recall seeing anyone else make the necessary changes.

FWIW, these are the instructions that we documented for OMPI v1.2 for client/server (MPI_COMM_ACCEPT and MPI_COMM_CONNECT) from different jobs.

--------

USING MPI CLIENT/SERVER APPLICATIONS

The instructions in this section explain how to get best results when starting Open MPI client/server applications.

To Start the Persistent Daemon

Note - The persistent daemon needs to run on the node where mpirun is started.

1. Use the cd command to move to the directory that contains the Sun HPC ClusterTools 7 binaries.

   % cd /opt/SUNWhpc/HPC7.0/bin

2. To start the persistent daemon, issue the following command, substituting the name of your MPI job's universe for univ1:

   % orted --persistent --seed --scope public --universe univ1 --debug

The --persistent flag to orted (the ORTE daemon) starts the persistent daemon. You also need to set the --seed and --scope public options on the same command line, as shown in the example. The optional --debug flag prints out debugging messages.

TO LAUNCH THE CLIENT/SERVER JOB

Note - Make sure you launch all MPI client/server jobs from the same node on which you started the persistent daemon.

1. Type the following command to launch the server application. Substitute the name of your MPI job's universe for univ1:

   % ./mpirun -np 1 --universe univ1 t_accept

2. Type the following command to launch the client application, substituting the name of your MPI job's universe for univ1:

   % ./mpirun -np 4 --universe univ1 t_connect

If the client and server jobs span more than one node, the first job (that is, the server job) must specify on the mpirun command line all the nodes that will be used. Specifying the node names allocates the specified hosts from the entire universe of server and client jobs. For example, if the server runs on node0 and the client job runs only on node1, the command to launch the server must specify both nodes (using the -host node0,node1 flag) even if it uses only one process on node0.

Assuming that the persistent daemon is started on node0, the command to launch the server would look like this:

   node0% ./mpirun -np 1 --universe univ1 -host node0,node1 t_accept

The command to launch the client is:

   node0% ./mpirun -np 4 --universe univ1 -host node1 t_connect

Note - Name publishing does not work in jobs between different universes.
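[Editor's note: the t_accept and t_connect test programs referenced above are not reproduced in this thread. A minimal sketch of what such a server/client pair could look like in C follows; it assumes the port name is carried out-of-band (here, printed by the server and passed to the client as argv[1]), which is only one possible rendezvous scheme.]

/* --- t_accept-style server (sketch): open a port, accept one client --- */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    MPI_Comm client;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);   /* system-chosen port name */
    printf("server port: %s\n", port);    /* hand this string to the client */
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    /* ... exchange messages over the 'client' intercommunicator ... */
    MPI_Comm_disconnect(&client);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

/* --- t_connect-style client (sketch): argv[1] is the server's port name --- */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm server;

    MPI_Init(&argc, &argv);
    MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
    /* ... exchange messages over the 'server' intercommunicator ... */
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}

The standard alternative to passing the raw port string around is MPI_Publish_name/MPI_Lookup_name, but per the note above, name publishing does not work between different universes in this setup.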
Re: [OMPI devel] devel Digest, Vol 801, Issue 1
On 7/17/07 5:37 AM, "Jeff Squyres" wrote:

> On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:
>>> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/MPI_COMM_CONNECT models. We do support this in Open MPI, but the restrictions (in terms of ORTE) may not be sufficient for you.
>>
>> perhaps i'll experiment -- any clues as to what the orte restrictions might be?
>
> The main constraint is that you have to run a "persistent" orted that will span all your MPI_COMM_WORLDs. We have only lightly tested this scenario -- Ralph, can you comment more here?

Actually, I'm not convinced Open MPI really supports either of those two MPI semantics. It is true that we have something in our code repository, but I'm not convinced it actually does what people think.

There are two use-cases one must consider:

1. An application code spawns another job and then at some later point wants to connect to it. Our current implementation of comm_spawn does this automatically via the accept/connect procedure, so we have this covered. However, it relies upon the notion that (a) the parent job *knows* the jobid of the child, and (b) the parent sends a message to the child telling it where and how to rendezvous with it. You don't need the persistent daemon here.

2. A user starts one application, and then starts another (which would have to be in a separate window or batch job, as we do not support running mpirun in the background) that connects to the first. The problem here is that neither application knows the jobid of the other, has any info on how to communicate with the other, or knows a common rendezvous point. You would definitely need a persistent daemon for this use-case.

I would have to review the code to see, but my best guess from what I remember is that we don't actually support the second use-case at this time. It would be possible to do so, albeit complicated - but I'm almost certain nobody ever implemented it. I had talked at one time about providing the necessary glue, either at the command line or (better) via some internal "magic", but never got much interest - and so never did anything about it... and I don't recall seeing anyone else make the necessary changes.

>>> - It also likely doesn't work yet; we started the integration work and ran into a technical issue that required further discussion with Platform. They're currently looking into it; we stopped the LSF work in ORTE until they get back to us.
>>
>> i see -- i might be trying to work on the 6.x support today. can you give me any hints on what the problem was in case i run into the same issue?
>
> Something was wrong with the lsb_launch() function; using it caused a significant slowdown in the job and it generally wasn't behaving as expected. Platform issued a fix for me yesterday (i.e., a one-off/unsupported binary for development purposes) that I haven't gotten to test yet.
>
>>> - That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might* offer a way out here. But I think a) THREAD_MULTIPLE isn't working yet (other OMPI members are working on this), and b) even when THREAD_MULTIPLE works, there will be ORTE issues to deal with (canceling pending resource allocations, etc.). Ralph mentioned that someone else is working on such things on the TM/PBS/Torque side; I haven't followed that effort closely.
>>
>> it seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there are perhaps other workarounds (using threads in other ways, etc.).
>> also, i'd love to hear about the existing efforts -- i'm hoping someone working on them might be reading this ... ;)
>
> Ralph -- can you chime in on the TM/PBS/Torque efforts?

It isn't my work. I can ask the other developer whether he is interested in talking with you and/or is willing to have me make his work more public (part of it has been discussed on the public user list). I believe this is part of his PhD thesis, so I want to err on the side of caution here.
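[Editor's note on use-case 1 above: the parent-driven flow Ralph describes is what a plain MPI_Comm_spawn call gives you, since the intercommunicator it returns already embodies the rendezvous. A minimal sketch follows; the "./child" binary name is only a placeholder, not anything from this thread.]

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[2];

    MPI_Init(&argc, &argv);
    /* Spawn two copies of a child executable; "./child" is a placeholder.
       No explicit accept/connect step is needed: the returned
       intercommunicator already links the parent and child jobs. */
    MPI_Comm_spawn("./child", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);
    /* ... communicate with the children over 'children' ... */
    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}

On the child side, MPI_Comm_get_parent() returns the matching intercommunicator.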
Re: [OMPI devel] devel Digest, Vol 801, Issue 1
On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:

>> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/MPI_COMM_CONNECT models. We do support this in Open MPI, but the restrictions (in terms of ORTE) may not be sufficient for you.
>
> perhaps i'll experiment -- any clues as to what the orte restrictions might be?

The main constraint is that you have to run a "persistent" orted that will span all your MPI_COMM_WORLDs. We have only lightly tested this scenario -- Ralph, can you comment more here?

>> - It also likely doesn't work yet; we started the integration work and ran into a technical issue that required further discussion with Platform. They're currently looking into it; we stopped the LSF work in ORTE until they get back to us.
>
> i see -- i might be trying to work on the 6.x support today. can you give me any hints on what the problem was in case i run into the same issue?

Something was wrong with the lsb_launch() function; using it caused a significant slowdown in the job and it generally wasn't behaving as expected. Platform issued a fix for me yesterday (i.e., a one-off/unsupported binary for development purposes) that I haven't gotten to test yet.

>> - That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might* offer a way out here. But I think a) THREAD_MULTIPLE isn't working yet (other OMPI members are working on this), and b) even when THREAD_MULTIPLE works, there will be ORTE issues to deal with (canceling pending resource allocations, etc.). Ralph mentioned that someone else is working on such things on the TM/PBS/Torque side; I haven't followed that effort closely.
>
> it seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there are perhaps other workarounds (using threads in other ways, etc.). also, i'd love to hear about the existing efforts -- i'm hoping someone working on them might be reading this ... ;)

Ralph -- can you chime in on the TM/PBS/Torque efforts?

-- 
Jeff Squyres
Cisco Systems
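[Editor's note: for anyone experimenting along the lines discussed above, the safe pattern is to request MPI_THREAD_MULTIPLE at startup and check what the library actually granted rather than assuming it. A minimal sketch:]

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full multithreaded support; the library may grant less. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        /* Fall back to a funneled/serialized communication scheme,
           depending on the level in 'provided'. */
        printf("MPI_THREAD_MULTIPLE unavailable; got level %d\n", provided);
    }
    MPI_Finalize();
    return 0;
}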
Re: [OMPI devel] devel Digest, Vol 801, Issue 1
hi again,

>>>> i'll probably just continue experimenting on my own for the moment (tracking any updates to the main trunk LSF support) to see if i can figure it out. any advice on the best way to get such support back into the trunk, if and when it exists / is working?
>>>
>>> The *best* way would be for you to sign a third-party agreement - see the web site for details and a copy. Barring that, the only option would be to submit the code through either Jeff or I. We greatly prefer the agreement method as it is (a) less burdensome on us and (b) gives you greater flexibility.
>>
>> i'll talk to 'the man' -- it should be okay ... eventually, at least ...
>
> See http://www.open-mpi.org/community/contribute/ for details. As an open project, we always welcome new developers, but we do need to keep the IP tidy.

will do.

> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/MPI_COMM_CONNECT models. We do support this in Open MPI, but the restrictions (in terms of ORTE) may not be sufficient for you.

perhaps i'll experiment -- any clues as to what the orte restrictions might be?

> Some other random notes in no particular order:
>
> - As you noted, the LSF support is *very* new; it was just added last week.
>
> - It also likely doesn't work yet; we started the integration work and ran into a technical issue that required further discussion with Platform. They're currently looking into it; we stopped the LSF work in ORTE until they get back to us.

i see -- i might be trying to work on the 6.x support today. can you give me any hints on what the problem was in case i run into the same issue?

> - FWIW, one of the main reasons OMPI/ORTE didn't add extensive/flexible support for dynamic addition of resources was the potential for queue time. Many systems run "full" all the time, so if you try to acquire more resources, you could just sit in a queue for minutes/hours/days/weeks before getting nodes. While it is certainly possible to program with this model, we didn't really want to get into the rat's nest of corner cases that this would entail, especially since very few users are asking for it.

yeah, it does seem like the queuing issue is critical. i think as long as the requests for more resources are non-blocking, and the application itself can deal with that, it shouldn't create too many corner cases. in fact, if the application wants to block (potentially for a long time) that might be okay too (i.e. on the initial big allocation, just after some startup routine determines the needed initial resources).

> - That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might* offer a way out here. But I think a) THREAD_MULTIPLE isn't working yet (other OMPI members are working on this), and b) even when THREAD_MULTIPLE works, there will be ORTE issues to deal with (canceling pending resource allocations, etc.). Ralph mentioned that someone else is working on such things on the TM/PBS/Torque side; I haven't followed that effort closely.

it seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there are perhaps other workarounds (using threads in other ways, etc.). also, i'd love to hear about the existing efforts -- i'm hoping someone working on them might be reading this ... ;)

>> well, certainly part of the issue is the need (or at least strong preference) to support 6.2 -- but read on.
>> [SNIP LSF API info/guesswork]
>
> I am certainly not an expert on LSF (nor its API) -- I only started using it last week! Do you have any contacts to ask at Platform?
> They would likely be the best ones to discuss this with.
>
> -- 
> Jeff Squyres
> Cisco Systems

i'm in the same boat. i'll try to talk to the people here at cadence who might have said contacts at Platform.

Matt.