Yes, they should find one another. You do have to pass the server’s URI on the mpirun command line so that each mpirun can find the server.
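The usual pattern looks something like this (a rough sketch, not copied from a working setup: the standalone server binary is typically installed as ompi-server, the option names should be checked against your version’s man pages, and the URI file path is just an example):

  # start the standalone rendezvous server; it writes its URI to a file
  ompi-server --report-uri /tmp/ompi-server.uri

  # each separately started job passes that URI to its own mpirun
  mpirun -np 1 --ompi-server file:/tmp/ompi-server.uri ./side_a
  mpirun -np 1 --ompi-server file:/tmp/ompi-server.uri ./side_b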
> On Nov 8, 2016, at 12:40 PM, Pieter Noordhuis <piet...@fb.com> wrote:
>
> I'll open an issue on GitHub with more details.
>
> As far as the accept/connect issue, do you expect the processes to find each
> other if they do a publish/lookup through the same orte-server if they are
> started separately?
>
> I can try with master as well. I'm working off of 2.0.1 now (using stable
> because I want to rule out any other issues that might be present in master).
>
> Thanks,
> Pieter
>
> From: devel <devel-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
> Sent: Tuesday, November 8, 2016 12:17:29 PM
> To: OpenMPI Devel
> Subject: Re: [OMPI devel] PMIx in 2.x
>
> Should be handled more gracefully, of course. When a proc departs, we clean up
> any published storage that wasn’t specifically indicated to be retained for a
> longer period of time (see pmix_common.h for the various supported options).
> You’ve obviously found a race condition, and we’ll have to track it down.
>
> I gather you are working off of the OMPI 2.x repo? The PMIx support over
> there is fairly far behind the OMPI master, though neither branch has seen
> any real testing with the orte-server support. I’m a little buried at the
> moment, but can try to check that out in a bit.
>
>
>> On Nov 8, 2016, at 10:30 AM, Pieter Noordhuis <piet...@fb.com> wrote:
>>
>> ... accidentally sent before finishing.
>>
>> This error happened because lookup was returning multiple published entries,
>> N-1 of which had already disconnected. Should this case be handled more
>> gracefully, or is it expected to fail like this?
>>
>> Thanks,
>> Pieter
>>
>> From: Pieter Noordhuis
>> Sent: Tuesday, November 8, 2016 10:29:02 AM
>> To: OpenMPI Devel
>> Subject: Re: [OMPI devel] PMIx in 2.x
>>
>> Ah, that's good to know. I'm trying to wire things up through orte-server on
>> 2.x now and am at the point where I can do an MPI_Publish_name on one node
>> and an MPI_Lookup_name on another (both started through their own mpirun
>> calls). Then, when the first process does an MPI_Comm_accept and the second
>> process does an MPI_Comm_connect on that port, I'm getting an unreachable
>> error from ompi/mca/bml/r2/bml_r2.c. I don't see the IPs of either process
>> being communicated, and with just the PMIx port name they won't be able to
>> connect. Is there anything I'm missing?
>>
>> Also, when trying to get this to work, I kept orte-server running between
>> runs, and saw a rather alarming error when doing a lookup:
>>
>> [[8602,0],0] recvd lookup returned data ServerName of type 3 from source [[49864,1],0]
>> [[8602,0],0] recvd lookup returned data ServerName of type 3 from source [[10062,1],0]
>> PACK-PMIX-VALUE: UNSUPPORTED TYPE 0
>> PMIX ERROR: ERROR in file src/server/pmix_server.c at line 1881
>>
>> From: devel <devel-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org <r...@open-mpi.org>
>> Sent: Monday, November 7, 2016 12:16:50 PM
>> To: OpenMPI Devel
>> Subject: Re: [OMPI devel] PMIx in 2.x
>>
>> I’m not sure your description of the 1.x behavior is entirely accurate. What
>> actually happened in that scenario is that the various mpiruns would
>> connect to each other, proxying the various MPI dynamic calls across each
>> other. You had to tell each mpirun how to find the other - this was in the
>> “port” string you might have been using. It might be possible to recreate it
>> by adding the PMIx server address of the other mpirun to the string, though
>> it would still be an ugly way to start a job.
>>
>> The other option in 1.x was to start the orte-server and let it provide the
>> rendezvous point. This option remains - it just uses PMIx calls to implement
>> the rendezvous.
>>
>>
>>> On Nov 7, 2016, at 9:23 AM, Pieter Noordhuis <piet...@fb.com> wrote:
>>>
>>> Hi folks,
>>>
>>> Question about PMIx in the 2.x tree: on 1.x I used to be able to start N
>>> individual jobs through mpirun with -np 1 and have them gradually join a
>>> single intercommunicator through MPI_Comm_accept, MPI_Comm_connect,
>>> MPI_Intercomm_create, and MPI_Intercomm_merge. The port that one of the
>>> processes would listen on included its IP address, and others would connect
>>> to that. I tried porting this code to the 2.x tree and found the port is
>>> now just an integer. Reading up on the changelogs and commit history, I
>>> found that PMIx replaced DPM starting with 2.x. Reading up on PMIx and OpenMPI,
>>> my understanding is that OpenMPI ships with a PMIx server implementation,
>>> and that all processes in the job have to be connected to this PMIx server
>>> at start. It looks like MPI_Comm_accept and MPI_Comm_connect communicate
>>> through k/v pairs in the PMIx server.
>>>
>>> This means it's no longer possible to start jobs through multiple mpirun
>>> executions and then join them into a single intercommunicator at runtime.
>>> Is my understanding correct?
>>>
>>> Thank you,
>>>
>>> Pieter
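For reference, the publish/lookup plus accept/connect pattern discussed above is roughly the following (an illustrative sketch only, not Pieter's actual reproducer; the service name "my_service", the argv-based role selection, and the use of MPI_COMM_SELF are illustrative choices, and each copy is assumed to be launched by its own mpirun pointed at the same standalone server as in the example near the top):

    /* Dynamic connect/accept sketch: run one copy with "server" as its
     * argument and one without, each under its own mpirun. */
    #include <mpi.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm inter;

        MPI_Init(&argc, &argv);

        if (argc > 1 && strcmp(argv[1], "server") == 0) {
            /* open a port, publish it under a well-known name, wait for a peer */
            MPI_Open_port(MPI_INFO_NULL, port);
            MPI_Publish_name("my_service", MPI_INFO_NULL, port);
            MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
            MPI_Unpublish_name("my_service", MPI_INFO_NULL, port);
            MPI_Close_port(port);
        } else {
            /* resolve the published name and connect to it */
            MPI_Lookup_name("my_service", MPI_INFO_NULL, port);
            MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        }

        /* ... MPI_Intercomm_merge / communication over 'inter' goes here ... */

        MPI_Comm_disconnect(&inter);
        MPI_Finalize();
        return 0;
    }

With a working rendezvous server, the lookup on the connecting side should resolve to the same port string the accepting side published.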