Yes, they should find one another. You do have to pass the server’s URI on the 
mpirun command line so that each mpirun can find the server.
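For example, a minimal sketch of the wiring (file and program names here are 
placeholders, and the exact option spellings may vary by version, so check 
orte-server --help and mpirun --help): have orte-server write its URI to a 
file, then point every mpirun at that file so the separately launched jobs 
share one rendezvous point.

  orte-server --report-uri /tmp/ompi-server.uri &
  mpirun -np 1 --ompi-server file:/tmp/ompi-server.uri ./publisher
  mpirun -np 1 --ompi-server file:/tmp/ompi-server.uri ./looker

If I recall correctly, the URI string can also be passed directly to 
--ompi-server instead of via the file: prefix.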

> On Nov 8, 2016, at 12:40 PM, Pieter Noordhuis <piet...@fb.com> wrote:
> 
> I'll open an issue on GitHub with more details.
> 
> As for the accept/connect issue: do you expect the processes to find each 
> other if they do a publish/lookup through the same orte-server when they are 
> started separately?
> 
> I can try with master as well. I'm working off of 2.0.1 now (using stable 
> because I want to rule out any other issues that might be present in master).
> 
> Thanks,
> Pieter
> From: devel <devel-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org 
> <r...@open-mpi.org>
> Sent: Tuesday, November 8, 2016 12:17:29 PM
> To: OpenMPI Devel
> Subject: Re: [OMPI devel] PMIx in 2.x
>  
> Should be handled more gracefully, of course. When a proc departs, we cleanup 
> any published storage that wasn’t specifically indicated to be retained for a 
> longer period of time (see pmix_common.h for the various supported options). 
> You’ve obviously found a race condition, and we’ll have to track it down.
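> As a purely illustrative sketch of what I mean by retention (the key, value, 
> and persistence level below are placeholders; the full set of options is in 
> pmix_common.h), a publisher can attach a persistence directive so the entry 
> outlives the publishing process:
> 
>   #include <pmix.h>
> 
>   pmix_info_t *info;
>   pmix_persistence_t persist = PMIX_PERSIST_APP;  /* retain until the app completes */
>   PMIX_INFO_CREATE(info, 2);
>   /* the data being published: an arbitrary key/value pair (placeholder values) */
>   PMIX_INFO_LOAD(&info[0], "ServerName", "tcp://10.0.0.1:1234", PMIX_STRING);
>   /* the directive: keep this entry past the publisher's departure */
>   PMIX_INFO_LOAD(&info[1], PMIX_PERSISTENCE, &persist, PMIX_PERSIST);
>   if (PMIX_SUCCESS != PMIx_Publish(info, 2)) {
>       /* handle the error */
>   }
>   PMIX_INFO_FREE(info, 2);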
> 
> I gather you are working off of the OMPI 2.x repo? The PMIx support over 
> there is fairly far behind the OMPI master, though neither branch has seen 
> any real testing with the orte-server support. I’m a little buried at the 
> moment, but can try to check that out in a bit.
> 
> 
>> On Nov 8, 2016, at 10:30 AM, Pieter Noordhuis <piet...@fb.com> wrote:
>> 
>> ... accidentally sent before finishing.
>> 
>> This error happened because lookup was returning multiple published 
>> entries, N-1 of which came from processes that had already disconnected. 
>> Should this case be handled more gracefully, or is it expected to fail like 
>> this?
>> 
>> Thanks,
>> Pieter
>> From: Pieter Noordhuis
>> Sent: Tuesday, November 8, 2016 10:29:02 AM
>> To: OpenMPI Devel
>> Subject: Re: [OMPI devel] PMIx in 2.x
>>  
>> Ah, that's good to know. I'm trying to wire things up through orte-server 
>> on 2.x now and am at the point where I can do an MPI_Publish_name on one 
>> node and an MPI_Lookup_name on another (both started through their own 
>> mpirun calls). Then, when the first process does an MPI_Comm_accept and the 
>> second process does an MPI_Comm_connect on that port, I'm getting an 
>> unreachable error from ompi/mca/bml/r2/bml_r2.c. I don't see the IP 
>> addresses of either process being communicated, and with just the PMIx port 
>> name they won't be able to connect. Is there anything I'm missing?
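>> 
>> In code, the pattern is roughly the following (the service name is a 
>> placeholder and error handling is omitted):
>> 
>>   /* side A, first mpirun: publish a port and wait for a connection */
>>   char port[MPI_MAX_PORT_NAME];
>>   MPI_Comm intercomm;
>>   MPI_Open_port(MPI_INFO_NULL, port);
>>   MPI_Publish_name("my-service", MPI_INFO_NULL, port);
>>   MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);
>> 
>>   /* side B, second mpirun: look up the port and connect to it */
>>   char port[MPI_MAX_PORT_NAME];
>>   MPI_Comm intercomm;
>>   MPI_Lookup_name("my-service", MPI_INFO_NULL, port);
>>   MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &intercomm);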
>> 
>> Also, when trying to get this to work, I kept orte-server running between 
>> runs, and saw a rather alarming error when doing a lookup:
>> [[8602,0],0] recvd lookup returned data ServerName of type 3 from source [[49864,1],0]
>> [[8602,0],0] recvd lookup returned data ServerName of type 3 from source [[10062,1],0]
>> PACK-PMIX-VALUE: UNSUPPORTED TYPE 0
>> PMIX ERROR: ERROR in file src/server/pmix_server.c at line 1881
>> 
>> From: devel <devel-boun...@lists.open-mpi.org> on behalf of r...@open-mpi.org 
>> <r...@open-mpi.org>
>> Sent: Monday, November 7, 2016 12:16:50 PM
>> To: OpenMPI Devel
>> Subject: Re: [OMPI devel] PMIx in 2.x
>>  
>> I’m not sure your description of the 1.x behavior is entirely accurate. What 
>> actually happened in that scenario is that the various mpiruns would connect 
>> to each other and proxy the MPI dynamic calls between them. You had to tell 
>> each mpirun how to find the other - this was in the “port” string you might 
>> have been using. It might be possible to recreate that behavior by adding 
>> the PMIx server address of the other mpirun to the string, though it would 
>> still be an ugly way to start a job.
>> 
>> The other option in 1.x was to start the orte-server and let it provide the 
>> rendezvous point. This option remains - it just uses PMIx calls to implement 
>> the rendezvous.
>> 
>> 
>>> On Nov 7, 2016, at 9:23 AM, Pieter Noordhuis <piet...@fb.com> wrote:
>>> 
>>> Hi folks,
>>> 
>>> Question about PMIx in the 2.x tree: on 1.x I used to be able to start N 
>>> individual jobs through mpirun with -np 1 and have them gradually join a 
>>> single intercommunicator through MPI_Comm_accept, MPI_Comm_connect, 
>>> MPI_Intercomm_create, and MPI_Intercomm_merge. The port that one of the 
>>> processes would listen on included its IP address, and the others would 
>>> connect to that. I tried porting this code to the 2.x tree and found that 
>>> the port is now just an integer. Reading the changelogs and commit history, 
>>> I found that PMIx replaced DPM starting with 2.x. From what I have read 
>>> about PMIx and OpenMPI, my understanding is that OpenMPI ships with a PMIx 
>>> server implementation, and that all processes in the job have to be 
>>> connected to this PMIx server at start. It looks like MPI_Comm_accept and 
>>> MPI_Comm_connect communicate through k/v pairs in the PMIx server.
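>>> 
>>> Roughly, each new singleton connected and was merged into the existing 
>>> group, along these lines (the port exchange and error handling are omitted; 
>>> this is an illustrative sketch, not my exact code):
>>> 
>>>   MPI_Comm intracomm = MPI_COMM_SELF;   /* grows as peers join */
>>>   MPI_Comm inter, merged;
>>> 
>>>   /* on the existing group: accept the newcomer, then merge it in */
>>>   MPI_Comm_accept(port, MPI_INFO_NULL, 0, intracomm, &inter);
>>>   MPI_Intercomm_merge(inter, 0 /* low side */, &merged);
>>>   intracomm = merged;   /* repeat for the next joiner */
>>> 
>>>   /* on the joining singleton */
>>>   MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
>>>   MPI_Intercomm_merge(inter, 1 /* high side */, &merged);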
>>> 
>>> This means it's no longer possible to start jobs through multiple mpirun 
>>> executions and then join them into a single intercommunicator at runtime. 
>>> Is my understanding correct?
>>> 
>>> Thank you,
>>> 
>>> Pieter

_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
