Hi Pak

I can't say for certain, but I believe the problem relates to a change we
made in the summer to the default universe name. I encountered a similar
problem with the Eclipse folks at that time.

What happened was that Josh ran into a problem with the default universe
name while working on orte-ps. At that time, we restructured the default
universe name to be "default-pid", which solved the orte-ps problem.

However, it created a problem in persistent operations - namely, because the
name now embeds the pid, it became impossible for a later process to "know"
the name of the persistent daemon's universe. I'm not entirely certain that
we ever fixed that problem.

Here's how you can check:

1. run "orted --debug --persistent --seed --scope public" in one window. You
will see a bunch of diagnostic output that eventually will stop, leaving the
orted waiting for commands.

2. run "mpirun -n 1 uptime" in another window. You should see the orted
window scroll a bunch of diagnostic output as the application runs. If you
don't, then you know that you did NOT connect to the persistent orted - and
you have found the problem.

If this is the case, the solution is actually rather trivial: just tell the
orted and mpirun the name of the universe they are to use. It would look
like this:

"orted --persistent --seed --scope public --universe foo"

"mpirun --universe foo -n 1 uptime

If you do that in the two windows (adding the "--debug" option to the orted
as before), you should see the orted window dump a bunch of diagnostic
output.

Hope that helps. Please let us know what you find out - if this is the
problem, we need to find a solution that allows default universe
connections, or else document this clearly.

Ralph


On 9/7/06 3:55 PM, "Pak Lui" <pak....@sun.com> wrote:

> Hi Edgar,
> 
> I tried starting the persistent orted before running the client/server
> executables without the MPI_Publish_name/MPI_Lookup_name, but I am still
> getting the same kind of failure that Rolf reported earlier (in trac #252).
> 
> The server prints the port and I feed in the port info to the client.
> Could you point out what we should have done to make this work?
> 
> http://svn.open-mpi.org/trac/ompi/ticket/252
> 
> Edgar Gabriel wrote:
>> Hi,
>> 
>> sorry for the delay on your request.
>> 
>> There are two things you have to do in order to make a client/server
>> example work with Open MPI right now (assuming you are using
>> MPI_Comm_connect/MPI_Comm_accept):
>> 
>> First, you have to start the orted daemon in a persistent mode, e.g.
>> 
>> orted --persistent --seed --scope public
>> 
>> Second, you cannot currently use MPI_Publish_name/MPI_Lookup_name
>> across multiple jobs; this is unfortunately a known bug. (Name
>> publishing works within the same job, however.) So what you have to do
>> is pass the port information for the MPI_Comm_accept call to the other
>> process somehow (e.g. print it with a printf statement in the server
>> application and pass it as an input argument to the client
>> application).
>> 
>> Hope this helps
>> Edgar
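
(For anyone following along: below is a minimal sketch of the
accept/connect pattern Edgar describes. This is illustrative code, not
the code from the ticket, and error handling is omitted.)

Server side: open a port, print it, and wait for the client to connect:

    /* server.c - print the port string, then accept one connection */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm client;

        MPI_Init(&argc, &argv);
        MPI_Open_port(MPI_INFO_NULL, port);
        printf("%s\n", port);   /* hand this string to the client */
        fflush(stdout);
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
        /* ... communicate over 'client' ... */
        MPI_Comm_disconnect(&client);
        MPI_Close_port(port);
        MPI_Finalize();
        return 0;
    }

Client side: take the printed port string as a command-line argument:

    /* client.c - connect using the port string from the server */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Comm server;

        MPI_Init(&argc, &argv);
        MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
        /* ... communicate over 'server' ... */
        MPI_Comm_disconnect(&server);
        MPI_Finalize();
        return 0;
    }

With the persistent orted running as above, you would then launch both
jobs in the same universe, quoting the port string:

    mpirun --universe foo -n 1 server
    mpirun --universe foo -n 1 client "<port string printed by server>"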
>> 
>> 
>> Eng. A.A. Isola wrote:
>>> "It's not possible to connect!!!!"
>>> 
>>> Hi Devel list, crossposting as this
>>> is getting weird...
>>> 
>>> I did a client/server using MPI_Publish_name /
>>> MPI_Lookup_name
>>> and it runs fine on both MPICH2 and LAM-MPI but fail
>>> on Open MPI. It's
>>> not a simple failure (ie. returning an error code)
>>> it breaks the 
>>> execution line and quits. The server continue to run
>>> after the 
>>> client's crash.
>>> 
>>> 
>>> The server also use 100% of CPU while
>>> running, what doesn't happen with LAM.
>>> 
>>> 
>>> The code is here:
>>> http://www.
>>> systemcall.com.br/rengolin/open-mpi/
>>> 
>>> 
>>> OpenMP version: 1.1.1
>>> 
>>> 
>>> Compiling: 
>>> mpiCC -o server server.c
>>> mpiCC -o client client.c
>>>  - or 
>>> - 
>>> mpiCC -o client client.c -DUSE_LOOKUP
>>> 
>>> 
>>> Running & Output:
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 

