Hi Pak,

I can't say for certain, but I believe the problem relates to a change we made over the summer to the default universe name. I encountered a similar problem with the Eclipse folks at that time.
What happened was that Josh was encountering a problem relating to the default universe name while working on orte-ps. At that time, we restructured the default universe name to be "default-pid". This solved the orte-ps problem. However, it created a problem in persistent operations - namely, it became impossible for a process to "know" the name of the persistent daemon's universe. I'm not entirely certain that we fixed that problem. Here's how you can check:

1. Run "orted --debug --persistent --seed --scope public" in one window. You will see a bunch of diagnostic output that eventually will stop, leaving the orted waiting for commands.

2. Run "mpirun -n 1 uptime" in another window. You should see the orted window scroll a bunch of diagnostic output as the application runs. If you don't, then you know that you did NOT connect to the persistent orted - and you have found the problem.

If this is the case, the solution is actually rather trivial: just tell the orted and mpirun the name of the universe they are to use. It would look like this:

  orted --persistent --seed --scope public --universe foo
  mpirun --universe foo -n 1 uptime

If you do that in the two windows (adding the "--debug" option to the orted as before), you should see the orted window dump a bunch of diagnostic output.
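As a side note, the manual port exchange that Edgar and Pak describe below would look roughly like the following in the server and client. This is only a bare-bones sketch for illustration - the actual code is on the site referenced below, and the argv handling here is just an assumption about how the port string gets passed:

  /* server.c - open a port, print it, and wait for the client */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      char port[MPI_MAX_PORT_NAME];
      MPI_Comm client;

      MPI_Init(&argc, &argv);
      MPI_Open_port(MPI_INFO_NULL, port);
      printf("%s\n", port);        /* hand this string to the client */
      fflush(stdout);
      MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
      /* ... talk to the client over the intercommunicator ... */
      MPI_Comm_disconnect(&client);
      MPI_Close_port(port);
      MPI_Finalize();
      return 0;
  }

  /* client.c - take the port string on the command line and connect */
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Comm server;

      MPI_Init(&argc, &argv);
      MPI_Comm_connect(argv[1], MPI_INFO_NULL, 0, MPI_COMM_SELF, &server);
      /* ... talk to the server over the intercommunicator ... */
      MPI_Comm_disconnect(&server);
      MPI_Finalize();
      return 0;
  }

You would start the server first, then start the client with the printed port string as its single argument.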
Hope that helps. Please let us know what you find out - if this is the problem, we need to find a solution that allows default universe connections, or else document this clearly.

Ralph


On 9/7/06 3:55 PM, "Pak Lui" <pak....@sun.com> wrote:

> Hi Edgar,
>
> I tried starting the persistent orted before running the client/server
> executables without the MPI_Publish_name/MPI_Lookup_name, and I am still
> getting the same kind of failure that Rolf reported earlier (in trac #252).
>
> The server prints the port and I feed the port info to the client.
> Could you point out what we should have done to make this work?
>
> http://svn.open-mpi.org/trac/ompi/ticket/252
>
> Edgar Gabriel wrote:
>> Hi,
>>
>> sorry for the delay on your request.
>>
>> There are two things you have to do in order to make a client/server
>> example work with Open MPI right now (assuming you are using
>> MPI_Comm_connect/MPI_Comm_accept).
>>
>> First, you have to start the orted daemon in persistent mode, e.g.
>>
>>   orted --persistent --seed --scope public
>>
>> Second, you cannot currently use MPI_Publish_name/MPI_Lookup_name
>> across multiple jobs; this is unfortunately a known bug. (Name
>> publishing works within the same job, however.) So what you would have
>> to do is pass the port information from the MPI_Comm_accept call to
>> the other process somehow (e.g. print it with a printf statement in the
>> server application and pass it as an input argument to the client
>> application).
>>
>> Hope this helps
>> Edgar
>>
>>
>> Eng. A.A. Isola wrote:
>>> "It's not possible to connect!!!!"
>>>
>>> Hi Devel list, crossposting as this is getting weird...
>>>
>>> I did a client/server using MPI_Publish_name / MPI_Lookup_name
>>> and it runs fine on both MPICH2 and LAM-MPI but fails on Open MPI.
>>> It's not a simple failure (i.e. returning an error code); it breaks
>>> the execution line and quits. The server continues to run after the
>>> client's crash.
>>>
>>> The server also uses 100% of the CPU while running, which doesn't
>>> happen with LAM.
>>>
>>> The code is here:
>>> http://www.systemcall.com.br/rengolin/open-mpi/
>>>
>>> Open MPI version: 1.1.1
>>>
>>> Compiling:
>>> mpiCC -o server server.c
>>> mpiCC -o client client.c
>>> - or -
>>> mpiCC -o client client.c -DUSE_LOOKUP
>>>
>>> Running & Output:
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel