Re: [OMPI devel] devel Digest, Vol 801, Issue 1

2007-07-17 Thread Ralph Castain
I believe that the only problem with that procedure is that it automatically
connects the new application with *all* pre-existing applications. There is
no discrimination possible as your client doesn't know the server's jobid,
nor is there any way for it to "discover" that information.

So this is fine IF you want that mode of operation (i.e., all applications
running with a persistent daemon that call connect are to be fully
interconnected to all predecessors). Perhaps that is adequate, but it isn't
the functionality that was described to me as desired.

Of course, there may be something in the MPI code that corrects this
behavior. What I'm describing is solely what is happening at the RTE
level...which means, of course, that the contact info for all those procs is
probably being exchanged even if the MPI layer is ignoring some of it. ;-)


On 7/17/07 7:13 AM, "Rolf vandeVaart"  wrote:

> Ralph Castain wrote:
>> 
>> On 7/17/07 5:37 AM, "Jeff Squyres"  wrote:
>> 
>>   
>>> On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:
>>> 
>>> 
> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
> MPI_COMM_CONNECT models.  We do support this in Open MPI, but the
> restrictions (in terms of ORTE) may not be sufficient for you.
> 
 perhaps i'll experiment -- any clues as to what the orte
 restrictions might be?
   
>>> The main constraint is that you have to run a "persistent" orted that
>>> will span all your MPI_COMM_WORLD's.  We have only lightly tested
>>> this scenario -- Ralph, can you comment more here?
>>> 
>> 
>> Actually, I'm not convinced Open MPI really supports either of those two MPI
>> semantics. It is true that we have something in our code repository, but I'm
>> not convinced it actually does what people think.
>> 
>> There are two use-cases one must consider:
>> 
>> 1. an application code spawns another job and then at some later point wants
>> to connect to it. Our current implementation of comm_spawn does this
>> automatically via the accept/connect procedure, so we have this covered.
>> However, it relies upon the notion that (a) the parent job *knows* the jobid
>> of the child, and (b) the parent sends a message to the child telling it
>> where and how to rendezvous with it. You don't need the persistent daemon
>> here.
>> 
>> 2. a user starts one application, and then starts another (would have to be
>> in a separate window or batch job as we do not support running mpirun in the
>> background) that connects to the first. The problem here is that neither
>> application knows the other's jobid, has any info on how to communicate
>> with the other, or knows a common rendezvous point. You would definitely
>> need a persistent daemon for this use-case.
>> 
>> I would have to review the code to see, but my best guess from what I
>> remember is that we don't actually support the second use-case at this time.
>> It would be possible to do so, albeit complicated - but I'm almost certain
>> nobody ever implemented it. I had talked at one time about providing the
>> necessary glue, either at the command line or (better) via some internal
>> "magic", but never got much interest - and so never did anything about
>> it...and I don't recall seeing anyone else make the necessary changes.
>>   
> FWIW, these are the instructions that we documented for OMPI v1.2 for
> client/server
> (MPI_COMM_ACCEPT and MPI_COMM_CONNECT) from different jobs.
> 
> -----
> 
> USING MPI CLIENT/SERVER APPLICATIONS
> 
> The instructions in this section explain how to get best results when
> starting Open
> MPI client/server applications.
> 
> To Start the Persistent Daemon
> Note – The persistent daemon needs to run on the node where mpirun is
> started.
> 1. Use the cd command to move to the directory that contains the Sun HPC
> ClusterTools 7 binaries.
> % cd /opt/SUNWhpc/HPC7.0/bin
> 2. To start the persistent daemon, issue the following command, substituting
> the name of your MPI job’s universe for univ1:
> % orted --persistent --seed --scope public --universe univ1 --debug
> 
> The --persistent flag to orted (the ORTE daemon) starts the persistent
> daemon.
> You also need to set the --seed and --scope public options on the same
> command line, as shown in the example. The optional --debug flag prints out
> debugging messages.
> 
> To Launch the Client/Server Job
> Note – Make sure you launch all MPI client/server jobs from the same node on
> which you started the persistent daemon.
> 1. Type the following command to launch the server application. Substitute
> the name of your MPI job’s universe for univ1:
> % ./mpirun -np 1 --universe univ1 t_accept
> 2. Type the following command to launch the client application, substituting
> the name of your MPI job’s universe for univ1:
> % ./mpirun -np 4 --universe univ1 t_connect
> 
> If the 

Re: [OMPI devel] devel Digest, Vol 801, Issue 1

2007-07-17 Thread Rolf vandeVaart

Ralph Castain wrote:

> On 7/17/07 5:37 AM, "Jeff Squyres"  wrote:
> 
>> On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:
>> 
>>>> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
>>>> MPI_COMM_CONNECT models.  We do support this in Open MPI, but the
>>>> restrictions (in terms of ORTE) may not be sufficient for you.
>>> 
>>> perhaps i'll experiment -- any clues as to what the orte
>>> restrictions might be?
>> 
>> The main constraint is that you have to run a "persistent" orted that
>> will span all your MPI_COMM_WORLD's.  We have only lightly tested
>> this scenario -- Ralph, can you comment more here?
> 
> Actually, I'm not convinced Open MPI really supports either of those two MPI
> semantics. It is true that we have something in our code repository, but I'm
> not convinced it actually does what people think.
> 
> There are two use-cases one must consider:
> 
> 1. an application code spawns another job and then at some later point wants
> to connect to it. Our current implementation of comm_spawn does this
> automatically via the accept/connect procedure, so we have this covered.
> However, it relies upon the notion that (a) the parent job *knows* the jobid
> of the child, and (b) the parent sends a message to the child telling it
> where and how to rendezvous with it. You don't need the persistent daemon
> here.
> 
> 2. a user starts one application, and then starts another (would have to be
> in a separate window or batch job as we do not support running mpirun in the
> background) that connects to the first. The problem here is that neither
> application knows the other's jobid, has any info on how to communicate
> with the other, or knows a common rendezvous point. You would definitely
> need a persistent daemon for this use-case.
> 
> I would have to review the code to see, but my best guess from what I
> remember is that we don't actually support the second use-case at this time.
> It would be possible to do so, albeit complicated - but I'm almost certain
> nobody ever implemented it. I had talked at one time about providing the
> necessary glue, either at the command line or (better) via some internal
> "magic", but never got much interest - and so never did anything about
> it...and I don't recall seeing anyone else make the necessary changes.
  
FWIW, these are the instructions that we documented for OMPI v1.2 for
client/server (MPI_COMM_ACCEPT and MPI_COMM_CONNECT) from different jobs.

-

USING MPI CLIENT/SERVER APPLICATIONS

The instructions in this section explain how to get best results when
starting Open MPI client/server applications.

To Start the Persistent Daemon
Note – The persistent daemon needs to run on the node where mpirun is 
started.

1. Use the cd command to move to the directory that contains the Sun HPC
ClusterTools 7 binaries.
% cd /opt/SUNWhpc/HPC7.0/bin
2. To start the persistent daemon, issue the following command, substituting
the name of your MPI job’s universe for univ1:
% orted --persistent --seed --scope public --universe univ1 --debug

The --persistent flag to orted (the ORTE daemon) starts the persistent
daemon. You also need to set the --seed and --scope public options on the
same command line, as shown in the example. The optional --debug flag prints
out debugging messages.

To Launch the Client/Server Job
Note – Make sure you launch all MPI client/server jobs from the same node on
which you started the persistent daemon.
1. Type the following command to launch the server application. Substitute
the name of your MPI job’s universe for univ1:
% ./mpirun -np 1 --universe univ1 t_accept
2. Type the following command to launch the client application, substituting
the name of your MPI job’s universe for univ1:
% ./mpirun -np 4 --universe univ1 t_connect

If the client and server jobs span more than 1 node, the first job (that is,
the server job) must specify on the mpirun command line all the nodes that
will be used. Specifying the node names allocates the specified hosts from
the entire universe of server and client jobs.

For example, if the server runs on node0 and the client job runs on node1
only, the command to launch the server must specify both nodes (using the
-host node0,node1 flag) even if it uses only one process on node0.

Assuming that the persistent daemon is started on node0, the command to
launch the server would look like this:
node0% ./mpirun -np 1 --universe univ1 -host node0,node1 t_accept
The command to launch the client is:
node0% ./mpirun -np 4 --universe univ1 -host node1 t_connect

Note – Name publishing does not work in jobs between different universes.
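
For readers who don't have the ClusterTools test programs at hand, here is a
rough sketch of what a minimal accept/connect pair along the lines of t_accept
and t_connect might look like. The actual test sources are not shown in this
thread, so the service name "ompi_test" and the message exchange below are
made up for illustration; the sketch uses MPI_Publish_name/MPI_Lookup_name,
which (per the note above) only works within a single universe.

/* --- t_accept-like server (one file, run with -np 1) --- */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    int msg = 0;
    MPI_Comm client;

    MPI_Init(&argc, &argv);
    MPI_Open_port(MPI_INFO_NULL, port);
    /* Publish the port under an arbitrary service name. */
    MPI_Publish_name("ompi_test", MPI_INFO_NULL, port);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &client);
    MPI_Recv(&msg, 1, MPI_INT, 0, 0, client, MPI_STATUS_IGNORE);
    printf("server received %d from the client job\n", msg);
    MPI_Unpublish_name("ompi_test", MPI_INFO_NULL, port);
    MPI_Comm_disconnect(&client);
    MPI_Close_port(port);
    MPI_Finalize();
    return 0;
}

/* --- t_connect-like client (separate file, e.g. run with -np 4) --- */
#include <mpi.h>

int main(int argc, char **argv)
{
    char port[MPI_MAX_PORT_NAME];
    int rank, msg = 42;
    MPI_Comm server;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* All ranks look up the port; the connect is collective with root 0. */
    MPI_Lookup_name("ompi_test", MPI_INFO_NULL, port);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_WORLD, &server);
    if (rank == 0)
        MPI_Send(&msg, 1, MPI_INT, 0, 0, server);
    MPI_Comm_disconnect(&server);
    MPI_Finalize();
    return 0;
}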



Re: [OMPI devel] devel Digest, Vol 801, Issue 1

2007-07-17 Thread Ralph Castain



On 7/17/07 5:37 AM, "Jeff Squyres"  wrote:

> On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:
> 
>>> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
>>> MPI_COMM_CONNECT models.  We do support this in Open MPI, but the
>>> restrictions (in terms of ORTE) may not be sufficient for you.
>> 
>> perhaps i'll experiment -- any clues as to what the orte
>> restrictions might be?
> 
> The main constraint is that you have to run a "persistent" orted that
> will span all your MPI_COMM_WORLD's.  We have only lightly tested
> this scenario -- Ralph, can you comment more here?

Actually, I'm not convinced Open MPI really supports either of those two MPI
semantics. It is true that we have something in our code repository, but I'm
not convinced it actually does what people think.

There are two use-cases one must consider:

1. an application code spawns another job and then at some later point wants
to connect to it. Our current implementation of comm_spawn does this
automatically via the accept/connect procedure, so we have this covered.
However, it relies upon the notion that (a) the parent job *knows* the jobid
of the child, and (b) the parent sends a message to the child telling it
where and how to rendezvous with it. You don't need the persistent daemon
here.

2. a user starts one application, and then starts another (would have to be
in a separate window or batch job as we do not support running mpirun in the
background) that connects to the first. The problem here is that neither
application knows the other's jobid, has any info on how to communicate
with the other, or knows a common rendezvous point. You would definitely
need a persistent daemon for this use-case.

I would have to review the code to see, but my best guess from what I
remember is that we don't actually support the second use-case at this time.
It would be possible to do so, albeit complicated - but I'm almost certain
nobody ever implemented it. I had talked at one time about providing the
necessary glue, either at the command line or (better) via some internal
"magic", but never got much interest - and so never did anything about
it...and I don't recall seeing anyone else make the necessary changes.
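
For concreteness, use-case 1 above looks roughly like the following from the
application's point of view (a minimal sketch; "./worker" and the process
count are placeholders). The intercommunicator comes back from MPI_Comm_spawn
already connected, which is exactly the rendezvous step that is missing in
use-case 2.

/* Parent: spawn a child job; no explicit accept/connect is needed. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm children;
    int errcodes[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 2, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, &children, errcodes);
    /* ... communicate with the children over 'children' ... */
    MPI_Comm_disconnect(&children);
    MPI_Finalize();
    return 0;
}

/* Child ("./worker"): the intercommunicator to the parent is retrieved
   with MPI_Comm_get_parent rather than negotiated by hand. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    /* ... communicate with the parent over 'parent' ... */
    MPI_Comm_disconnect(&parent);
    MPI_Finalize();
    return 0;
}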

> 
>>> - It also likely doesn't work yet; we started the integration work
>>> and ran into a technical issue that required further discussion with
>>> Platform.  They're currently looking into it; we stopped the LSF work
>>> in ORTE until they get back to us.
>> 
>> i see -- i might be trying to work on the 6.x support today. can you
>> give me any hints on what the problem was in case i run into the same
>> issue?
> 
> Something was wrong with the lsb_launch() function; using it caused a
> significant slowdown in the job and it generally wasn't behaving as
> expected.  Platform issued a fix for me yesterday (i.e., a one-off/
> unsupported binary for development purposes) that I haven't gotten to
> test yet.
> 
>>> - That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might*
>>> offer a way out here.  But I think a) THREAD_MULTIPLE isn't working
>>> yet (other OMPI members are working on this), and b) even when
>>> THREAD_MULTIPLE works, there will be ORTE issues to deal with
>>> (canceling pending resource allocations, etc.).  Ralph mentioned that
>>> someone else is working on such things on the TM/PBS/Torque side; I
>>> haven't followed that effort closely.
>> 
>> it seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there
>> are perhaps other workarounds (using threads in other ways, etc.).
>> also, i'd love to hear about the existing efforts -- i'm hoping
>> someone working on them might be reading this ... ;)
> 
> Ralph -- can you chime in on the TM/PBS/Torque efforts?

It isn't my work. I can ask the other developer whether he is interested in
talking with you and/or is willing to let me make his work more public (part
of it has been discussed on the public user list). I believe this is part of
his PhD thesis, so I want to err on the side of caution here.








Re: [OMPI devel] devel Digest, Vol 801, Issue 1

2007-07-17 Thread Jeff Squyres

On Jul 16, 2007, at 2:28 PM, Matthew Moskewicz wrote:


>> MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
>> MPI_COMM_CONNECT models.  We do support this in Open MPI, but the
>> restrictions (in terms of ORTE) may not be sufficient for you.
> 
> perhaps i'll experiment -- any clues as to what the orte
> restrictions might be?

The main constraint is that you have to run a "persistent" orted that
will span all your MPI_COMM_WORLD's.  We have only lightly tested
this scenario -- Ralph, can you comment more here?

>> - It also likely doesn't work yet; we started the integration work
>> and ran into a technical issue that required further discussion with
>> Platform.  They're currently looking into it; we stopped the LSF work
>> in ORTE until they get back to us.
> 
> i see -- i might be trying to work on the 6.x support today. can you
> give me any hints on what the problem was in case i run into the same
> issue?

Something was wrong with the lsb_launch() function; using it caused a
significant slowdown in the job and it generally wasn't behaving as
expected.  Platform issued a fix for me yesterday (i.e., a one-off/
unsupported binary for development purposes) that I haven't gotten to
test yet.

>> - That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might*
>> offer a way out here.  But I think a) THREAD_MULTIPLE isn't working
>> yet (other OMPI members are working on this), and b) even when
>> THREAD_MULTIPLE works, there will be ORTE issues to deal with
>> (canceling pending resource allocations, etc.).  Ralph mentioned that
>> someone else is working on such things on the TM/PBS/Torque side; I
>> haven't followed that effort closely.
> 
> it seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there
> are perhaps other workarounds (using threads in other ways, etc.).
> also, i'd love to hear about the existing efforts -- i'm hoping
> someone working on them might be reading this ... ;)


Ralph -- can you chime in on the TM/PBS/Torque efforts?

--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] devel Digest, Vol 801, Issue 1

2007-07-16 Thread Matthew Moskewicz

hi again,


>>> i'll probably just continue experimenting on my own for the
>>> moment (tracking
>>> any updates to the main trunk LSF support) to see if i can figure
>>> it out. any
>>> advice on the best way to get such back support into the trunk, if and
>>> when it exists / is working?
>>
>> The *best* way would be for you to sign a third-party agreement -
>> see the
>>  web site for details and a copy. Barring that, the only option
>> would be to
>>  submit the code through either Jeff or me. We greatly prefer the
>> agreement
>>  method as it is (a) less burdensome on us and (b) gives you greater
>>  flexibility.
>
> i'll talk to 'the man' -- it should be okay ... eventually, at
> least ...

See http://www.open-mpi.org/community/contribute/ for details.  As an
open project, we always welcome new developers, but we do need to
keep the IP tidy.



will do.


MPI-2 does support the MPI_COMM_JOIN and MPI_COMM_ACCEPT/
MPI_COMM_CONNECT models.  We do support this in Open MPI, but the
restrictions (in terms of ORTE) may not be sufficient for you.


perhaps i'll experiment -- any clues as to what the orte restrictions might be?
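
For anyone who wants to experiment, the MPI_COMM_JOIN side of this looks
roughly like the sketch below: two independently started jobs that already
share a TCP socket hand it to MPI, which turns it into an intercommunicator.
The port number and loopback address are arbitrary, error checking is
omitted, each side is assumed to run as a single process, and whether ORTE
actually supports this is exactly the open question here.

#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int fd;
    struct sockaddr_in addr;
    MPI_Comm inter;

    MPI_Init(&argc, &argv);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                /* arbitrary example port */

    if (argc > 1 && strcmp(argv[1], "server") == 0) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(lfd, (struct sockaddr *) &addr, sizeof(addr));
        listen(lfd, 1);
        fd = accept(lfd, NULL, NULL);           /* wait for the other job */
        close(lfd);
    } else {
        fd = socket(AF_INET, SOCK_STREAM, 0);
        inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
        connect(fd, (struct sockaddr *) &addr, sizeof(addr));
    }

    /* Both processes pass the connected, otherwise-idle socket to MPI. */
    MPI_Comm_join(fd, &inter);
    MPI_Comm_disconnect(&inter);
    close(fd);
    MPI_Finalize();
    return 0;
}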



Some other random notes in no particular order:

- As you noted, the LSF support is *very* new; it was just added last
week.

- It also likely doesn't work yet; we started the integration work
and ran into a technical issue that required further discussion with
Platform.  They're currently looking into it; we stopped the LSF work
in ORTE until they get back to us.



i see -- i might be trying to work on the 6.x support today. can you
give me any hints on what the problem was in case i run into the same
issue?


- FWIW, one of the main reasons OMPI/ORTE didn't add extensive/
flexible support for dynamic addition of resources was the potential
for queue time.  Many systems run "full" all the time, so if you try
to acquire more resources, you could just sit in a queue for minutes/
hours/days/weeks before getting nodes.  While it is certainly
possible to program with this model, we didn't really want to get
into the rat's nest of corner cases that this would entail, especially
since very few users are asking for it.



yeah, it does seem like the queuing issue is critical. i think as long
as the requests for more resources are non-blocking, and the
application itself can deal with that, it shouldn't create too many
corner cases. in fact, if the application wants to block (potentially
for a long time) that might be okay too (i.e. on the initial big
allocation, just after some startup routine determines the needed
initial resources).


- That being said, MPI_THREAD_MULTIPLE and MPI_COMM_SPAWN *might*
offer a way out here.  But I think a) THREAD_MULTIPLE isn't working
yet (other OMPI members are working on this), and b) even when
THREAD_MULTIPLE works, there will be ORTE issues to deal with
(canceling pending resource allocations, etc.).  Ralph mentioned that
someone else is working on such things on the TM/PBS/Torque side; I
haven't followed that effort closely.



it seems that MPI_THREAD_MULTIPLE is to be avoided for now, but there
are perhaps other workarounds (using threads in other ways, etc.).
also, i'd love to hear about the existing efforts -- i'm hoping
someone working on them might be reading this ... ;)
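
As a practical aside, the portable way to find out whether a given build
really provides MPI_THREAD_MULTIPLE is to request it and check what comes
back (a minimal sketch, not specific to Open MPI):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;

    /* Ask for full thread support, then check what the library granted. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        printf("only got thread level %d; restricting MPI calls to one "
               "thread\n", provided);
    MPI_Finalize();
    return 0;
}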


> well, certainly part of the issue is the need (or at least strong
> preference) to support 6.2 -- but read on.
>

[SNIP LSF API info/guesswork]


I am certainly not an expert on LSF (nor its API) -- I only started
using it last week!  Do you have any contacts to ask at Platform?
They would likely be the best ones to discuss this with.


i'm in the same boat. i'll try to talk to the people here at cadence
who might have said contacts at Platform.



--
Jeff Squyres
Cisco Systems




Matt.