Re: [OMPI devel] Location of binaries

2010-03-22 Thread Timothy Hayes
If I understood your question correctly, it's not really the MPI
implementation's job to solve this. You either have to copy the
binaries to each machine manually or (more usually) give each machine
access to a common shared file system.

Tim

On 22 March 2010 15:42, herbey zepeda  wrote:

> Hi,
>
> In Open MPI, where are the binaries stored?
> Let's say I have a program P that adds the numbers in an array of length 10
> I want to distribute the execution between 2 computers A and B
> A adds from array[0] to array[4]
> B adds from array[5] to array[9]
>
> I understand that I have to tell MPI that machines A and B exist and that I
> want the processes to be executed as required.
>
> No problem with this, my confusion is in the implementation.
>
> Let's say I am running the adding program P from machine C.
>
> When I execute the P program, how do computers A and B know what binary to
> execute? My binaries are on computer C!
>
> Does MPI copy the binaries to machines A and B from C, and then execute
> the program?
>
> How is the program P loaded into memory on A and B? Is P stored on disk on A
> and B?
>
> Do I have to copy the P binaries in A and B prior to executing the program?
>
> When the program P has finished execution, what happens to the binaries?
>
> I have not found anything on the web to answer my question
>
> Thank you
>
>
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>


Re: [OMPI devel] Question about ompi_proc_t

2009-12-07 Thread Timothy Hayes
Sorry, I think I read your question too quickly. Ignore me. :-)

2009/12/7 Timothy Hayes <haye...@tcd.ie>

> Is it not a forward declaration, then defined in the PML components
> individually based on their own requirements?
>
> 2009/12/7 Pavel Shamis (Pasha) <pash...@gmail.com>
>
>> In the ompi_proc_t structure (ompi/proc/proc.h:54) we keep a pointer to
>> proc_pml - "struct mca_pml_base_endpoint_t* proc_pml". I tried to find
>> the definition of "struct mca_pml_base_endpoint_t", but I failed. Does
>> somebody know where it is defined?
>>
>> Regards,
>> Pasha
>>
>>
>


Re: [OMPI devel] Question about ompi_proc_t

2009-12-07 Thread Timothy Hayes
Is it not a forward declaration, then defined in the PML components
individually based on their own requirements?

2009/12/7 Pavel Shamis (Pasha) 

> In the ompi_proc_t structure (ompi/proc/proc.h:54) we keep a pointer to
> proc_pml - "struct mca_pml_base_endpoint_t* proc_pml". I tried to find
> the definition of "struct mca_pml_base_endpoint_t", but I failed. Does
> somebody know where it is defined?
>
> Regards,
> Pasha
>
>


Re: [OMPI devel] MPI Message Communication over TCP/IP

2009-05-06 Thread Timothy Hayes
Are you talking about my document? If so, that's no problem at all. If there
are any mistakes in my facts just let me know and I'll change them.

Tim

2009/5/6 Jeff Squyres <jsquy...@cisco.com>

> Thanks!
>
> Would you mind if I posted this on www.open-mpi.org?
>
>
>
> On Apr 25, 2009, at 10:05 AM, Timothy Hayes wrote:
>
> I uploaded it to http://www.hotshare.net/file/131218-829472246c.html
>>
>> I'm not sure if it's any good or even if it's 100% accurate; but if
>> someone gets any use out of it, that would be good.
>>
>> Tim
>> 2009/4/17 Jeff Squyres <jsquy...@cisco.com>
>> On Apr 16, 2009, at 11:38 AM, Timothy Hayes wrote:
>>
>> From what I understand, MPI_Send will pass through 3 separate layers of code
>> before reaching the socket file descriptors you've found. The PML
>> (Point-to-Point Messaging Layer) bridges the MPI semantics to the underlying
>> point-to-point communications. The standard PML implementation is called
>> 'ob1', which is what indirectly calls the code you found. MPI_Send should go
>> through pml_isend() or pml_send(), which will request a BTL (Byte Transfer
>> Layer) module from the BML (BTL Management Layer) and invoke the BTL's
>> btl_prepare_src() or btl_alloc() functions before calling btl_send(). It
>> becomes clearer when you step through it all with a debugger though ;-)
>>
>> If you're interested, I've recently implemented a BTL component of my own
>> and am now writing up a report on the process. It will be ready next week,
>> so if you think it might be useful, just let me know.
>>
>> Ooohh... that would be positively yummy!  We can even host/link to that on
>> www.open-mpi.org.  :-)
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>>
>>
>>
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>


Re: [OMPI devel] MPI Message Communication over TCP/IP

2009-04-25 Thread Timothy Hayes
I uploaded it to http://www.hotshare.net/file/131218-829472246c.html

I'm not sure if it's any good or even if it's 100% accurate; but if someone
gets any use out of it, that would be good.

Tim
2009/4/17 Jeff Squyres <jsquy...@cisco.com>

> On Apr 16, 2009, at 11:38 AM, Timothy Hayes wrote:
>
>> From what I understand, MPI_Send will pass through 3 separate layers of code
>> before reaching the socket file descriptors you've found. The PML
>> (Point-to-Point Messaging Layer) bridges the MPI semantics to the underlying
>> point-to-point communications. The standard PML implementation is called
>> 'ob1', which is what indirectly calls the code you found. MPI_Send should go
>> through pml_isend() or pml_send(), which will request a BTL (Byte Transfer
>> Layer) module from the BML (BTL Management Layer) and invoke the BTL's
>> btl_prepare_src() or btl_alloc() functions before calling btl_send(). It
>> becomes clearer when you step through it all with a debugger though ;-)
>>
>> If you're interested, I've recently implemented a BTL component of my own
>> and am now writing up a report on the process. It will be ready next week,
>> so if you think it might be useful, just let me know.
>>
>
> Ooohh... that would be positively yummy!  We can even host/link to that on
> www.open-mpi.org.  :-)
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>
>


Re: [OMPI devel] MPI Message Communication over TCP/IP

2009-04-16 Thread Timothy Hayes
From what I understand, MPI_Send will pass through 3 separate layers of code
before reaching the socket file descriptors you've found. The PML
(Point-to-Point Messaging Layer) bridges the MPI semantics to the underlying
point-to-point communications. The standard PML implementation is called
'ob1', which is what indirectly calls the code you found. MPI_Send should go
through pml_isend() or pml_send(), which will request a BTL (Byte Transfer
Layer) module from the BML (BTL Management Layer) and invoke the BTL's
btl_prepare_src() or btl_alloc() functions before calling btl_send(). It
becomes clearer when you step through it all with a debugger though ;-)

If you're interested, I've recently implemented a BTL component of my own
and am now writing up a report on the process. It will be ready next week,
so if you think it might be useful, just let me know.

Tim


2009/4/16 pranav jadhav 

>
> Hi All,
>
> I am new to the MPI library. I downloaded and started using the Open MPI
> library to build MPI programs. I wanted to know how an MPI program
> communicates over the network. I am trying to trace the flow of the MPI_Send
> and MPI_Bcast APIs to find the actual send/recv calls being used for sending
> the packets over the network. Looking into the code base, I found some
> TCP-related socket connections in "ompi/mca/btl/tcp/", but I am not able to
> figure out how MPI_Send works. If anyone knows how the MPI_Send and
> MPI_Recv APIs work for communicating over the network, please let me know.
>
> Thank you,
>
> Regards,
> Pranav Jadhav
> Stony Brook University
>
>
>
>


Re: [OMPI devel] A Couple of Questions

2009-04-14 Thread Timothy Hayes
Yeah, that seemed to do the trick, thank you. I'm curious though: is there a
constant somewhere in the code that indicates the maximum size the PML
header can be? I appreciate it's 32 right now, but I never feel right just
hard-coding that in.

2009/4/13 George Bosilca <bosi...@eecs.utk.edu>

>
> On Apr 12, 2009, at 21:58 , Timothy Hayes wrote:
>
>  I was wondering if someone might be able to shed some light on a couple of
>> questions I have.
>>
>> When you receive a fragment/base_descriptor in a BTL module, is the raw
>> data allowed to be fragmented when you invoke the callback function? By that
>> I mean, I'm using a circular buffer in each endpoint so sometimes data loops
>> back around. Currently I'm doing a two step copy: from my socket to the
>> circular buffer and then from the circular buffer to the fragment. This
>> actually affects my total throughput quite a bit; it would be much nicer to
>> just point to the buffer instead. When I tried using two base_segments to
>> point to the start and end of buffer I got some pretty strange errors. I'm
>> just wondering if someone could confirm or deny that you can or can't do
>> this, maybe those errors were down to human error instead.
>>
>
> On the descriptor you can set a number of iovecs containing the raw data.
> You don't have to make it contiguous prior to calling up into the PML. I think
> the PML header has to be contiguous, so you have to make sure that the first
> 32 bytes of the message are contiguous.
>
>   My other question is about the BTL failover system. Would someone be able
>> to briefly explain how it works or maybe point me to some docs? I'm actually
>> expecting the file descriptors in my module to fail a certain point during
>> an Open MPI job and I'd like my BTL module to fail gracefully and allow the
>> TCP module to take over in its place. I'm not sure how to explicitly make
>> the the BTL module say to the rest of Open MPI "don't use my anymore"
>> though.
>>
>
> There is no way to say "don't use me at all anymore". This is per-peer
> based, so you will have to return an error on every peer. Try returning
> OMPI_ERR_OUT_OF_RESOURCE from all functions that allocate descriptors
> (_alloc, _prepare_src and _prepare_dst), and the PML will end up removing
> this BTL from the list.
>
>  george.
>
>
>> Happy Easter
>> Tim
>>
>>
>
>
>


Re: [OMPI devel] Infinite Loop: ompi_free_list_wait

2009-03-26 Thread Timothy Hayes
It was just a few kinks actually. I think the bitmap type moved from
orte to opal; the opal_hash_table functions changed slightly; and the modex
stuff was renamed from something like pml_modex to ompi_modex. There were a
few extra functions in the module descriptor which I set to NULL, and the
alloc/prepare_src/dst functions took some extra arguments. Finally,
mca_btl_base_descriptor_t changed slightly: it took an order variable.

I think that's it. In any case, it didn't take long at all to convert my
component. The problem *seems* to have gone away now, which is fantastic,
although now it's revealed something wrong in my own code so I'm back to
debugging in any case.

Thanks very much for all the advice.

Tim

2009/3/26 Lenny Verkhovsky <lenny.verkhov...@gmail.com>

> What is the error that you are getting from compilation failure?
>
> Lenny.
>
> On 3/23/09, Timothy Hayes <haye...@tcd.ie> wrote:
>>
>> That's a relief to know, although I'm still a bit concerned. I'm looking
>> at the code for the OpenMPI 1.3 trunk and in the ob1 component I can see the
>> following sequence:
>>
>> mca_pml_ob1_recv_frag_callback_match -> append_frag_to_list ->
>> MCA_PML_OB1_RECV_FRAG_ALLOC -> OMPI_FREE_LIST_WAIT -> __ompi_free_list_wait
>>
>> so I'm guessing unless the deadlock issue has been resolved for that
>> function, it will still fail non deterministically. I'm quite eager to give
>> it a try, but my component doesn't compile as is with the 1.3 source. Is it
>> trivial to convert it?
>>
>> Or maybe you were suggesting that I go into the code of ob1 myself and
>> manually change every _wait to _get?
>>
>> Kind regards
>> Tim
>>
>> 2009/3/23 George Bosilca <bosi...@eecs.utk.edu>
>>
>>> It is a known problem. When the freelist is empty, going into
>>> ompi_free_list_wait will block the process until at least one fragment
>>> becomes available. As a fragment can become available only when returned by
>>> the BTL, this can lead to deadlocks in some cases. The workaround is to ban
>>> the usage of the blocking _wait function and replace it with the
>>> non-blocking version _get. The PML has all the required logic to deal with
>>> the cases where a fragment cannot be allocated. We changed most of the BTLs
>>> to use _get instead of _wait a few months ago.
>>>
>>>  Thanks,
>>>george.
>>>
>>> On Mar 23, 2009, at 11:58 , Timothy Hayes wrote:
>>>
>>>  Hello,
>>>>
>>>> I'm working on an OpenMPI BTL component and am having a recurring
>>>> problem, I was wondering if anyone could shed some light on it. I have a
>>>> component that's quite straight forward, it uses a pair of lightweight
>>>> sockets to take advantage of being in a virtualised environment
>>>> (specifically Xen). My code is a bit messy and has lots of inefficiencies,
>>>> but the logic seems sound enough. I've been able to execute a few simple
>>>> programs successfully using the component, and they work most of the time.
>>>>
>>>> The problem I'm having is actually happening in higher layers,
>>>> specifically in my asynchronous receive handler, when I call the callback
>>>> function (cbfunc) that was set by the PML in the BTL initialisation phase.
>>>> It seems to be getting stuck in an infinite loop at 
>>>> __ompi_free_list_wait(),
>>>> in this function there is a condition variable which should get set
>>>> eventually but just doesn't. I've stepped through it with GDB and I get a
>>>> backtrace of something like this:
>>>>
>>>> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
>>>> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
>>>> __ompi_free_list_wait -> opal_condition_wait
>>>>
>>>> and from there it just loops. Although this is happening in higher
>>>> levels, I haven't noticed something like this happening in any of the other
>>>> BTL components so chances are there's something in my code that's causing
>>>> this. I very much doubt that it's actually waiting for a list item to be
>>>> returned since this infinite loop can occur non deterministically and
>>>> sometimes even on the first receive callback.
>>>>
>>>> I'm really not too sure what else to include with this e-mail. I could
>>>> send my source code (a bit nasty right now) if it would be helpful, but I'm
>>>> hopin

Re: [OMPI devel] Infinite Loop: ompi_free_list_wait

2009-03-23 Thread Timothy Hayes
That's a relief to know, although I'm still a bit concerned. I'm looking at
the code for the OpenMPI 1.3 trunk and in the ob1 component I can see the
following sequence:

mca_pml_ob1_recv_frag_callback_match -> append_frag_to_list ->
MCA_PML_OB1_RECV_FRAG_ALLOC -> OMPI_FREE_LIST_WAIT -> __ompi_free_list_wait

so I'm guessing that unless the deadlock issue has been resolved for that
function, it will still fail non-deterministically. I'm quite eager to give
it a try, but my component doesn't compile as-is with the 1.3 source. Is it
trivial to convert it?

Or maybe you were suggesting that I go into the code of ob1 myself and
manually change every _wait to _get?

Kind regards
Tim

2009/3/23 George Bosilca <bosi...@eecs.utk.edu>

> It is a known problem. When the freelist is empty, going into
> ompi_free_list_wait will block the process until at least one fragment
> becomes available. As a fragment can become available only when returned by
> the BTL, this can lead to deadlocks in some cases. The workaround is to ban
> the usage of the blocking _wait function and replace it with the
> non-blocking version _get. The PML has all the required logic to deal with
> the cases where a fragment cannot be allocated. We changed most of the BTLs
> to use _get instead of _wait a few months ago.
>
>  Thanks,
>    george.
>
>
> On Mar 23, 2009, at 11:58 , Timothy Hayes wrote:
>
>  Hello,
>>
>> I'm working on an OpenMPI BTL component and am having a recurring problem,
>> I was wondering if anyone could shed some light on it. I have a component
>> that's quite straight forward, it uses a pair of lightweight sockets to take
>> advantage of being in a virtualised environment (specifically Xen). My code
>> is a bit messy and has lots of inefficiencies, but the logic seems sound
>> enough. I've been able to execute a few simple programs successfully using
>> the component, and they work most of the time.
>>
>> The problem I'm having is actually happening in higher layers,
>> specifically in my asynchronous receive handler, when I call the callback
>> function (cbfunc) that was set by the PML in the BTL initialisation phase.
>> It seems to be getting stuck in an infinite loop at __ompi_free_list_wait(),
>> in this function there is a condition variable which should get set
>> eventually but just doesn't. I've stepped through it with GDB and I get a
>> backtrace of something like this:
>>
>> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
>> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
>> __ompi_free_list_wait -> opal_condition_wait
>>
>> and from there it just loops. Although this is happening in higher levels,
>> I haven't noticed something like this happening in any of the other BTL
>> components so chances are there's something in my code that's causing this.
>> I very much doubt that it's actually waiting for a list item to be returned
>> since this infinite loop can occur non deterministically and sometimes even
>> on the first receive callback.
>>
>> I'm really not too sure what else to include with this e-mail. I could
>> send my source code (a bit nasty right now) if it would be helpful, but I'm
>> hoping that someone might have noticed this problem before or something
>> similar. Maybe I'm making a common mistake. Any advice would be really
>> appreciated!
>>
>> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>>
>> Kind regards
>> Tim Hayes
>>
>
>
>


[OMPI devel] Infinite Loop: ompi_free_list_wait

2009-03-23 Thread Timothy Hayes
Hello,

I'm working on an Open MPI BTL component and am having a recurring problem;
I was wondering if anyone could shed some light on it. I have a component
that's quite straightforward: it uses a pair of lightweight sockets to take
advantage of being in a virtualised environment (specifically Xen). My code
is a bit messy and has lots of inefficiencies, but the logic seems sound
enough. I've been able to execute a few simple programs successfully using
the component, and they work most of the time.

The problem I'm having is actually happening in higher layers, specifically
in my asynchronous receive handler, when I call the callback function
(cbfunc) that was set by the PML in the BTL initialisation phase. It seems
to be getting stuck in an infinite loop at __ompi_free_list_wait(); in this
function there is a condition variable which should get set eventually but
just doesn't. I've stepped through it with GDB and I get a backtrace of
something like this:

mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
__ompi_free_list_wait -> opal_condition_wait

and from there it just loops. Although this is happening in higher levels, I
haven't noticed anything like this happening in any of the other BTL
components, so chances are there's something in my code that's causing this.
I very much doubt that it's actually waiting for a list item to be returned,
since this infinite loop can occur non-deterministically and sometimes even
on the first receive callback.

I'm really not too sure what else to include with this e-mail. I could send
my source code (a bit nasty right now) if it would be helpful, but I'm
hoping that someone might have noticed this problem before or something
similar. Maybe I'm making a common mistake. Any advice would be really
appreciated!

I'm using OpenMPI 1.2.9 from the SVN tag repository.

Kind regards
Tim Hayes


[OMPI devel] Manipulating OPAL event system

2009-03-11 Thread Timothy Hayes
Hi everyone,

I'm currently writing my own BTL component that utilises a lightweight Linux
socket module. It doesn't have nearly as much functionality as a TCP/IP
socket, but it does the job, and I managed to add a simple polling function
to the module: it sleeps for whatever amount of time is entered in user
space, then checks various things (whether any messages have come in) and
returns the mask. It's simple and probably not the best, but it works for
the moment. :-)

I'm curious as to how I can get this into the OPAL event system. I see it's
based on libevent, and after reading through the documentation I can see
what it does but not how to make it work in my circumstance. It says it
supports select(2) and poll(2), but when calling event_set() the only
parameters I seem to be able to set are EV_READ or EV_WRITE; there's nothing
further to indicate I want to poll the file descriptor (as opposed to
calling select on it). Maybe I'm missing something without realising it, but
should my socket module be implementing some function to reveal in what ways
it can be queried for the presence of information? If not, maybe somebody
has an idea of how I can make it work with the OPAL event system.

Any advice or information would be greatly appreciated.

Kind regards
Tim Hayes


Re: [OMPI devel] When can I use OOB channel?

2009-01-20 Thread Timothy Hayes
Hi Ralph,

I'm quite embarrassed: I misread the function prototype and was passing in
the actual proc_name rather than a pointer to it! It didn't complain when I
was compiling, so I didn't think twice. It was a silly mistake on my part in
any case! That RML tip is still handy though, thanks.

Cheers
Tim

2009/1/20 Ralph Castain <r...@lanl.gov>

> You should be able to use the OOB by that point in the system. However,
> that is the incorrect entry point for sending messages - you need to enter
> via the RML. The correct call is to orte_rml.send_nb.
>
> Or, if you are going to send a buffer instead of an iovec, then the call
> would be to orte_rml.send_buffer_nb.
>
> Ralph
>
>
>
> On Jan 19, 2009, at 1:01 PM, Timothy Hayes wrote:
>
>   Hello
>>
>> I'm in the midst of writing a BTL component, all is going well although
>> today I ran into something unexpected. In the
>> mca_btl_base_module_add_procs_fn_t function, I'm trying to call
>> mca_oob_tcp_send_nb() which is returning -12 (ORTE_ERR_UNREACH). Is this
>> normal or have I done something wrong? Is there a way around this? It would
>> be great if I could call this function in that particular area of code.
>>
>> Kind regards
>> Tim Hayes
>>
>
>
>


[OMPI devel] When can I use OOB channel?

2009-01-19 Thread Timothy Hayes
Hello

I'm in the midst of writing a BTL component, all is going well although
today I ran into something unexpected. In the
mca_btl_base_module_add_procs_fn_t function, I'm trying to call
mca_oob_tcp_send_nb() which is returning -12 (ORTE_ERR_UNREACH). Is this
normal or have I done something wrong? Is there a way around this? It would
be great if I could call this function in that particular area of code.

Kind regards
Tim Hayes


Re: [OMPI devel] wiki: creating frameworks and components

2009-01-08 Thread Timothy Hayes
That's really helpful, thanks!

2009/1/8 Jeff Squyres 

> I created 3 wiki pages over the holidays:
>
>  * https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent
>   How to add a new to / remove an old component from Open MPI
>  * https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateFramework
>   How to add a new to / remove an old framework from Open MPI
>  * https://svn.open-mpi.org/trac/ompi/wiki/devel/Autogen
>   The role of autogen in finding/configuring/building frameworks and
> components
>
> Feel free to edit with any mistakes / things that are not clear.
>  Hopefully, these pages will be most useful to those who need to add/remove
> components and frameworks.
>
> Enjoy!
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>


Re: [OMPI devel] Amateur Guidance

2008-11-07 Thread Timothy Hayes
Hi everyone,

Thank you all for your replies. I've now read those additional papers and
went through the slides of the Open MPI workshop. I'm still a bit hazy on
the architecture of Open MPI (especially relevant to my project) so what
I've done is written what I think I understand about process to process
communication. I have a few specific questions, but maybe you could point me
in the right direction if I'm way off or maybe expand on areas where I'm a
little vague.

http://macneill.cs.tcd.ie/~hayesti/ompi.jpg

N.B. The XEN component in the BTL layer represents what I'm trying to make.

When mpirun() is invoked, the following takes place

1.   An out of band TCP channel is established between the process and
every other process. This is located in the ORTE (Open Runtime Environment)
-> MCA (Modular Component Architecture) -> OOB (Out of Band) -> TCP.

2.   A PML (Point-to-Point  Management Layer) is created, defaulting to
'ob1' which can handle multiple communication interfaces simultaneously.
This is located in OMPI (Open MPI) -> MCA (Modular Component Architecture)
-> PML (Point-to-Point Management Layer) -> ob1

3.   'ob1' attempts to set up one or more BTL (Byte Transfer Layer)
components. These components establish a point of contact with another
process for data transfer. Examples include loopback for a process talking
to itself, shared memory for inter-process communication on the same
machine, and TCP/IP for processes located on separate machines. There are
also specialist components, such as InfiniBand, should the hardware and
infrastructure be available.

4.   Each component is cohesive and is responsible for finding the
availability of resources specific to its own operation. Each component will
return zero, one or many module instances depending on circumstance.

5.   The out of band TCP channel is then used to communicate each
process' instantiated modules to every other process.

Questions with regard to the above

Is the OOB channel permanent for the duration of mpirun()?

I've read in places that the functions modex_send() & modex_recv() are used
to communicate on the OOB channel, but I see mca_oob_tcp_send and
mca_oob_tcp_recv declared in the header file. Is modex something else?

What exactly is queried and returned when a BTL component creates modules?
For example, if I run 4 MPI processes on the same machine, will the sm
component return 1 sm module to communicate with every other process, or 3 sm
modules, each communicating with 1 distinct module?

Once again, those 5 points are really sparse and they're sparse because I
don't know the detail myself. If anyone could shed some light on the process
I would be really grateful.

Kind regards

Tim Hayes

2008/11/3 Jeff Squyres 

> On Nov 3, 2008, at 10:39 AM, Eugene Loh wrote:
>
>  Main answer: no great docs to look at.  I think I've asked some OMPI
>> experts and that was basically the answer they gave me.
>>
>
> This is unfortunately the current state of the art -- no one has had time
> to write up good docs.
>
> Galen pointed to the new papers -- our main PML these days is "ob1" (teg
> died a long time ago).
>
> PML = Point to point messaging layer; it's basically the layer that is
> right behind MPI_SEND and friends.
>
> The ob1 PML uses BTL modules underneath.  BTL = Byte transfer layer;
> individual modules that send bytes back and forth over individual transports
> (e.g., shared memory, TCP, openfabrics, etc.).  There's a BTL for each of
> the major transports that we support.  The protocols that ob1 uses are
> described nicely in the papers that Galen sent, but the specific function
> interfaces are only best described in ompi/mca/btl/btl.h.
>
> Alternatively, we have a "cm" PML which uses MTL modules underneath.  MTL =
> Matching transport layer; it's basically for transports that expose very
> MPI-like interfaces (e.g., elan, tports, PSM, portals, MX).  This cm
> component is extremely thin; it basically provides a shim between Open MPI
> and the underlying transport.
>
> The big difference between cm and ob1 is that ob1 is a progress engine that
> tracks multiple transport interfaces (e.g., shared memory, tcp, openfabrics,
> ...etc. -- and therefore potentially multiple BTL module instances) and cm
> is a thin shim that simply translates between OMPI and the back-end
> interface -- cm will only use *ONE* MTL module instance.  Specifically: it
> is expected that the one MTL module will do all the progression, striping,
> ...or whatever it wants to do to move bytes from A to B by itself (very
> little/no help at all from OMPI's infrastructure).
>
> Does that help some?
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>
>