Thanks a lot Jeff and Josh. 
It seems it will be quite an interesting task to implement a separate BTL for 
xensocket (xs) or anything related to migration. I plan to stick to the 
initial design for the time being, which seems ugly but is simple and quite 
efficient (at the moment). I have bundled xs with the tcp BTL, so instead of 
using the tcp BTL, interested parties will use the xs BTL, which supports TCP 
inherently. Execution starts over normal TCP; if we see that both endpoints 
are on the same physical host, we construct the xensockets (two, in fact, one 
per endpoint for receiving data, since xs is unidirectional). Once we get the 
signal that the xensockets are created and connected, we start making 
progress through the xs socket descriptors, which means the normal TCP socket 
descriptors stay alive but are no longer in charge, as no data is sent or 
received through them. When we migrate to another physical host, the plan is 
to somehow invalidate the xensockets and fall back to the normal TCP sockets. 
If a new co-located endpoint pair is detected on the new physical host, we do 
the same thing we did initially.
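
Roughly, per endpoint the idea looks like this (just a sketch of the
mechanism, not the real BTL code; names like xs_send_fd, xs_up and
endpoint_write are made up):

#include <stdbool.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

struct endpoint {
    int  tcp_fd;      /* always-valid TCP socket to the peer */
    int  xs_send_fd;  /* xensocket used only for sending (unidirectional) */
    int  xs_recv_fd;  /* xensocket used only for receiving */
    bool xs_up;       /* set once both xensockets are created and connected */
};

/* Invalidate the xensockets (e.g. after a live migration) and fall back
 * to the TCP descriptor, which has been kept alive all along. */
static void endpoint_drop_xs(struct endpoint *ep)
{
    if (ep->xs_up) {
        close(ep->xs_send_fd);
        close(ep->xs_recv_fd);
        ep->xs_send_fd = ep->xs_recv_fd = -1;
        ep->xs_up = false;
    }
}

/* Send path: prefer the xensocket when it is up, otherwise use TCP. */
static ssize_t endpoint_write(struct endpoint *ep, const void *buf, size_t len)
{
    int fd = ep->xs_up ? ep->xs_send_fd : ep->tcp_fd;
    ssize_t rc = send(fd, buf, len, 0);
    if (rc < 0 && ep->xs_up) {
        /* the xensocket went away (peer migrated?); revert to TCP and retry */
        endpoint_drop_xs(ep);
        rc = send(ep->tcp_fd, buf, len, 0);
    }
    return rc;
}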

I am not sure whether it is an efficient design, but in theory it seems 
interesting, although it has a slight overhead. The worst part of the design 
is that it is highly TCP-centric. My current status is that I am able to run 
normal MPI programs on the xs BTL, but I am having some problems with some 
benchmark programs that use non-blocking sends and receives coupled with 
MPI_Barrier(). Something, somewhere, somehow gets lost.
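
The pattern that trips things up is roughly the following (just the shape of
the traffic, not the actual benchmark): every rank posts a non-blocking
receive and send, enters the barrier while both are still in flight, and only
then waits for completion.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, token = -1;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    /* post the receive and the send, then hit the barrier while both
     * requests are still outstanding */
    MPI_Irecv(&token, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&rank,  1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
    MPI_Barrier(MPI_COMM_WORLD);
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received %d from %d\n", rank, token, left);
    MPI_Finalize();
    return 0;
}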

Xensockets originally supported only non-blocking send/recv and did not have 
the code necessary to support epoll/select. We had to add that code to the 
kernel module, so I am quite sure they will work with the new opal/libevent.
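
For reference, the hook is essentially a poll entry point in the module's
proto_ops plus a wake-up when data arrives; the sketch below only shows the
shape of it (struct xen_sock, its fields, and xs_from_socket are placeholders,
not the real patch):

#include <linux/net.h>
#include <linux/poll.h>
#include <linux/wait.h>

struct xen_sock {
    wait_queue_head_t wait;   /* woken whenever data arrives or space frees up */
    int has_data;             /* ring has bytes ready to read */
    int has_space;            /* ring has room for more writes */
};

static struct xen_sock *xs_from_socket(struct socket *sock)
{
    /* placeholder: however the module maps a struct socket to its state */
    return (struct xen_sock *)sock->sk;
}

/* proto_ops.poll implementation: register on the wait queue and report the
 * current readiness mask so select/poll/epoll (and libevent) can see it. */
static unsigned int xensocket_poll(struct file *file, struct socket *sock,
                                   poll_table *wait)
{
    struct xen_sock *xs = xs_from_socket(sock);
    unsigned int mask = 0;

    poll_wait(file, &xs->wait, wait);
    if (xs->has_data)
        mask |= POLLIN | POLLRDNORM;
    if (xs->has_space)
        mask |= POLLOUT | POLLWRNORM;
    return mask;
}

/* The event-channel/interrupt handler that moves data then has to call
 * wake_up_interruptible(&xs->wait) so that sleeping pollers are notified. */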

Best Regards,
Muhammad Atif

----- Original Message ----
From: Josh Hursey <jjhur...@open-mpi.org>
To: Open MPI Developers <de...@open-mpi.org>
Sent: Wednesday, March 19, 2008 2:20:59 AM
Subject: Re: [OMPI devel] xensocket btl and migration

Muhammad,

With regard to your question on migration, you will likely have to 
reload the BTL components when a migration occurs. Open MPI currently 
assumes that once the set of BTLs is decided upon in a process, they 
are used until the application completes. There is some limited 
support for failover, in which if one BTL 'fails' it is disregarded 
and a previously defined alternative path is used. For example, if 
between two peers Open MPI has the choice of using tcp or openib, it 
will use openib. If openib were to fail during the run of the job, it 
may be possible for Open MPI to fail over and use just tcp. I'm not 
sure how well tested this ability is; others can comment if you are 
interested in it.

However, failover is not really what you are looking for. What it seems 
you are looking for is the ability to tell two processes that they 
should no longer communicate over tcp but continue communicating over 
xensockets, or vice versa. One technique would be, upon migration, to 
unload the BTLs (component_close), then reopen (component_open), 
reselect (component_select), and re-exchange the modex; the processes 
should then settle into the new configuration. You will have to make 
sure that any state Open MPI has cached, such as network addresses and 
node name data, is refreshed upon restart. Take a look at the 
checkpoint/restart logic for how I do this in the code base 
([opal|orte|ompi]/runtime/*_cr.c).
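
In rough pseudocode (placeholder function names, not our real entry points;
the real ones live in the BTL/BML base and the *_cr.c files mentioned above),
the sequence would look something like:

/* placeholder stubs standing in for the real framework entry points */
static void btl_framework_close(void)  {}
static void btl_framework_open(void)   {}
static void btl_framework_select(void) {}
static void modex_reexchange(void)     {}
static void rewire_peers(void)         {}

static int refresh_transports_after_migration(void)
{
    /* 1. tear down the currently selected BTL modules (component_close) */
    btl_framework_close();

    /* 2. reopen and reselect the BTL components (component_open /
     *    component_select); on the new host the xensocket BTL may or may
     *    not be usable this time around */
    btl_framework_open();
    btl_framework_select();

    /* 3. republish this process's addressing information and pull the
     *    peers' updated data (the modex exchange), after refreshing any
     *    cached addresses and node names */
    modex_reexchange();

    /* 4. let the PML/BML rebuild its per-peer transport lists so traffic
     *    settles onto the newly selected BTLs */
    rewire_peers();

    return 0;
}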

It is likely that there is another, more efficient method, but I don't 
have anything to point you to at the moment. One idea would be to add 
a refresh function to the modex which would force the re-exchange of a 
single process's address set. There are a slew of problems with this 
that you would have to overcome, including race conditions, but I 
think they can be surmounted.

I'd be interested in hearing your experiences implementing this in  
Open MPI. Let me know if I can be of any more help.

Cheers,
Josh

On Mar 9, 2008, at 6:13 AM, Muhammad Atif wrote:

> Okay guys... with all your support and help in understanding the OMPI  
> architecture, I was able to get Xensocket to work. Only minor changes to  
> the xensocket kernel module were needed to make it compatible with  
> libevent. I am getting results which are bad, but I am sure I have to  
> clean up the code. At least my results have improved over Xen's native  
> netfront-netback for messages larger than 1 MB.
>
> I started by making minor changes in the TCP BTL, but that does not seem  
> to be the best way: the changes are quite large, and it is better to have  
> a separate, dedicated BTL for xensockets. As you guys might be aware, Xen  
> supports live migration, so now I have one stupid question. My knowledge  
> so far suggests that a BTL component is initialized only once. The  
> scenario here is that my guest OS is migrated from one physical node to  
> another and realizes that the communicating processes are now on one  
> physical host, so they should abandon use of the TCP BTL and make use of  
> the Xensocket BTL. I am sure this would not happen out of the box, but is  
> it possible without making heavy changes to the Open MPI architecture?
> With the current design, I am running a mix of tcp and xensocket BTLs,  
> and endpoints check periodically whether they are on the same physical  
> host or not. This has quite a big penalty in terms of time.
>
> Another question (good thing I am using email, otherwise you guys would  
> beat the hell outta me, it is such a basic question): I am not able to  
> trace the MPI_Recv(...) API call and calls like it. Inside the code of  
> MPI_Recv(..) we make a call to rc = MCA_PML_CALL(recv(buf, count ... ).  
> This call goes through the macro, and pml.recv(..) gets invoked  
> (mca_pml_base_module_recv_fn_t pml_recv;). Where can I find the actual  
> function? I get totally lost when trying to pinpoint what exactly is  
> happening. Basically, I am looking for the place where the tcp BTL's  
> receive gets called with all the goodies and parameters that were passed  
> by the MPI programmer. I hope I have made my question understandable.
>
> Best Regards,
> Muhammad Atif
>
>
> ----- Original Message ----
> From: Brian W. Barrett <brbar...@open-mpi.org>
> To: Open MPI Developers <de...@open-mpi.org>
> Sent: Wednesday, February 6, 2008 2:57:31 AM
> Subject: Re: [OMPI devel] xensocket - callbacks through OPAL/libevent
>
> On Mon, 4 Feb 2008, Muhammad Atif wrote:
>
> > I am trying to port xensockets to Open MPI. In principle, I have the
> > framework and everything, but there seems to be a small issue: I cannot
> > get libevent (or OPAL) to give callbacks for receive (or send) on
> > xensockets. I have tried to implement native code for xensockets with
> > the libevent library, and again the same issue: no callbacks! With
> > normal sockets, callbacks come easily.
> >
> > So the question is, do the socket/file descriptors have to have some
> > special mechanism attached to them to support callbacks for
> > libevent/opal, i.e. some structure/magic? Maybe the developers of
> > xensockets did not add that callback/interrupt hook at creation time.
> > Xensockets is open source, but my knowledge of these issues is limited,
> > so some pointers in the right direction might be useful.
>
> Yes and no :).  As you discovered, the OPAL interface just  
> repackages a
> library called libevent to handle its socket multiplexing.  Libevent  
> can
> use a number of different mechanisms to look for activity on sockets,
> including select() and poll() calls.  On Linux, it will generally use
> poll().  poll() requires some kernel support to do its thing, so if
> Xensockets doesn't implement the right magic to trigger poll() events,
> then libevent won't work for Xensockets.  There's really nothing you  
> can
> do from the Open MPI front to work around this issue -- it would  
> have to
> be fixed as part of Xensockets.
>
> > Second question is, what if we cannot have the callbacks? What is the
> > recommended way to implement the BTL component for such a device? Do we
> > need to do this with event timers?
>
> Have a look at any of the BTLs that isn't TCP -- none of them use  
> libevent
> callbacks for progress.  Instead, they provide a progress function  
> as part
> of the BTL interface, which is called on a regular basis whenever  
> progress
> needs to be made.
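
(Side note: the progress-function approach Brian describes has roughly the
shape below -- placeholder names, just to show that it takes no arguments,
polls the device itself, and reports back whether useful work was done.)

#include <stddef.h>

struct xs_endpoint {
    struct xs_endpoint *next;
    int recv_fd;                          /* unidirectional xensocket for receiving */
};

static struct xs_endpoint *xs_endpoints;  /* list of connected endpoints */

/* placeholder stubs for whatever the real BTL keeps per endpoint */
static int  xs_ring_readable(struct xs_endpoint *ep) { (void)ep; return 0; }
static void xs_deliver(struct xs_endpoint *ep)       { (void)ep; }

/* Called from the Open MPI progress loop on a regular basis. */
static int mca_btl_xs_component_progress(void)
{
    int completions = 0;
    struct xs_endpoint *ep;

    for (ep = xs_endpoints; ep != NULL; ep = ep->next) {
        if (xs_ring_readable(ep)) {
            xs_deliver(ep);               /* hand the fragment up the stack */
            completions++;
        }
    }
    return completions;                   /* non-zero: useful work was done */
}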
>
> Brian

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel
