On Apr 15, 2007, at 10:25 PM, chaitali dherange wrote:

To keep things simple, we are making this scheduling static to some extent. By static I mean that we know our clusters use InfiniBand for MPI (from our study of the Open MPI source code, this uses 'mca_btl_openib_send()' from the ompi/mca/btl/openib/btl_openib.c file), so all the non-MPI communication can be assumed to be TCP communication using 'mca_btl_tcp_send()' from the ompi/mca/btl/tcp/btl_tcp.c file.

To implement this, we plan to use the following simple algorithm:

- before calling 'mca_btl_openib_send()': Lock0(x);
- before calling 'mca_btl_tcp_send()': Lock1(x);

Algo:

1. Allow Lock0(x) -> Lock0(x); meaning Lock0(x) may be followed by Lock0(x).
2. Allow Lock1(x) -> Lock1(x);
3. Do not allow Lock0(x) -> Lock1(x);
4. If Lock1(x) -> Lock0(x): since MPI calls are to have higher priority than the non-MPI ones, the non-MPI communication should be paused and its state saved in a queue. All non-MPI communications newer than this one should also be added to the same queue. The MPI process trying to perform Lock0(x) should then be allowed to complete, and the non-MPI communication should be allowed to resume only when all the MPI communications are complete.
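The four rules above can be read as a small transition table, keyed by which lock is currently held and which lock is being requested. The following is only a sketch of that reading: the names `lock_kind`, `action`, and `next_action` are hypothetical and are not part of Open MPI, and the sketch says nothing about how a "pause" would actually be carried out.

```c
#include <assert.h>

/* Hypothetical names for illustration only; none of this is Open MPI API. */
enum lock_kind { LOCK0_MPI = 0, LOCK1_NON_MPI = 1 };

enum action {
    PROCEED,            /* rules 1 and 2: same kind may follow itself      */
    REFUSE,             /* rule 3: non-MPI may not follow a held MPI lock  */
    PREEMPT_AND_QUEUE   /* rule 4: MPI arrives while non-MPI holds the lock;
                           the non-MPI work is to be paused and queued     */
};

/* Look up what the proposed algorithm asks for when `requested` arrives
 * while `held` is the kind of lock currently taken. */
static enum action next_action(enum lock_kind held, enum lock_kind requested)
{
    static const enum action table[2][2] = {
        /* held = Lock0 (MPI)     */ { PROCEED,           REFUSE  },
        /* held = Lock1 (non-MPI) */ { PREEMPT_AND_QUEUE, PROCEED },
    };
    assert(held == LOCK0_MPI || held == LOCK1_NON_MPI);
    assert(requested == LOCK0_MPI || requested == LOCK1_NON_MPI);
    return table[held][requested];
}
```

Writing the rules this way makes it visible that only one cell, Lock1 followed by Lock0, requires interrupting work that has already started.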

Currently we are working on a simple scheduling algorithm without giving any priorities to the 'MPI_Send' calls.

However, to implement the project fully, we have the following queries :(
- Can we abort or pause the non-MPI/TCP communication in any way?

Not really; the BTL interface was not really designed for that. Indeed, even if you wrote your own socket code to use TCP sockets outside of MPI / BTL / etc., you don't have full control of exactly what is sent (or when). For example, if you write(fd, ...) and then decide you want to pause it, how would you do so? You can stop calling write(), but that's not enough. The kernel may have copied your buffer to a lower level and may be progressing the actual send behind the scenes. So you haven't *guaranteed* that only one network interface is utilizing the host's resources (RAM, kernel, memory buses, etc.) at one time.
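To make the write() point concrete, here is a minimal sketch. It assumes a POSIX system and, to stay self-contained, uses a local socketpair() stream socket as a stand-in for a TCP connection; the function name is hypothetical. The call to write() returns a byte count even though the peer never reads, which is exactly the state in which the application no longer has a handle on the data to "pause" it.

```c
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns how many bytes the kernel accepted from
 * write() before the peer has read anything, or -1 on error.
 * A UNIX-domain socket pair stands in for a TCP connection here. */
static ssize_t bytes_accepted_before_any_read(void)
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }

    const char msg[] = "low-priority payload";

    /* The peer (fds[1]) never calls read(), yet write() still succeeds:
     * the data now sits in a kernel buffer that we cannot recall. */
    ssize_t n = write(fds[0], msg, sizeof msg);

    close(fds[0]);
    close(fds[1]);
    return n;
}
```

With a real TCP socket the kernel may additionally start transmitting those bytes on its own schedule, so ceasing to call write() only stops *new* data from entering the pipeline.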

Indeed, the BTL interface is designed to acknowledge this asynchronicity -- it *assumes* that all network actions are non-blocking such that a "Send" action only *begins* the send; completion occurs later.
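As a rough illustration of "begin now, complete later", the sketch below separates starting a send from progressing it, with a callback fired on completion. This is a toy model under stated assumptions, not the BTL API: `pending_send`, `send_begin`, `send_progress`, and `count_completion` are all hypothetical names, and "moving bytes" is simulated by decrementing a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical types and names; this is not the Open MPI BTL interface. */
typedef void (*completion_fn)(void *ctx);

struct pending_send {
    size_t        remaining;    /* bytes not yet (pretend-)transmitted */
    completion_fn on_complete;  /* called once, when remaining hits 0  */
    void         *ctx;
};

/* Begin a send: records the request and returns immediately. */
static void send_begin(struct pending_send *p, size_t len,
                       completion_fn cb, void *ctx)
{
    assert(p != NULL);
    p->remaining   = len;
    p->on_complete = cb;
    p->ctx         = ctx;
}

/* Progress a send by up to `chunk` bytes.
 * Returns 1 if the send completed during this call, 0 otherwise. */
static int send_progress(struct pending_send *p, size_t chunk)
{
    if (p->remaining == 0) {
        return 0;                       /* nothing left to do */
    }
    p->remaining -= (chunk < p->remaining) ? chunk : p->remaining;
    if (p->remaining == 0) {
        if (p->on_complete != NULL) {
            p->on_complete(p->ctx);     /* completion happens here, later */
        }
        return 1;
    }
    return 0;
}

/* Example completion callback: counts completions in an int. */
static void count_completion(void *ctx)
{
    ++*(int *)ctx;
}
```

The caller learns that a send is finished only through the callback, which is why any scheduling decision has to be made at the point of starting a send rather than in the middle of one.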

So even if you use the TCP BTL to queue up a bunch of writes, if you then get an IB BTL send request, there isn't a good way to tell the TCP BTL "stop doing anything until I tell you otherwise" (i.e., don't process incoming reads and don't progress any further writes). :-\

- Given the assumption that the non-MPI communication is TCP, can we make use of the built-in structures (I mean the buffer already used) in mca_btl_tcp_send() to implement point 4 of the algorithm above? And, more importantly, how?

Not really :-(. The BTLs, by design, are mutually unaware of each other. In fact, the BTLs are quite dumb (as intended). The design has the caller perform any higher-level coordination, while the BTLs are simple bit-movers between processes.

Using the BTLs directly, the best you might be able to do is to stop queuing up new messages to a secondary BTL until you have completions from all pending traffic on a primary BTL. That might still be interesting, but it may not give you everything that you want -- especially since a) I'm guessing that your ultimate goal may be to multi-schedule multiple communication libraries across the *same* interconnect, and b) given the asynchronous nature of parallel computing, you might be able to do a half-decent job of *sending* scheduling, but you may not be able to predict the behavior of *receive* scheduling (e.g., how can you predict/schedule that a low priority receive would not be occurring at the same time on the same node as a high priority send?).
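A caller-side gate along those lines might look roughly like the sketch below. Everything in it is an assumption made for illustration: the `gate` structure, the `post_fn` callback standing in for "hand this message to the secondary BTL", and the fixed-size hold queue are hypothetical and not taken from Open MPI. It only controls when *new* secondary sends are started; it does nothing about sends already in progress or about receives.

```c
#include <assert.h>
#include <stddef.h>

#define GATE_CAPACITY 16   /* arbitrary size chosen for the sketch */

/* Hypothetical caller-side state; not an Open MPI structure. */
struct gate {
    int         primary_pending;       /* primary sends begun, not completed */
    const void *held[GATE_CAPACITY];   /* secondary messages being held back */
    size_t      held_count;
};

/* Stand-in for "post this message to the secondary BTL". */
typedef void (*post_fn)(const void *msg, void *ctx);

static void gate_primary_begin(struct gate *g)
{
    g->primary_pending++;
}

/* Returns 1 if posted right away, 0 if held, -1 if the hold queue is full. */
static int gate_secondary_send(struct gate *g, const void *msg,
                               post_fn post, void *ctx)
{
    if (g->primary_pending == 0) {
        post(msg, ctx);
        return 1;
    }
    if (g->held_count == GATE_CAPACITY) {
        return -1;
    }
    g->held[g->held_count++] = msg;
    return 0;
}

/* Call from the primary completion callback.
 * Returns how many held secondary messages were released. */
static size_t gate_primary_complete(struct gate *g, post_fn post, void *ctx)
{
    assert(g->primary_pending > 0);
    if (--g->primary_pending > 0) {
        return 0;                       /* primary traffic still pending */
    }
    size_t released = g->held_count;
    for (size_t i = 0; i < released; i++) {
        post(g->held[i], ctx);          /* release in arrival order */
    }
    g->held_count = 0;
    return released;
}

/* Example post function: just counts how many messages were posted. */
static void count_post(const void *msg, void *ctx)
{
    (void)msg;
    ++*(int *)ctx;
}
```

Note that the gate lives entirely in the caller, consistent with the BTLs being unaware of each other.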

Regards,
Chaitali


--
Jeff Squyres
Cisco Systems
