Re: [OMPI devel] SOS... help needed :(

Jeff Squyres Mon, 16 Apr 2007 17:13:01 -0400

On Apr 15, 2007, at 10:25 PM, chaitali dherange wrote:

To make things simple, we are making this scheduling static to some extent... by static I mean.. we know that our clusters use Infiniband for MPI (from our study of the openmpi source code this precisely uses the 'mca_btl_openib_send()' from the ompi/mca/btl/openib/btl_openib.c file) ... so all the non MPI communication can be assumed to be TCP communication using the 'mca_btl_tcp_send()' from the ompi/mca/btl/tcp/btl_tcp.c file.
To implement this we plan to implement the following simple algorithm:

- before calling the 'mca_btl_openib_send()' lock0(X);
- before calling the 'mca_btl_tcp_send()' lock1(X);

Algo:
1. Allow Lock0(x) -> Lock0(x);.. meaning Lock0(x) is followed by Lock0(x).
2. Allow Lock1(x) -> Lock1(x);
3. Do not allow Lock0(x) -> Lock1(x);
4. If Lock1(x) -> Lock0(x).... since MPI calls are to be higher priority over the non MPI ones.. in this case the non MPI communication should be paused and all the related data of course needs to be put into a queue (meaning the status of this should be saved in a queue). All other non MPI communications newer than this should also be added to this same queue. Now the MPI process trying to perform Lock0(x) should be allowed to complete and only when all the MPI communications are complete should the non MPI communication be allowed.
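
One possible reading of the four rules above, written as a small decision function. This is only a sketch: the enum names and decide() are hypothetical, and it assumes "A -> B" means "A currently holds x and B is now requested".

```c
#include <assert.h>

/* Hypothetical names; Lock0 = MPI (openib) traffic, Lock1 = non-MPI (TCP). */
typedef enum { LOCK0_MPI = 0, LOCK1_NON_MPI = 1 } lock_kind;
typedef enum { ALLOW, DENY, PREEMPT_AND_QUEUE } decision;

/* Decide what happens when `requester` asks for x while `holder` has it. */
decision decide(lock_kind holder, lock_kind requester)
{
    assert(holder == LOCK0_MPI || holder == LOCK1_NON_MPI);
    assert(requester == LOCK0_MPI || requester == LOCK1_NON_MPI);

    if (holder == requester) {
        return ALLOW;               /* rules 1 and 2: same kind may follow */
    }
    if (holder == LOCK0_MPI) {
        return DENY;                /* rule 3: non-MPI may not follow MPI */
    }
    return PREEMPT_AND_QUEUE;       /* rule 4: MPI arrives, pause and queue non-MPI */
}
```

The function only classifies the transition; how to actually pause and queue the non-MPI traffic in the rule 4 case is the open question.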
Currently we are working on a simple scheduling algorithm without giving any priorities to the 'MPI_send' calls.
However to implement the project fully, we have the following queries :(
-Can we abort or pause the non-MPI/TCP communication in any way???

Not really; the BTL interface was not really designed for that. Indeed, even if you wrote your own socket code to use TCP sockets outside of MPI / BTL / etc., you don't have full control of exactly what is sent (or when). For example, if you write(fd, ...) and then decide you want to pause it, how would you do so? You can stop calling write(), but that's not enough. The kernel may have copied your buffer to a lower level and may be progressing the actual send behind the scenes. So you haven't *guaranteed* that only one network interface is utilizing the host's resources (RAM, kernel, memory busses, etc.) at one time.
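
To make the write() point concrete, here is a minimal sketch. It assumes a POSIX system and uses a local socketpair() as a stand-in for a TCP connection; the function name is made up for illustration. write() returns as soon as the kernel has accepted the bytes, before the peer has read anything, and from then on the application has no handle with which to pause them.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes msg (shorter than 256 bytes) into one end of a
 * local socket pair and returns how many bytes write() accepted BEFORE the
 * peer read anything. Returns -1 on any error. */
ssize_t bytes_accepted_before_any_read(const char *msg)
{
    int fds[2];
    char buf[256];

    assert(msg != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }

    /* Returns once the kernel has copied the data; nobody has read yet. */
    ssize_t accepted = write(fds[0], msg, strlen(msg));

    /* Only now does the peer read. The data was sitting in the kernel,
     * out of the sender's control, the whole time. */
    ssize_t got = (accepted > 0) ? read(fds[1], buf, sizeof(buf)) : -1;

    close(fds[0]);
    close(fds[1]);
    return (got == accepted) ? accepted : -1;
}
```

With a real TCP socket the kernel may additionally have started transmitting on the wire by the time write() returns, which is the part you cannot undo.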

Indeed, the BTL interface is designed to acknowledge this asynchronicity -- it *assumes* that all network actions are non-blocking such that a "Send" action only *begins* the send; completion occurs later.

So even if you use the TCP BTL to queue up a bunch of writes, if you then get an IB BTL send request, there isn't a good way to tell the TCP BTL "stop doing anything until I tell you otherwise" (i.e., don't process incoming reads and don't progress any further writes). :-\

-Given the assumption that the non-MPI communication is TCP, can we make use of the built in structures (I mean the buffer already used) in mca_btl_tcp_send() for the implementation of pt. 4 in the above mentioned algorithm??? and more importantly how?

Not really :-(. The BTLs, by design, are mutually unaware of each other. In fact, the BTLs are quite dumb (as intended). The design was to have the caller coordinate and perform any higher-level coordination and the BTLs are simple bit-movers between processes.

Using the BTLs directly, the best you might be able to do is to stop queuing up new messages to a secondary BTL until you have completions from all pending traffic on a primary BTL. That might still be interesting, but it may not give you everything that you want -- especially since a) I'm guessing that your ultimate goal may be to multi-schedule multiple communication libraries across the *same* interconnect, and b) given the asynchronous nature of parallel computing, you might be able to do a half-decent job of *sending* scheduling, but you may not be able to predict the behavior of *receive* scheduling (e.g., how can you predict/schedule that a low priority receive would not be occurring at the same time on the same node as a high priority send?).
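
A rough sketch of that "hold new secondary messages until the primary has drained" idea, done entirely on the caller's side. Everything here (gate_t, the gate_* functions, integer message ids, the fixed queue size) is hypothetical and is not part of the BTL interface; it assumes the caller invokes gate_primary_started() when it hands a send to the primary BTL and gate_primary_completed() from that send's completion callback.

```c
#include <assert.h>
#include <stddef.h>

#define GATE_MAX_DEFERRED 64

/* Hypothetical caller-side gate; none of these names exist in Open MPI. */
typedef struct {
    int    primary_pending;              /* primary sends begun, not yet completed */
    int    deferred[GATE_MAX_DEFERRED];  /* FIFO ring of held secondary message ids */
    size_t head;                         /* index of the oldest held id */
    size_t count;                        /* number of held ids */
} gate_t;

void gate_init(gate_t *g)
{
    g->primary_pending = 0;
    g->head = 0;
    g->count = 0;
}

/* Call when a send is handed to the primary (e.g. IB) BTL. */
void gate_primary_started(gate_t *g)
{
    g->primary_pending++;
}

/* Call from the primary send's completion callback. */
void gate_primary_completed(gate_t *g)
{
    assert(g->primary_pending > 0);
    g->primary_pending--;
}

/* Returns 1 if the caller may hand msg_id to the secondary BTL right now,
 * 0 if it was held back, -1 if the hold queue is full. */
int gate_secondary_submit(gate_t *g, int msg_id)
{
    if (g->primary_pending == 0 && g->count == 0) {
        return 1;
    }
    if (g->count == GATE_MAX_DEFERRED) {
        return -1;
    }
    g->deferred[(g->head + g->count) % GATE_MAX_DEFERRED] = msg_id;
    g->count++;
    return 0;
}

/* If the primary is idle and something is held, pops the oldest held id
 * into *msg_id and returns 1; otherwise returns 0. */
int gate_pop_ready(gate_t *g, int *msg_id)
{
    if (g->primary_pending > 0 || g->count == 0) {
        return 0;
    }
    *msg_id = g->deferred[g->head];
    g->head = (g->head + 1) % GATE_MAX_DEFERRED;
    g->count--;
    return 1;
}
```

The caller would loop on gate_pop_ready() after each primary completion and pass the released ids to the secondary BTL. Note that this only gates *new* secondary sends: anything already handed to the secondary BTL keeps progressing, and receives are not controlled at all, which are exactly the limits described above.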

Regards,
Chaitali



--
Jeff Squyres
Cisco Systems
