[EMAIL PROTECTED] wrote on Tue, 22 Aug 2006 14:21 -0400:
> Ok. MX does this underneath the covers for me. If my send is small  
> (<32 KB by default), it buffers the data and the send can complete  
> immediately. Larger sends are held on the sender until a matching  
> receive is posted. If the receive is not posted before the sender  
> cancels the send, there is nothing else to clean up.

Lucky you.  :)

> This brings up a possible case. What if peer A sends what it thinks  
> is an expected small message to B, and B never posts a receive for  
> it? Since it is small, A's MX will buffer the message and send it to  
> B's MX lib. The sender will get an immediate completion. If B never  
> posts a matching receive, it will sit in MX's unexpected queue  
> indefinitely. How long should I let a message sit in the unexpected  
> queue before deleting it? I do not want to delete it immediately in  
> case the local BMI is about to post a matching receive for it.

That's tricky.  In practice it's not an issue, because all BMI users
currently do a request/response from client to server.  You'll have
to hang onto it as long as the application is up; maybe you'll be
able to clear it when BMI_DROP_ADDR gets called for that particular
client.
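One way to act on that: stamp each buffered unexpected message on
arrival, sweep the queue periodically against a configurable age
limit, and clear a peer's entries wholesale when its address is
dropped.  A hedged C sketch of just that idea; all names here are
illustrative, not the real BMI/MX structures:

```c
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Illustrative unexpected-message queue entry (not real BMI/MX code). */
struct bmx_unexpected {
    struct bmx_unexpected *next;
    uint32_t peer;       /* identifier of the sending peer */
    time_t arrived;      /* arrival timestamp */
    /* buffered payload would hang here */
};

/* Drop entries older than max_age seconds (the "how long do I let it
 * sit" knob); returns the number of messages dropped. */
static int bmx_expire_unexpected(struct bmx_unexpected **head,
                                 time_t now, time_t max_age)
{
    int dropped = 0;
    struct bmx_unexpected **pp = head;

    while (*pp) {
        if (now - (*pp)->arrived > max_age) {
            struct bmx_unexpected *dead = *pp;
            *pp = dead->next;
            free(dead);
            dropped++;
        } else {
            pp = &(*pp)->next;
        }
    }
    return dropped;
}

/* On BMI_DROP_ADDR for a peer, everything it sent can go at once. */
static int bmx_drop_peer_unexpected(struct bmx_unexpected **head,
                                    uint32_t peer)
{
    int dropped = 0;
    struct bmx_unexpected **pp = head;

    while (*pp) {
        if ((*pp)->peer == peer) {
            struct bmx_unexpected *dead = *pp;
            *pp = dead->next;
            free(dead);
            dropped++;
        } else {
            pp = &(*pp)->next;
        }
    }
    return dropped;
}
```

The pointer-to-pointer walk keeps the unlink branch-free for the
head-of-list case; the age limit would be the tunable discussed above.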

> Also, MX does not limit the number of unexpected messages, but it  
> does place a limit on memory used for unexpected messages (perhaps  
> 2 MB by default; it is tunable). There are no queue pairs limiting  
> per-peer messages. If rate limiting (flow control) is a  
> requirement, let me know.

The reason I fret about flow control is how IB dies horribly if you
don't pay attention to it.  As long as MX silently drops the
message but doesn't kill the network, the client will happily retry
the unexpected message later.

> MX is connection-less. I will pre-post a bunch of receives for  
> unexpected messages with a special bit mask. For expected receives, I  
> will post using the BMI tag as well as an identifier for the remote  
> peer. I can then test on a specific, per-peer receive (mx_recv()  
> followed by mx_test()) or on any available (expected or not) receive  
> (mx_recv() followed by mx_test_any()).

That sounds fine.  If you have enough bits, you can store the
pointer to the peer state structure itself, rather than some
identifier, and maybe avoid a hash lookup.

In practice, BMI_testcontext() is most frequently used so you'll
probably end up polling for any rather than a specific peer.  If
that matters.
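If MX's 64-bit match information has room, the unexpected flag, peer
identifier, and BMI tag can all be packed into it; pre-posted
unexpected receives then match on just the flag bit, while expected
receives match all 64 bits.  A hedged C sketch of one possible layout
-- the field widths and BMX_* names are assumptions for illustration,
not the real method's:

```c
#include <stdint.h>

/* Assumed 64-bit match layout:
 *   bit  63     : 1 = unexpected message, 0 = expected
 *   bits 62..32 : peer identifier (31 bits -- enough for a table
 *                 index, or bits of the peer-state pointer itself)
 *   bits 31..0  : BMI tag
 */
#define BMX_UNEXPECTED_BIT (UINT64_C(1) << 63)
#define BMX_PEER_SHIFT     32
#define BMX_PEER_MASK      UINT64_C(0x7fffffff)
#define BMX_TAG_MASK       UINT64_C(0xffffffff)

static uint64_t bmx_pack_match(int unexpected, uint32_t peer, uint32_t tag)
{
    uint64_t m = ((uint64_t)(peer & BMX_PEER_MASK) << BMX_PEER_SHIFT)
               | (uint64_t)tag;
    return unexpected ? (m | BMX_UNEXPECTED_BIT) : m;
}

static uint32_t bmx_match_peer(uint64_t m)
{
    return (uint32_t)((m >> BMX_PEER_SHIFT) & BMX_PEER_MASK);
}

static uint32_t bmx_match_tag(uint64_t m)
{
    return (uint32_t)(m & BMX_TAG_MASK);
}
```

A receive posted with mask BMX_UNEXPECTED_BIT would then pull in any
unexpected message regardless of peer or tag, which is roughly the
"special bit mask" plan described above.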

> >To a BMI user, an unexpected message always signals the start of a
> >new transaction, while an expected message continues an existing
> >transaction for which you've got outstanding state.  Phil said in
> >his CAC paper, "This reduces complexity on the server side because
> >the server does not have to anticipate buffer use in advance."
> 
> Can you clarify the "you" in "while an expected message continues  
> an existing transaction for which you've got outstanding state"?  
> Does "you" mean the BMI/MX method or BMI/PVFS? From what it looks  
> like, I have no state (unless I need to rate limit sends).

Heh.  "you" being the pvfs2-server process, for instance.  A
transaction like an IO flow consists of many messages that are all
related to the same request/response pair.  The server will remember
things like the file being used, offset for new data to be written,
etc.  This is above BMI.

You will need a teensy bit of per-peer state to translate BMI's
struct method_addr into whatever MX uses to address remote nodes and
processes (and back).  Hangs off a void* in there.  You may also want
to hold onto a user-understandable representation of the peer
identity for error or debug messages (char *peername in IB).

Presumably you'll have a queue of outstanding sends and receives
somewhere too, but this doesn't have to be per-peer in your case.

> >>Is the purpose of the RTS/CTS messages then to stall the sending
> >>until the receiver has posted the receive?
> >
> >Yes.   (I'm just talking about IB again.  GM may be similar.)
> 
> This is unnecessary then for MX. If the send is expected, I can  
> simply call mx_isend() (same semantics as MPI_Isend()). If it is  
> small, MX will buffer and send it. If large, it will wait until the  
> peer posts a receive. If the peer fails to post a receive or if it is  
> gone (crashed, etc.), can I assume that BMI or higher will manage the  
> timeout and call BMI_method_cancel() on the send?
> 
> If so, I do not need either RTS or CTS messages.

That's a safe assumption.  You will have to get the tag matching
correct to make sure messages demux into the right spots for the
case of multiple concurrent clients, or even a multi-threaded
client doing multiple requests, but that's hopefully doable.

> >>If so, would the receiver
> >>ever send a CTS to indicate that a match is not forthcoming?
> >
> >No.  Not sure how such information would help the sender.
> 
> This is a possible case in Lustre. If the receiver cannot find a  
> matching buffer (either bad request or lack of resources), it will  
> let me know to NAK the send request (send a CTS that indicates failure).

That's reasonable.  In PVFS this is figured out before BMI gets told
to send a big message, but arguably the network layer could be more
involved in the storage protocol.

> Who manages the timeout? The BMI method (i.e. me) or something  
> higher in BMI/PVFS? Can I assume all sends and receives are subject  
> to timeout?


Higher in PVFS (not BMI).  Can't really assume it though.  Sometimes
we turn it off for debugging.  But normal use wants a configurable
timeout to detect peer failure.  It's 30 sec by default, in fs.conf.
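For reference, that timeout surfaces as a config knob.  The option
names below are remembered from later PVFS2 releases and are
assumptions, not checked against this era's fs.conf:

```
<Defaults>
    # BMI job timeouts, in seconds (option names are assumptions)
    ServerJobBMITimeoutSecs  30
    ClientJobBMITimeoutSecs  300
</Defaults>
```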

                -- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers