On Aug 22, 2006, at 3:34 PM, Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Tue, 22 Aug 2006 14:21 -0400:
Ok. MX does this underneath the covers for me. If my send is small
(<32 KB by default), it buffers the data and the send can complete
immediately. Larger sends are held on the sender until a matching
receive is posted. If the receive is not posted before the sender
cancels the send, there is nothing else to clean up.
Lucky you. :)
See, Patrick's rant was worthwhile. ;-)
This brings up a possible case. What if peer A sends what it thinks
is an expected small message to B and B never posts a receive for it?
Since it is small, A's MX will buffer the message and send to B's MX
lib. The sender will get an immediate completion. If B never
posts a matching receive, it will sit in the MX unexpected queue
indefinitely. How long should I let a message sit in the unexpected
queue before deleting it? I do not want to delete it immediately in
case the local BMI is about to post a matching receive for it.
That's tricky. In practice it's not an issue because all BMI users
do a request/response from client to server currently. You'll have
to hang onto it as long as the application is up, maybe you'll be
able to clear it when BMI_DROP_ADDR gets called for that particular
client.
Hanging on to it is the default. :-)
I could add the BMI_DROP_ADDR function and clean up. If it is never
called, the message will just sit in the MX unexpected queue and take
up space. I think I can make it so that on shutdown I can dump the MX
state including messages in the various queues. It could be helpful
for debugging.
Also, MX does not limit the number of unexpected messages, but it
does place a limit on memory used for unexpected messages (roughly
2 MB by default, I believe; it is tunable). There are no queue pairs
for limiting per-peer messages. If rate limiting (flow control) is a
requirement, let
me know.
The reason I fret about flow control is how IB dies horribly if you
don't pay attention to it. As long as MX silently drops the
message, but doesn't kill the network, the client will happily retry
the unexpected message later.
Trust me, I would prefer to avoid it given the hoops I have to jump
through in Lustre. In GM, the sender would flood the network until
the receiver posted a receive. This is not a problem in MX. I will
start without flow control and we will see how BMI behaves. We can
always retrofit it after the fact.
MX is connection-less. I will pre-post a bunch of receives for
unexpected messages with a special bit mask. For expected receives, I
will post using the BMI tag as well as an identifier for the remote
peer. I can then test for a specific, per-peer receive (mx_irecv()
followed by mx_test()) or for any available (expected or not) receive
(mx_irecv() followed by mx_test_any()).
That sounds fine. If you have enough bits you can store the pointer
for the peer state structure itself, rather than some identifier, to
maybe avoid a hash lookup.
In practice, BMI_testcontext() is most frequently used so you'll
probably end up polling for any rather than a specific peer. If
that matters.
To a BMI user, an unexpected message always signals the start of a
new transaction, while an expected message continues an existing
transaction for which you've got outstanding state. Phil said in
his CAC paper, "This reduces complexity on the server side because
the server does not have to anticipate buffer use in advance."
Can you clarify the "you" in "while an expected message continues an
existing transaction for which you've got outstanding state"? Does
"you" mean the BMI/MX method or BMI/PVFS? From what it looks like, I
have no state (unless I need to rate limit sends).
Heh. "you" being the pvfs2-server process, for instance. A
transaction like an IO flow consists of many messages that are all
related to the same request/response pair. The server will remember
things like the file being used, offset for new data to be written,
etc. This is above BMI.
You will need a teensy bit of per-peer state to translate BMI's
struct method_addr into whatever MX uses to address remote nodes and
processes (and back). Hangs off a void* in there. Maybe want to
hold onto a user-understandable representation of the peer identity
for error or debug messages (char *peername in IB).
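A minimal C sketch of the kind of per-peer record that could hang off
struct method_addr's private void*, as described above. The field
names and the mx_addr stand-in are illustrative assumptions, not PVFS2
source:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-peer state for an MX BMI method: translates BMI's
 * struct method_addr to an MX endpoint address (and back), and keeps a
 * human-readable name for error/debug output.  Names are assumptions. */
struct mx_peer {
    uint64_t    mx_addr;   /* stand-in for the resolved MX endpoint address */
    uint16_t    peer_id;   /* 16-bit id later packed into the match bits */
    const char *peername;  /* user-understandable identity, e.g. "node3:0" */
};

struct mx_peer *mx_peer_new(uint64_t addr, uint16_t id, const char *name)
{
    struct mx_peer *p = malloc(sizeof(*p));
    if (!p)
        return NULL;
    p->mx_addr  = addr;
    p->peer_id  = id;
    p->peername = name;
    return p;
}
```

Storing the peer pointer itself in the match bits, as suggested, would
avoid a hash lookup, at the cost of tying the encoding to pointer width.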
Sorry, I did not mean _any_ state in the MX method. Sam, Murali, and I
discussed these issues. I meant to say that the MX method would not have any
knowledge of how one BMI send (or receive) may or may not relate to
another BMI send (or receive).
Presumably you'll have a queue of outstanding sends and receives
somewhere too, but this doesn't have to be per-peer in your case.
I will probably have per-peer send queues so that I can queue
messages until I can build my MX endpoint address for that peer (call
mx_iconnect()). MX will manage the receive queue (I call mx_irecv())
and I simply test for completions using mx_test() or mx_test_any(). I
will need a queue of canceled requests to return in
BMI_method_testcontext().
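The per-peer send queue described above could look something like this
minimal sketch: sends accumulate until mx_iconnect() resolves the
peer's endpoint address, then the queue is drained. All names are
hypothetical; the real method would hand each entry to mx_isend():

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative per-peer send queue (singly linked, FIFO). */
struct queued_send {
    const void         *buffer;
    size_t              length;
    uint64_t            bmi_tag;
    struct queued_send *next;
};

struct send_queue {
    struct queued_send *head, *tail;
    int                 connected;  /* set once mx_iconnect() completes */
};

/* Append a send to the tail of the queue. */
void sq_push(struct send_queue *q, const void *buf, size_t len, uint64_t tag)
{
    struct queued_send *s = malloc(sizeof(*s));
    s->buffer = buf;
    s->length = len;
    s->bmi_tag = tag;
    s->next = NULL;
    if (q->tail)
        q->tail->next = s;
    else
        q->head = s;
    q->tail = s;
}

/* Drain the queue in order, handing each entry to send_fn (mx_isend()
 * in the real method); returns the number of sends issued. */
int sq_drain(struct send_queue *q, void (*send_fn)(const struct queued_send *))
{
    int n = 0;
    struct queued_send *s = q->head;
    while (s) {
        struct queued_send *next = s->next;
        if (send_fn)
            send_fn(s);
        free(s);
        s = next;
        n++;
    }
    q->head = q->tail = NULL;
    return n;
}
```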
Is the purpose of the RTS/CTS messages then to stall the sending
until the receiver has posted the receive?
Yes. (I'm just talking about IB again. GM may be similar.)
This is unnecessary then for MX. If the send is expected, I can
simply call mx_isend() (same semantics as MPI_Isend()). If it is
small, MX will buffer and send it. If large, it will wait until the
peer posts a receive. If the peer fails to post a receive or if it is
gone (crashed, etc.), can I assume that BMI or higher will manage the
timeout and call BMI_method_cancel() on the send?
If so, I do not need either RTS or CTS messages.
That's a safe assumption. You will have to get the tag matching
correct to make sure messages demux into the right spots for the
case of multiple concurrent clients, or even a multi-threaded
client doing multiple requests, but that's hopefully doable.
MX provides 64 bits of match info for sends/recvs. I plan to
partition the bits as follows:
bits    comment
0-3     msg_type (conn_req, conn_ack, expected, unexpected, ...?)
4-7     reserved for credits (if used)
8-15    reserved
16-31   peer id (16 bits => 65K peers)
32-63   BMI tag
The conn_req and conn_ack message types allow me to do a simple
handshake to establish MX state (hostname, NIC index, endpoint index)
and agree on a version number. I'm reserving bits for credits should
flow control be necessary and another 8 bits for future use if
needed. I am assuming that 65K peers will be enough for a while. If
you think that PVFS will be deployed on systems with more than 65K
peers, I can pull from the reserved bits to increase this id. Lastly,
I will pass the BMI tag, which is 32 bits.
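The layout above can be expressed as a pair of pack/extract helpers.
This is a self-contained sketch following the table (4-bit msg_type,
4-bit credits, 8 reserved bits, 16-bit peer id, 32-bit BMI tag); the
macro and function names are illustrative, not the actual method code:

```c
#include <stdint.h>

/* Extract fields from a 64-bit MX match value (layout per the table). */
#define MX_MSG_TYPE(m)  ((uint64_t)(m) & 0xfULL)                /* bits 0-3   */
#define MX_CREDITS(m)   (((uint64_t)(m) >> 4)  & 0xfULL)        /* bits 4-7   */
#define MX_PEER_ID(m)   (((uint64_t)(m) >> 16) & 0xffffULL)     /* bits 16-31 */
#define MX_BMI_TAG(m)   (((uint64_t)(m) >> 32) & 0xffffffffULL) /* bits 32-63 */

/* Pack the fields into the 64 bits of match info (bits 8-15 reserved). */
uint64_t mx_pack_match(unsigned type, unsigned credits,
                       unsigned peer_id, uint32_t bmi_tag)
{
    return  ((uint64_t)(type    & 0xf))          |
            ((uint64_t)(credits & 0xf)    << 4)  |
            ((uint64_t)(peer_id & 0xffff) << 16) |
            ((uint64_t)bmi_tag            << 32);
}
```

A mask of all ones over bits 0-3 would then let the pre-posted
unexpected receives match on msg_type alone, while expected receives
match on the full peer id + BMI tag.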
I am assuming that a call to BMI_post_send() will have a matching
call to BMI_post_recv() on the remote peer and that both will share
the same BMI tag. If not, then please let me know.
If so, would the receiver
ever send a CTS to indicate that a match is not forthcoming?
No. Not sure how such information would help the sender.
This is a possible case in Lustre. If the receiver cannot find a
matching buffer (either bad request or lack of resources), it will
let me know to NAK the send request (send a CTS that indicates
failure).
That's reasonable. In PVFS this is figured out before BMI gets told
to send a big message, but arguably the network layer could be more
involved in the storage protocol.
Who manages the timeout? The BMI method (i.e. me) or something higher
in BMI/PVFS? Can I assume all send and receives are subject to
timeout?
Higher in PVFS (not BMI). Can't really assume it though. Sometimes
we turn it off for debugging. But normal use wants a configurable
timeout to detect peer failure. It's 30 sec by default, in fs.conf.
-- Pete
Excellent. This is all I think I need for now (except for
confirmation of my assumption above about the sender and receiver
using matching tags).
Thanks,
Scott
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers