Hi all,
I have outlined what I intend to write for bmi_mx. Since MX is
designed to implement the MPI API, it closely matches the major
functions of PVFS (e.g. send, recv, test, test_any (aka testcontext),
etc.). Like MPI, MX provides for matching tags. Specifically, it
allows for 64 bits. I will use these bits to indicate the message
type (EXPECTED or UNEXPECTED for PVFS messages as well as for
connection messages for bmi_mx).
MX also does dynamic memory registration. There are no calls to
[un]register memory. BMI_mx_memalloc() and BMI_mx_memfree() will simply
be malloc() and free().
In the text below, I do not mention posting unexpected receives. MX
has an unexpected callback handler. I will register a simple function
that gets an idle rx descriptor and posts a matching receive
immediately. This is preferred to pre-posting a bunch of generic,
unexpected receives, which slows down matching of expected receives.
I talked with another developer at lunch about PVFS and the lack of
a dedicated thread for bmi_mx, and he will add a completion handler
function to the MX API. This will allow me to progress my internal
connection management messages without waiting for calls to
BMI_mx_testcontext() or BMI_testunexpected().
Lastly, cancellation of posted sends (and partially matched receives)
is a problematic operation. MX cannot guarantee that a peer has not
actually received a message when we try to cancel a send. The only
thing we can guarantee is that we can release the local buffer (i.e.
the peer will not be able to read from it from this point on).
Internally, MX needs to clean up a lot of state to handle a send
cancellation, so we recently added a new function to the API called
mx_disconnect(). This call cancels all outstanding operations with
the peer (posted sends and matched receives). I will call it when I
am asked to cancel a send or a receive that has already been matched
but not completed (i.e. partially received) and when a peer is
sending an internal bmi_mx connect message.
Any comments or suggestions?
Scott
--
Scott Atchley
Myricom Inc.
http://www.myri.com

Partition of Match Bits

bits  field     comment
  4   msg_type  /* MX connect msgs, bmi_mx connect msgs (conn_req,
                   conn_ack), expected, unexpected, ... */
  4   credits   /* reserve in case credits are used */
  4   reserved  /* for future use */
 20   id        /* receiver assigned id for the peer for posting rxs
                   for a specific peer to distinguish rxs with the
                   same BMI tag */
 32   tag       /* the BMI tag */
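
To make the partition concrete, here is a rough sketch of how the 64 match bits could be packed and unpacked. The field widths come from the table above. The shift positions (fields packed high-to-low in the order listed), the numeric msg_type values, and all bmx_* names are my assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Widths are from the table; the bit positions are an assumption
 * (msg_type in the top 4 bits, tag in the low 32 bits). */
#define BMX_TYPE_SHIFT     60
#define BMX_CREDITS_SHIFT  56
#define BMX_RESERVED_SHIFT 52
#define BMX_ID_SHIFT       32
#define BMX_ID_BITS        20
#define BMX_ID_MAX         ((1u << BMX_ID_BITS) - 1)

/* Hypothetical numbering; the outline does not assign values. */
enum bmx_msg_type {
    BMX_MSG_ICONN_CR   = 1,
    BMX_MSG_ICONN_CA   = 2,
    BMX_MSG_CONN_REQ   = 3,
    BMX_MSG_CONN_ACK   = 4,
    BMX_MSG_EXPECTED   = 5,
    BMX_MSG_UNEXPECTED = 6
};

/* Build match info as (msg_type | peer_id | tag). */
static inline uint64_t bmx_make_match(enum bmx_msg_type type,
                                      uint32_t id, uint32_t tag)
{
    assert(id <= BMX_ID_MAX);   /* id must fit in 20 bits */
    return ((uint64_t)type << BMX_TYPE_SHIFT) |
           ((uint64_t)id << BMX_ID_SHIFT) |
           (uint64_t)tag;
}

static inline unsigned bmx_match_type(uint64_t match)
{
    return (unsigned)(match >> BMX_TYPE_SHIFT) & 0xFu;
}

static inline uint32_t bmx_match_id(uint64_t match)
{
    return (uint32_t)(match >> BMX_ID_SHIFT) & BMX_ID_MAX;
}

static inline uint32_t bmx_match_tag(uint64_t match)
{
    return (uint32_t)(match & 0xFFFFFFFFu);
}

/* Mask that selects only the msg_type bits, e.g. for a test-any call
 * that should match every message of one type. */
static inline uint64_t bmx_type_mask(void)
{
    return (uint64_t)0xF << BMX_TYPE_SHIFT;
}
```

Keeping msg_type in the top bits means a single mask can select "any EXPECTED message" regardless of id and tag.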

Conn Req msg:
    bmi_mx version    /* change if protocol changes */
    name              /* MX hostname excluding board */
    board             /* MX board */
    endpoint          /* MX endpoint */
    peer_id (for B)   /* Have peer use this id when sending to me */
    [credits]         /* max msgs in-flight */

Conn Ack msg:
    peer_id (for A)   /* or negative value if version mismatch, etc. */
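
As a possible shape for these two messages, a sketch follows. The field list is from the outline above. The field sizes, the BMX_VERSION and BMX_HOSTNAME_MAX values, and the helper names are assumptions, not a settled wire format.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BMX_VERSION      1    /* hypothetical; bump if protocol changes */
#define BMX_HOSTNAME_MAX 64   /* hypothetical limit */

/* Conn Req payload; sizes are assumptions. */
struct bmx_conn_req {
    uint32_t version;                 /* bmi_mx version */
    char     name[BMX_HOSTNAME_MAX];  /* MX hostname excluding board */
    uint32_t board;                   /* MX board */
    uint32_t endpoint;                /* MX endpoint */
    uint32_t peer_id;                 /* id the peer must use toward me */
    uint32_t credits;                 /* optional: max msgs in-flight */
};

/* Fill a Conn Req on the connecting side (A). */
static void bmx_fill_conn_req(struct bmx_conn_req *req, const char *host,
                              uint32_t board, uint32_t endpoint,
                              uint32_t id_for_peer)
{
    assert(req != NULL && host != NULL);
    memset(req, 0, sizeof(*req));
    req->version = BMX_VERSION;
    /* memset above guarantees NUL termination */
    strncpy(req->name, host, BMX_HOSTNAME_MAX - 1);
    req->board = board;
    req->endpoint = endpoint;
    req->peer_id = id_for_peer;
}

/* On the accepting side (B): return the id to place in the Conn Ack,
 * or a negative value on version mismatch. */
static int32_t bmx_answer_conn_req(const struct bmx_conn_req *req,
                                   uint32_t id_for_sender)
{
    if (req->version != BMX_VERSION)
        return -1;
    return (int32_t)id_for_sender;
}
```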

Initial Send Unex

    A                       B
    |     mx_iconnect()     |
    |---------------------->|  /* msg_type bits = ICONN_CR */
    |     conn_req msg      |  /* msg_type bits = CONN_REQ */
    |---------------------->|
    |                       |
    |     mx_iconnect()     |
    |<----------------------|  /* msg_type bits = ICONN_CA */
    |     conn_ack msg      |  /* stuff id in bits, use 0 byte msg? */
    |<----------------------|  /* msg_type bits = CONN_ACK */
    |                       |

Peer State:
    peername             /* mx://host:board:endpoint */
    name                 /* host */
    board                /* board */
    endpoint             /* endpoint */
    my_id                /* id assigned to me by peer */
    peer_id              /* id I assigned to peer */
    mx_nic_id            /* peer's MX nic_id */
    mx_endpoint_addr_t   /* MX endpoint address */
    state                /* INIT, WAIT, READY, DISCONNECT */
    qlist queued_sends   /* need connect, need tx descriptor */
    qlist queued_recvs   /* in DISCONNECT, wait until state back to
                            INIT, then post */
    qlist pending_recvs  /* in-flight recvs (in case of cancel) */
    qlist peers          /* for hanging on the global list of peers */
    lock                 /* for serialization */

Msg Descriptor (tx/rx):
    type               /* TX/RX */
    msg_type           /* CONN_REQ, CONN_ACK, EXPECTED, UNEXPECTED */
    qlist global_list  /* hang on global list of TX or RX for cleanup */
    qlist list         /* hang on idle, queued and pending lists */
    method_op          /* BMI op */
    state              /* IDLE, PREP, PENDING, COMPLETED, CANCELED */
    peer               /* owning peer */
    tag                /* BMI supplied 32-bit msg id */
    match_info         /* MX match info (msg type [ | peer_id]) */
    mxseg              /* mx_segment_t for small messages */
    buffer             /* void * for small msg */
    *mxsegs            /* array of mx_segments pointing to list of bufs */
    nseg               /* number of segments */
    nob                /* number of bytes */
    lock               /* might not be needed */

Global State:
    peername              /* mx://host:board:endpoint */
    name                  /* host */
    board                 /* board */
    endpoint              /* endpoint */
    qlist peers           /* list of peers */
    qlist txs             /* list of txs (for cleanup) */
    qlist idle_txs        /* available txs */
    qlist rxs             /* list of rxs (for cleanup) */
    qlist idle_rxs        /* available rxs */
    qlist cancelled_reqs  /* called mx_cancel(), return in
                             bmi_testcontext() */
    next_id               /* id for next peer [1...2^20 - 1], must fit
                             in the 20-bit id field */
    lock                  /* for serialization */
Peer State on Client
Peer state starts at INIT. When calling mx_iconnect(), set state to
WAIT. When the mx_iconnect() request returns, post a receive for
CONN_ACK and send a CONN_REQ msg. When the CONN_ACK completes, set
state to READY. For each tx and rx, add a reference for the peer.
When any tx or rx completes, decrement the reference. If a request to
cancel a msg sets the state to DISCONNECT, wait until all pending txs
and rxs complete and decrement their references, set the state to
INIT and start over.
Peer State on Server
When a CONN_REQ rx completes, retrieve the peer info from the
endpoint addr context. If none is found, create a new peer. If found,
set state to DISCONNECT and cancel pending_recvs. Set the peer state
to INIT. Call mx_iconnect() and set the state to WAIT. When the
mx_iconnect() request returns, send a CONN_ACK and set state to
READY. For each tx and rx, add a reference for the peer. When any tx
or rx completes, decrement the reference. If a request to cancel a msg
sets the state to DISCONNECT, wait until all pending txs and rxs
complete and decrement their references, set the state to INIT and
start over.
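
The reference counting described in both of the paragraphs above could look roughly like this. It is a sketch under my own naming (bmx_peer, bmx_peer_addref, and so on are hypothetical) and it leaves out locking and the queued lists.

```c
#include <assert.h>

enum bmx_peer_state {
    BMX_PEER_INIT,
    BMX_PEER_WAIT,
    BMX_PEER_READY,
    BMX_PEER_DISCONNECT
};

struct bmx_peer {
    enum bmx_peer_state state;
    int refs;   /* one reference per in-flight tx or rx */
};

/* Take a reference when a tx or rx is posted. Nothing may be posted
 * while the peer is draining in DISCONNECT. */
static void bmx_peer_addref(struct bmx_peer *peer)
{
    assert(peer->state != BMX_PEER_DISCONNECT);
    peer->refs++;
}

/* Drop a reference when a tx or rx completes. Returns 1 if this was
 * the last reference of a disconnecting peer, which moves the peer
 * back to INIT so that the connect sequence can start over. */
static int bmx_peer_decref(struct bmx_peer *peer)
{
    assert(peer->refs > 0);
    peer->refs--;
    if (peer->state == BMX_PEER_DISCONNECT && peer->refs == 0) {
        peer->state = BMX_PEER_INIT;
        return 1;
    }
    return 0;
}

/* A cancel request starts the drain. With nothing in flight the peer
 * can return to INIT immediately. */
static void bmx_peer_disconnect(struct bmx_peer *peer)
{
    peer->state = (peer->refs == 0) ? BMX_PEER_INIT : BMX_PEER_DISCONNECT;
}
```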

BMI_mx_post_send_common()
    get idle tx
    lookup peer using peername
    if no peer
        /* should happen on client only */
        create peer
    assign msg_type, peer, tag, method_op, nob
    create match_info (msg_type | peer_id | tag)
    map segment(s)
    if unexpected
        ensure length < EAGER_SIZE
    switch peer state
    case READY
        add reference count on peer
        send tx
        break
    case INIT
        call mx_iconnect
        /* fall through */
    case WAIT
    case DISCONNECT
        append to queued_sends
        break
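
The switch at the end of BMI_mx_post_send_common() is the interesting part, so here is a sketch of just that step. The stub_* functions stand in for the real MX connect and send calls, and the struct layouts and names are hypothetical. Setting the state to WAIT after starting the connect follows the client description earlier in this mail.

```c
#include <assert.h>
#include <stddef.h>

enum bmx_peer_state {
    BMX_PEER_INIT,
    BMX_PEER_WAIT,
    BMX_PEER_READY,
    BMX_PEER_DISCONNECT
};

struct bmx_tx {
    struct bmx_tx *next;   /* link for queued_sends */
};

struct bmx_peer {
    enum bmx_peer_state state;
    int refs;
    struct bmx_tx *queued_head;   /* queued_sends, FIFO order */
    struct bmx_tx *queued_tail;
    int connects_started;         /* stub bookkeeping only */
    int sent;                     /* stub bookkeeping only */
};

/* Stand-ins for the real MX connect and send calls. */
static void stub_iconnect(struct bmx_peer *peer) { peer->connects_started++; }
static void stub_send(struct bmx_peer *peer, struct bmx_tx *tx)
{
    (void)tx;
    peer->sent++;
}

static void bmx_queue_send(struct bmx_peer *peer, struct bmx_tx *tx)
{
    tx->next = NULL;
    if (peer->queued_tail != NULL)
        peer->queued_tail->next = tx;
    else
        peer->queued_head = tx;
    peer->queued_tail = tx;
}

/* Send now if the peer is READY, otherwise queue the tx. A peer in
 * INIT also gets its connect started exactly once. */
static void bmx_dispatch_tx(struct bmx_peer *peer, struct bmx_tx *tx)
{
    assert(peer != NULL && tx != NULL);
    switch (peer->state) {
    case BMX_PEER_READY:
        peer->refs++;
        stub_send(peer, tx);
        break;
    case BMX_PEER_INIT:
        stub_iconnect(peer);
        peer->state = BMX_PEER_WAIT;
        /* fall through */
    case BMX_PEER_WAIT:
    case BMX_PEER_DISCONNECT:
        bmx_queue_send(peer, tx);
        break;
    }
}
```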

BMI_mx_post_send()
    call BMI_mx_post_send_common()

BMI_mx_post_send_list()
    call BMI_mx_post_send_common()

BMI_mx_post_sendunexpected()
    call BMI_mx_post_send_common() with unexpected flag

BMI_mx_post_sendunexpected_list()
    call BMI_mx_post_send_common() with unexpected flag

BMI_mx_post_recv()
    get idle rx
    lookup peer using peername
    if no peer
        /* should happen on client only */
        create peer
    assign msg_type, peer, tag, method_op, nob
    create match_info (msg_type | peer_id | tag)
    map segment(s)
    if unexpected
        ensure length < EAGER_SIZE
    switch peer state
    case INIT
    case WAIT
    case READY
        add reference count on peer
        queue on pending_recvs
        post rx
        break
    case DISCONNECT
        /* we can't post it and add a ref if in DISCONNECT
           because we need the ref count to go to 0 before
           the state goes back to INIT */
        queue on queued_recvs
        break

BMI_mx_post_recv_list()
    call BMI_mx_post_recv()

BMI_mx_test()
    mx_test()

BMI_mx_testcontext()
    handle_conn_reqs()
    for 1 to incount
        dequeue from cancelled_reqs
        set outid, err, user_ptr
        queue idle tx/rx
    for completed to incount
        mx_test_any() with EXPECTED bit mask
        set outid, err, size, user_ptr
        if rx
            dequeue from pending_recvs
        queue idle tx/rx
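
The ordering in BMI_mx_testcontext() (cancelled requests are reported first, then completions fill the remaining slots up to incount) can be sketched as below. Both sources are modelled as plain arrays of ids. In the real code the second source would be the test-any call with the EXPECTED mask, and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Toy source of request ids standing in for cancelled_reqs or for
 * completions returned by the MX test-any call. */
struct bmx_source {
    const int *ids;
    int count;
    int next;
};

/* Pop one id; returns 1 on success, 0 when the source is empty. */
static int bmx_pop(struct bmx_source *src, int *id)
{
    if (src->next >= src->count)
        return 0;
    *id = src->ids[src->next++];
    return 1;
}

/* Fill out_ids with at most incount entries: cancelled requests first,
 * then completed ones. Returns the number of entries written. */
static int bmx_testcontext_sketch(struct bmx_source *cancelled,
                                  struct bmx_source *completed,
                                  int incount, int *out_ids)
{
    int n = 0;

    assert(incount >= 0 && out_ids != NULL);
    while (n < incount && bmx_pop(cancelled, &out_ids[n]))
        n++;
    while (n < incount && bmx_pop(completed, &out_ids[n]))
        n++;
    return n;
}
```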

BMI_mx_testunexpected()
    handle_conn_reqs()
    mx_test_any() with UNEXPECTED bit mask
    if found
        update UI struct
        queue idle rx
        return 1
    else
        return 0

handle_conn_reqs()
    do
        mx_test_any() with ICONN_CR or ICONN_CA bit mask
        switch type
        case ICONN_CR
            if success
                get idle tx
                send CONN_REQ
            else
                set peer state to DISCONNECT
                drop queued rxs and txs
        case ICONN_CA
            if success
                get idle tx
                set peer state to READY
                send CONN_ACK
                send queued txs
            else
                set peer state to DISCONNECT
                drop queued rxs and txs
    while (request returned)
    do
        mx_test_any() with CONN_REQ or CONN_ACK bit mask
        switch type
        case TX
            handle CONN TX completion
        case RX
            handle CONN RX completion
    while (request returned)

handle CONN TX completion
    if failed
        set peer state to DISCONNECT
        drop queued rxs and txs
    put idle tx

handle CONN RX completion
    if CONN_REQ
        parse msg
        mx_iconnect() with ICONN_CA
        if the values don't match
            set peer state to DISCONNECT
    if CONN_ACK
        if success
            get my_id from match_info
            set peer state to READY
            send queued txs
        else
            set peer state to DISCONNECT
            drop pending rxs and txs
    put idle rx

BMI_mx_cancel()
    if rx
        mx_cancel(rx)
        if SUCCESS, return SUCCESS
        else
            mx_test(rx)
            if SUCCESS, return FAIL /* rx completed */
            else
                set peer state to DISCONNECT
                mx_disconnect()
                cancel pending_recvs
    else /* tx */
        set peer state to DISCONNECT
        mx_disconnect()
        cancel pending_recvs
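
The cancel decision above boils down to three outcomes, which this sketch separates from the MX calls themselves. The two int flags stand in for the results of the cancel and test calls on an rx, and the enum and function names are hypothetical.

```c
#include <assert.h>

enum bmx_cancel_action {
    BMX_CANCEL_DONE,        /* rx was unmatched and is now cancelled */
    BMX_CANCEL_COMPLETED,   /* rx already completed, cancel fails */
    BMX_CANCEL_DISCONNECT   /* must disconnect from the peer */
};

/* Decide what BMI_mx_cancel() has to do.
 *   is_rx         non-zero for a receive, zero for a send
 *   mx_cancel_ok  non-zero if cancelling the rx succeeded
 *   mx_test_done  non-zero if testing the rx shows it completed
 * The last two are ignored for sends, which always force a disconnect
 * because the peer may already have received the data. */
static enum bmx_cancel_action
bmx_decide_cancel(int is_rx, int mx_cancel_ok, int mx_test_done)
{
    if (!is_rx)
        return BMX_CANCEL_DISCONNECT;
    if (mx_cancel_ok)
        return BMX_CANCEL_DONE;
    if (mx_test_done)
        return BMX_CANCEL_COMPLETED;
    /* matched but only partially received */
    return BMX_CANCEL_DISCONNECT;
}
```

On BMX_CANCEL_DISCONNECT the caller would set the peer state to DISCONNECT, call mx_disconnect() and cancel pending_recvs, as in the outline.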

BMI_mx_method_addr_lookup()
    parse id
    lookup peer in peers list
    if !found
        create a new peer
    return method_addr *

BMI_mx_rev_lookup()
    return peer's peername

BMI_mx_set_info()      /* drop_addr (probe for unmatched, expected
                          messages and drop them) */

BMI_mx_get_info()      /* unexpected size, drop_addr (probe for
                          unmatched, expected messages and drop them) */

BMI_mx_initialize()
    alloc global peer state
    alloc pool of rxs and txs
    mx_init()
    mx_open_endpoint()
    mx_register_unexp_handler()

BMI_mx_finalize()
    mx_wakeup()
    mx_finalize()

BMI_mx_memalloc()      /* malloc() */
BMI_mx_memfree()       /* free() */
BMI_mx_open_context()  /* return 0 */
BMI_mx_close_context() /* return 0 */