Hi all,

I have outlined what I intend to write for bmi_mx. Because MX was designed to support MPI, its API closely matches the major functions of PVFS (e.g. send, recv, test, test_any (aka testcontext), etc.). Like MPI, MX provides matching on tags; specifically, it provides 64 match bits. I will use some of these bits to indicate the message type (EXPECTED or UNEXPECTED for PVFS messages, plus separate types for bmi_mx's own connection messages).

MX also does dynamic memory registration. There are no calls to [un]register memory, so BMI_mx_memalloc() and BMI_mx_memfree() will simply be malloc() and free().

In the text below, I do not mention posting unexpected receives. MX has an unexpected callback handler. I will register a simple function that gets an idle rx descriptor and posts a matching receive immediately. This is preferred to pre-posting a bunch of generic, unexpected receives, which slows down matching of expected receives.

I talked with another developer at lunch about PVFS and the lack of a dedicated thread for bmi_mx, and he will add a completion handler function to the MX API. This will allow me to progress my internal connection management messages without waiting for calls to BMI_mx_testcontext() or BMI_mx_testunexpected().

Lastly, cancellation of posted sends (and partially matched receives) is a problematic operation. MX cannot guarantee that a peer has not actually received a message when we try to cancel a send. The only thing we can guarantee is that we can release the local buffer (i.e. the peer will not be able to read from it from this point on). Internally, MX needs to clean up a lot of state to handle a send cancellation, so we recently added a new function to the API called mx_disconnect(). This call cancels all outstanding operations with the peer (posted sends and matched receives). I will call it in two cases: when I am asked to cancel a send or a receive that has already been matched but not completed (i.e. partially received), and when a peer sends an internal bmi_mx connect message.

Any comments or suggestions?

Scott

--
Scott Atchley
Myricom Inc.
http://www.myri.com

Partition of Match Bits

bits    comment
4       msg_type        /* MX connect msgs, bmi_mx connect
                        msgs (conn_req, conn_ack), expected,
                        unexpected, ... */
4       credits         /* reserve in case credits are used */
4       reserved        /* for future use */
20      id              /* receiver assigned id for the peer
                        for posting rxs for a specific peer to
                        distinguish rxs with the same BMI tag */
32      tag             /* the BMI tag */
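
A minimal sketch of how the partition above could be packed into one 64-bit match value. The bit order (msg_type in the most significant bits, tag in the low 32) and all bmx_* names are assumptions for illustration; the outline only fixes the field widths.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, most significant bits first:
 * msg_type:4 | credits:4 | reserved:4 | id:20 | tag:32 */
#define BMX_TAG_BITS     32
#define BMX_ID_BITS      20
#define BMX_RSVD_BITS    4
#define BMX_CREDIT_BITS  4
#define BMX_TYPE_BITS    4

#define BMX_ID_SHIFT     BMX_TAG_BITS                          /* 32 */
#define BMX_RSVD_SHIFT   (BMX_ID_SHIFT + BMX_ID_BITS)          /* 52 */
#define BMX_CREDIT_SHIFT (BMX_RSVD_SHIFT + BMX_RSVD_BITS)      /* 56 */
#define BMX_TYPE_SHIFT   (BMX_CREDIT_SHIFT + BMX_CREDIT_BITS)  /* 60 */

/* Low-order mask with the given number of bits set. */
#define BMX_MASK(bits)   ((UINT64_C(1) << (bits)) - 1)

/* Build the match value; the reserved field is left as zero. */
static uint64_t bmx_make_match(uint8_t msg_type, uint8_t credits,
                               uint32_t id, uint32_t tag)
{
    assert(msg_type <= BMX_MASK(BMX_TYPE_BITS));
    assert(credits <= BMX_MASK(BMX_CREDIT_BITS));
    assert(id <= BMX_MASK(BMX_ID_BITS));

    return ((uint64_t)msg_type << BMX_TYPE_SHIFT) |
           ((uint64_t)credits << BMX_CREDIT_SHIFT) |
           ((uint64_t)id << BMX_ID_SHIFT) |
           (uint64_t)tag;
}

static uint8_t bmx_match_type(uint64_t match)
{
    return (uint8_t)((match >> BMX_TYPE_SHIFT) & BMX_MASK(BMX_TYPE_BITS));
}

static uint32_t bmx_match_id(uint64_t match)
{
    return (uint32_t)((match >> BMX_ID_SHIFT) & BMX_MASK(BMX_ID_BITS));
}

static uint32_t bmx_match_tag(uint64_t match)
{
    return (uint32_t)(match & BMX_MASK(BMX_TAG_BITS));
}
```

With this layout, msg_type 3, id 0x12345 and tag 0xdeadbeef pack to 0x30012345deadbeef, and each accessor recovers its field.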


Conn Req msg:
bmi_mx version          /* change if protocol changes */
name                    /* MX hostname excluding board */
board                   /* MX board */
endpoint                /* MX endpoint */
peer_id (for B)         /* Have peer use this id when sending to me */
[credits]               /* max msgs in-flight */

Conn Ack msg:
peer_id (for A)         /* or negative value if version mismatch, etc. */
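
One possible in-memory form of the conn_req fields, plus the value the receiver might put in the conn_ack. The field widths, BMX_VERSION, BMX_NAME_MAX and the specific negative codes are all assumptions; the outline only says the ack carries the id or a negative value on version mismatch, etc.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BMX_VERSION   1    /* hypothetical protocol version */
#define BMX_NAME_MAX  64   /* hypothetical hostname limit */

struct bmx_conn_req {
    uint32_t version;              /* bmi_mx version */
    char     name[BMX_NAME_MAX];   /* MX hostname excluding board */
    uint32_t board;                /* MX board */
    uint32_t endpoint;             /* MX endpoint */
    uint32_t peer_id;              /* id the receiver must use when sending back */
    uint32_t credits;              /* optional: max msgs in-flight */
};

/* Value for the conn_ack: the id assigned to the requester,
 * or a negative code if the request is rejected. */
static int32_t bmx_conn_ack_value(const struct bmx_conn_req *req,
                                  uint32_t id_for_requester)
{
    assert(req != NULL);

    if (req->version != BMX_VERSION)
        return -1;                 /* version mismatch */
    if (memchr(req->name, '\0', BMX_NAME_MAX) == NULL)
        return -2;                 /* hostname not terminated */
    return (int32_t)id_for_requester;
}
```

Because ids are at most 20 bits wide, they always fit in the positive range of an int32_t, which leaves the negative range free for error codes.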

Initial Send Unex

A                       B
|    mx_iconnect()      |
|---------------------->|  /* msg_type bits = ICONN_CR */
|    conn_req msg       |  /* msg_type bits = CONN_REQ */
|---------------------->|
|                       |
|    mx_iconnect()      |
|<----------------------|  /* msg_type bits = ICONN_CA */
|    conn_ack msg       |  /* stuff id in bits, use 0 byte msg? */
|<----------------------|  /* msg_type bits = CONN_ACK */
|                       |

Peer State:
peername                /* mx://host:board:endpoint */
name                    /* host */
board                   /* board */
endpoint                /* endpoint */
my_id                   /* id assigned to me by peer */
peer_id                 /* id I assigned to peer */
mx_nic_id               /* peer's MX nic_id */
mx_endpoint_addr_t      /* MX endpoint address */
state                   /* INIT, WAIT, READY, DISCONNECT */
qlist queued_sends      /* need connect, need tx descriptor */
qlist queued_recvs      /* in DISCONNECT, wait until state back to INIT,
                        then post */
qlist pending_recvs     /* in-flight recvs (in case of cancel) */
qlist peers             /* for hanging on the global list of peers */
lock                    /* for serialization */

Msg Descriptor (tx/rx):
type                    /* TX/RX */
msg_type                /* CONN_REQ, CONN_ACK, EXPECTED, UNEXPECTED */
qlist global_list       /* hang on global list of TX or RX for cleanup */
qlist list              /* hang on idle, queued and pending lists */
method_op               /* BMI op */
state                   /* IDLE, PREP, PENDING, COMPLETED, CANCELED */
peer                    /* owning peer */
tag                     /* BMI supplied 32-bit msg id */
match_info              /* MX match info (msg type [ | peer_id]) */
mxseg                   /* mx_segment_t for small messages */
buffer                  /* void * for small msg */
*mxsegs                 /* array of mx_segments pointing to list of bufs */
nseg                    /* number of segments */
nob                     /* number of bytes */
lock                    /* might not be needed */

Global State:
peername                /* mx://host:board:endpoint */
name                    /* host */
board                   /* board */
endpoint                /* endpoint */
qlist peers             /* list of peers */
qlist txs               /* list of txs (for cleanup) */
qlist idle_txs          /* available txs */
qlist rxs               /* list of rxs (for cleanup) */
qlist idle_rxs          /* available rxs */
qlist cancelled_reqs    /* called mx_cancel(), return in
                        BMI_mx_testcontext() */
next_id                 /* id for next peer [1...2^20 - 1], since the
                        id field is 20 bits wide */
lock                    /* for serialization */


Peer State on Client
Peer state starts at INIT. When calling mx_iconnect(), set state to WAIT. When the mx_iconnect() request returns, post a receive for CONN_ACK and send a CONN_REQ msg. When the CONN_ACK completes, set state to READY. For each tx and rx, add a reference for the peer. When any tx or rx completes, decrement the reference. If a request to cancel a msg sets the state to DISCONNECT, wait until all pending txs and rxs complete and decrement their references, set the state to INIT and start over.

Peer State on Server
When a CONN_REQ rx completes, retrieve the peer info from the endpoint addr context. If none is found, create a new peer. If found, set state to DISCONNECT and cancel pending_recvs. Set the peer state to INIT. Call mx_iconnect() and set the state to WAIT. When the mx_iconnect() request returns, send a CONN_ACK and set state to READY. For each tx and rx, add a reference for the peer. When any tx or rx completes, decrement the reference. If a request to cancel a msg sets the state to DISCONNECT, wait until all pending txs and rxs complete and decrement their references, set the state to INIT and start over.
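
The reference-count rule shared by client and server could be sketched as below. This is an illustration only: the bmx_* names are hypothetical, locking is omitted, and the early return to INIT when nothing is in flight is an assumption the text does not spell out.

```c
#include <assert.h>

enum bmx_peer_state { BMX_INIT, BMX_WAIT, BMX_READY, BMX_DISCONNECT };

struct bmx_peer {
    enum bmx_peer_state state;
    int refs;                      /* one per in-flight tx or rx */
};

/* Take a reference when a tx or rx is handed to MX. */
static void bmx_peer_addref(struct bmx_peer *p)
{
    /* Nothing new may be posted while draining. */
    assert(p->state != BMX_DISCONNECT);
    p->refs++;
}

/* Drop a reference when a tx or rx completes.
 * Returns 1 if the peer drained and went back to INIT. */
static int bmx_peer_decref(struct bmx_peer *p)
{
    assert(p->refs > 0);
    p->refs--;
    if (p->state == BMX_DISCONNECT && p->refs == 0) {
        p->state = BMX_INIT;
        return 1;
    }
    return 0;
}

/* A cancel request moves the peer to DISCONNECT.
 * Returns 1 if nothing was in flight and the peer is already INIT. */
static int bmx_peer_disconnect(struct bmx_peer *p)
{
    p->state = BMX_DISCONNECT;
    if (p->refs == 0) {
        p->state = BMX_INIT;
        return 1;
    }
    return 0;
}
```

A return value of 1 is the caller's cue to start over: post any queued_recvs and call mx_iconnect() again.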


BMI_mx_post_send_common()
        get idle tx
        lookup peer using peername
        if no peer
            /* should happen on client only */
            create peer
        assign msg_type, peer, tag, method_op, nob
        create match_info (msg_type | peer_id | tag)
        map segment(s)
        if unexpected
            ensure length < EAGER_SIZE
        switch peer state
            case READY
                add reference count on peer
                send tx
                break
            case INIT
                call mx_iconnect
                /* fall through */
            case WAIT
            case DISCONNECT
                append to queued_sends
                break
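
The state switch at the end of BMI_mx_post_send_common() might look like this in C. It is a sketch under assumptions: the function only decides what to do and updates counters, the bmx_* names are hypothetical, and the INIT case sets WAIT itself (as described under "Peer State on Client") instead of calling mx_iconnect().

```c
#include <assert.h>
#include <stddef.h>

enum bmx_peer_state { BMX_INIT, BMX_WAIT, BMX_READY, BMX_DISCONNECT };

enum bmx_send_action {
    BMX_SEND_NOW,           /* hand the tx to MX immediately */
    BMX_CONNECT_AND_QUEUE,  /* caller starts mx_iconnect(), tx is queued */
    BMX_QUEUE               /* tx is queued until the peer is READY */
};

struct bmx_peer {
    enum bmx_peer_state state;
    int refs;               /* in-flight txs and rxs */
    int queued_sends;       /* stand-in for the queued_sends list */
};

static enum bmx_send_action bmx_dispatch_send(struct bmx_peer *p)
{
    assert(p != NULL);

    switch (p->state) {
    case BMX_READY:
        p->refs++;
        return BMX_SEND_NOW;
    case BMX_INIT:
        p->state = BMX_WAIT;    /* connect is now in progress */
        p->queued_sends++;
        return BMX_CONNECT_AND_QUEUE;
    case BMX_WAIT:
    case BMX_DISCONNECT:
        p->queued_sends++;
        return BMX_QUEUE;
    }
    return BMX_QUEUE;           /* not reached for valid states */
}
```

Moving to WAIT inside the INIT case keeps a second send to the same peer from starting a second connect.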

BMI_mx_post_send()
        call BMI_mx_post_send_common()

BMI_mx_post_send_list()
        call BMI_mx_post_send_common()

BMI_mx_post_sendunexpected()
        call BMI_mx_post_send_common() with unexpected flag

BMI_mx_post_sendunexpected_list()
        call BMI_mx_post_send_common() with unexpected flag

BMI_mx_post_recv()
        get idle rx
        lookup peer using peername
        if no peer
            /* should happen on client only */
            create peer
        assign msg_type, peer, tag, method_op, nob
        create match_info (msg_type | peer_id | tag)
        map segment(s)
        if unexpected
            ensure length < EAGER_SIZE
        switch peer state
            case INIT
            case WAIT
            case READY
                add reference count on peer
                queue on pending_recvs
                post rx
                break
            case DISCONNECT
                /* we can't post it and add a ref if in DISCONNECT
                   because we need the ref count to go to 0 before
                   the state goes back to INIT */
                queue on queued_recvs
                break

BMI_mx_post_recv_list()
        call BMI_mx_post_recv()

BMI_mx_test()
        mx_test()

BMI_mx_testcontext()
        handle_conn_reqs()
        for 1 to incount
            dequeue from cancelled_reqs
            set outid, err, user_ptr
            queue idle tx/rx
        for completed to incount
            mx_test_any() with EXPECTED bit mask
            set outid, err, size, user_ptr
            if rx
                dequeue from pending_recvs
            queue idle tx/rx

BMI_mx_testunexpected()
        handle_conn_reqs()
        mx_test_any() with UNEXPECTED bit mask
        if found
            update UI struct
            queue idle rx
            return 1
        else
            return 0

handle_conn_reqs()
        do
            mx_test_any() with ICONN_CR or ICONN_CA bit mask
            switch type
                case ICONN_CR
                    if success
                        get idle tx
                        send CONN_REQ
                    else
                        set peer state to DISCONNECT
                        drop queued rxs and txs
                case ICONN_CA
                    if success
                        get idle tx
                        set peer state to READY
                        send CONN_ACK
                        send queued txs
                    else
                        set peer state to DISCONNECT
                        drop queued rxs and txs
        while (request returned)
        do
            mx_test_any() with CONN_REQ or CONN_ACK bit mask
            switch type
                case TX
                    handle CONN TX completion
                case RX
                    handle CONN RX completion
        while (request returned)
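
The mx_test_any() calls above each select requests by msg_type. A sketch of how the masks could be built, assuming msg_type sits in the top four match bits and that a request matches when (match_info & mask) == value. The numeric type values are hypothetical; they are chosen so that each connection pair differs only in its lowest bit and a single mask can cover both.

```c
#include <assert.h>
#include <stdint.h>

#define BMX_TYPE_SHIFT  60
#define BMX_TYPE_MASK   (UINT64_C(0xF) << BMX_TYPE_SHIFT)  /* exact type */
#define BMX_PAIR_MASK   (UINT64_C(0xE) << BMX_TYPE_SHIFT)  /* ignore low bit */

/* Hypothetical values: each pair shares its upper three type bits. */
enum bmx_msg_type {
    BMX_ICONN_CR   = 0x2,
    BMX_ICONN_CA   = 0x3,
    BMX_CONN_REQ   = 0x4,
    BMX_CONN_ACK   = 0x5,
    BMX_EXPECTED   = 0x6,
    BMX_UNEXPECTED = 0x8
};

/* Place a message type in the msg_type field of a match value. */
static uint64_t bmx_type_bits(enum bmx_msg_type t)
{
    assert((unsigned)t <= 0xF);
    return (uint64_t)t << BMX_TYPE_SHIFT;
}

/* Assumed matching rule: compare only the bits selected by mask. */
static int bmx_matches(uint64_t match_info, uint64_t value, uint64_t mask)
{
    return (match_info & mask) == value;
}
```

Under these assumptions, testing with value bmx_type_bits(BMX_ICONN_CR) and BMX_PAIR_MASK returns either ICONN_CR or ICONN_CA completions, while BMX_TYPE_MASK selects exactly one type, such as EXPECTED.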

handle CONN TX completion
        if failed
            set peer state to DISCONNECT
            drop queued rxs and txs
        put idle tx

handle CONN RX completion
        if CONN_REQ
            parse msg
            mx_iconnect() with ICONN_CA
            if the values don't match
                set peer state to DISCONNECT
        if CONN_ACK
            if success
                get my_id from match_info
                set peer state to READY
                send queued txs
            else
                set peer state to DISCONNECT
                drop pending rxs and txs
        put idle rx


BMI_mx_cancel()
        if rx
            mx_cancel(rx)
            if SUCCESS, return SUCCESS
            else
                mx_test(rx)
                if SUCCESS, return FAIL /* rx completed */
                else
                        set peer state to DISCONNECT
                        mx_disconnect()
                        cancel pending_recvs
        else /* tx */
            set peer state to DISCONNECT
            mx_disconnect()
            cancel pending_recvs

BMI_mx_method_addr_lookup()
        parse id
        lookup peer in peers list
        if !found
            create a new peer
        return method_addr *
        

BMI_mx_rev_lookup()
        return peer's peername

BMI_mx_set_info() /* drop_addr (probe for unmatched, expected messages and drop them) */

BMI_mx_get_info() /* unexpected size, drop_addr (probe for unmatched, expected messages and drop them) */

BMI_mx_initialize()
        alloc global peer state
        alloc pool of rxs and txs
        mx_init()
        mx_open_endpoint()
        mx_register_unexp_handler()

BMI_mx_finalize()
        mx_wakeup()
        mx_finalize()

BMI_mx_memalloc()               /* malloc() */
BMI_mx_memfree()                /* free() */
BMI_mx_open_context()           /* return 0 */
BMI_mx_close_context()          /* return 0 */
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
