Jack:
Thanks for adding this new function; this is what we need. There is one issue I want to make clear.

This new "kernel"-owned QP "will be destroyed when the XRC domain is closed (i.e., as part of an ibv_close_xrc_domain call, but only when the domain's reference count goes to zero)."

If I have MPI server processes on a node, many other MPI client processes will dynamically
connect to and disconnect from the server. The server uses the same XRC domain the whole time.

Will this cause the "kernel" QPs to accumulate for such an application? We want the server to
run 365 days a year.


Thanks.
--CQ




-----Original Message-----
From: Pavel Shamis (Pasha) [mailto:pa...@dev.mellanox.co.il]
Sent: Thursday, December 20, 2007 9:15 AM
To: Jack Morgenstein
Cc: Tang, Changqing; Roland Dreier; gene...@lists.openfabrics.org; Open MPI Developers; mvapich-disc...@cse.ohio-state.edu
Subject: Re: [ofa-general] [RFC] XRC -- make receiving XRC QP independent of any one user process

Adding the Open MPI and MVAPICH communities to the thread.

Pasha (Pavel Shamis)

Jack Morgenstein wrote:
background: see the "XRC Cleanup order issue thread" at

http://lists.openfabrics.org/pipermail/general/2007-December/043935.html

(A userspace process which created the receiving XRC QP on a given host dies before
other processes which still need to receive XRC messages on their SRQs which are
"paired" with the now-destroyed receiving XRC QP.)

Solution: Add a userspace verb (as part of the XRC suite) which enables the user
process to create an XRC QP owned by the kernel -- one which belongs to the
required XRC domain.

This QP will be destroyed when the XRC domain is closed (i.e., as part of an
ibv_close_xrc_domain call, but only when the domain's reference count goes to
zero).

Below, I give the new userspace API for this function.  Any feedback will be
appreciated.  This API will be implemented in the upcoming OFED 1.3 release,
so we need feedback ASAP.

Notes:
1. There is no query or destroy verb for this QP. There is also no userspace
   object for the QP. Userspace has ONLY the raw qp number to use when
   creating the (X)RC connection.

2. Since the QP is "owned" by kernel space, async events for this QP are also
   handled in kernel space (i.e., reported in /var/log/messages). There are no
   completion events for the QP, since it does not send, and all receive
   completions are reported in the XRC SRQ's CQ.

   If this QP enters the error state, the remote QP which sends will start
   receiving RETRY_EXCEEDED errors, so the application will be aware of the
   failure.
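
   (For illustration only -- a rough sketch of how the remote sending side
   might plug this raw qp number into its own INIT->RTR transition, as in an
   ordinary RC connect, per "the sending XRC QP uses this QP as destination"
   below.  This is plain ibv_modify_qp; the XRC QP type, the SRQ setup, and
   the sender's subsequent RTR->RTS transition are assumed and not shown, and
   all numeric values are placeholders.)

   #include <string.h>
   #include <infiniband/verbs.h>

   static int connect_to_xrc_rcv_qp(struct ibv_qp *send_qp,
                                    uint32_t xrc_rcv_qpn, /* from the peer */
                                    uint16_t dlid, uint8_t port)
   {
           struct ibv_qp_attr attr;

           memset(&attr, 0, sizeof attr);
           attr.qp_state           = IBV_QPS_RTR;
           attr.path_mtu           = IBV_MTU_1024;
           attr.dest_qp_num        = xrc_rcv_qpn;  /* the kernel-owned receive QP */
           attr.rq_psn             = 0;
           attr.max_dest_rd_atomic = 1;
           attr.min_rnr_timer      = 12;
           attr.ah_attr.dlid       = dlid;
           attr.ah_attr.port_num   = port;

           return ibv_modify_qp(send_qp, &attr,
                                IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                                IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                                IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER);
   }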

- Jack

======================================================================================
/**
 * ibv_alloc_xrc_rcv_qp - creates an XRC QP for serving as a receive-side-only QP,
 *      and moves the created qp through the RESET->INIT and INIT->RTR transitions.
 *      (The RTR->RTS transition is not needed, since this QP does no sending.)
 *      The sending XRC QP uses this QP as destination, while specifying an XRC SRQ
 *      for actually receiving the transmissions and generating all completions on
 *      the receiving side.
 *
 *      This QP is created in kernel space, and persists until the XRC domain is
 *      closed (i.e., its reference count goes to zero).
 *
 * @pd: protection domain to use.  At the lower layer, this provides access to the
 *      userspace object.
 * @xrc_domain: xrc domain to use for the QP.
 * @attr: modify-qp attributes needed to bring the QP to RTR.
 * @attr_mask: bitmap indicating which attributes are provided in the attr struct.
 *      Used for validity checking.
 * @xrc_rcv_qpn: qp_num of the created QP (on success). To be passed to the remote
 *      node.  The remote node will use xrc_rcv_qpn in ibv_post_send when sending
 *      to XRC SRQs on this host in the same xrc domain.
 *
 * RETURNS: success (0), or a (negative) error value.
 */

int ibv_alloc_xrc_rcv_qp(struct ibv_pd *pd,
                      struct ibv_xrc_domain *xrc_domain,
                      struct ibv_qp_attr *attr,
                      enum ibv_qp_attr_mask attr_mask,
                      uint32_t *xrc_rcv_qpn);

Notes:

1. Although the kernel creates the qp in the kernel's own PD, we still need the
   PD parameter to determine the device.

2. I chose to use struct ibv_qp_attr, which is used in modify QP, rather than
   create a new structure for this purpose.  This also guards against API
   changes in the event that during development I notice that more modify-qp
   parameters must be specified for this operation to work.

3. Table of the ibv_qp_attr parameters showing what values to set:

struct ibv_qp_attr {
     enum ibv_qp_state       qp_state;               Not needed
     enum ibv_qp_state       cur_qp_state;           Not needed
             -- Driver starts from RESET and takes qp to RTR.
     enum ibv_mtu            path_mtu;               Yes
     enum ibv_mig_state      path_mig_state;         Yes
     uint32_t                qkey;                   Yes
     uint32_t                rq_psn;                 Yes
     uint32_t                sq_psn;                 Not needed
     uint32_t                dest_qp_num;            Yes -- this is the remote side QP
                                                     for the RC conn.
     int                     qp_access_flags;        Yes
     struct ibv_qp_cap       cap;                    Need only XRC domain.  Other caps
                                                     will use hard-coded values:
                                                       max_send_wr = 1;
                                                       max_recv_wr = 0;
                                                       max_send_sge = 1;
                                                       max_recv_sge = 0;
                                                       max_inline_data = 0;
     struct ibv_ah_attr      ah_attr;                Yes
     struct ibv_ah_attr      alt_ah_attr;            Optional
     uint16_t                pkey_index;             Yes
     uint16_t                alt_pkey_index;         Optional
     uint8_t                 en_sqd_async_notify;    Not needed (no SQ)
     uint8_t                 sq_draining;            Not needed (no SQ)
     uint8_t                 max_rd_atomic;          Not needed (no SQ)
     uint8_t                 max_dest_rd_atomic;     Yes -- total max outstanding RDMAs
                                                     expected for ALL SRQ destinations
                                                     using this receive QP.  (If you are
                                                     only using SENDs, this value can be 0.)
     uint8_t                 min_rnr_timer;          default - 0
     uint8_t                 port_num;               Yes
     uint8_t                 timeout;                Yes
     uint8_t                 retry_cnt;              Yes
     uint8_t                 rnr_retry;              Yes
     uint8_t                 alt_port_num;           Optional
     uint8_t                 alt_timeout;            Optional
};

4. Attribute mask bits to set:
     For RESET_to_INIT transition:
             IB_QP_ACCESS_FLAGS | IB_QP_PKEY_INDEX | IB_QP_PORT

     For INIT_to_RTR transition:
             IB_QP_AV | IB_QP_PATH_MTU |
             IB_QP_DEST_QPN | IB_QP_RQ_PSN | IB_QP_MIN_RNR_TIMER
        If you are using RDMA or atomics, also set:
             IB_QP_MAX_DEST_RD_ATOMIC






--
Pavel Shamis (Pasha)
Mellanox Technologies

