Template Version: @(#)sac_nextcase %I% %G% SMI
This information is Copyright 2009 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         More HCA Capabilities
    1.2. Name of Document Author/Supplier:
         Author:  Bill Taylor
    1.3. Date of This Document:
         29 January, 2009
4. Technical Description

More HCA Capabilities
---------------------

4.1 Background

This case adds interfaces to the InfiniBand Transport Framework (IBTF,
PSARC/2002/132 and follow-on cases) that expose further InfiniBand
(IB) Host Channel Adapter (HCA) capabilities. The items in this
proposal are: Work Request Registration, Receive Side Scaling,
Multiple Completion Handlers and SRQ with UD flag.

It is expected that all ULPs doing memory registration for the purpose
of RDMA, e.g. NFS-RDMA (PSARC/2007/347), SRP (SHARC/2002/458) and iSER
(PSARC/2008/395), will eventually switch to using Work Request
Registration. Receive Side Scaling together with Multiple Completion
Handlers is intended to boost IPonIB performance. SRQ with UD flag
would be used by IPonIB and possibly other UD OS bypass or management
clients.


4.2 Proposal

The proposal is to make additions to the IBTF Channel Interface (CI)
for HCA drivers and the Transport Interface (TI) for IB ULPs.

All interface additions and changes in this proposal have a
micro/patch binding.

Transport Interface (ON Consolidation Private)

    ibt_alloc_lkey(): new Allocate L_Key operation
    ibt_alloc_qp_range(): new Allocate range of QPs operation
    ibt_map_mem_area(): revised to support Registration WR
    ibt_modify_cq(): revise for multiple completion handlers
    ibt_query_cq(): revise for multiple completion handlers
    ibt_attr_flags_t: WR Registration flag, RSS flags
    ibt_cep_modify_flags_t: RSS flag
    ibt_chan_alloc_flags_t: RSS flag
    ibt_hca_attr_t: max PBL length, max RSS QP set size, max CQ handlers
    ibt_hca_flags(2)_t: WR Registration, RSS Algorithms, UD with SRQ flags
    ibt_rss_attr_t: RSS context state
    ibt_wc_t: new R_Key field, new RSS fields & flags
    ibt_wr_flags_t: remote R_Key invalidation flag
    ibt_wr_li_t: Local Invalidation WR
    ibt_wr_rc_t: new WR choices for RC
    ibt_wr_reg_pmr_t: Registration WR
    ibt_wrc_opcode_t: Registration, Local Invalidate opcodes

Channel Interface (ON Consolidation Private)

    ibc_operations_t: ibc_alloc_lkey, ibc_alloc_qp_range,
        ibc_map_mem_area (revised) entry points 
    ibt_attr_flags_t: WR Registration flag, RSS flags
    ibt_cep_modify_flags_t: RSS flag
    ibt_hca_attr_t: max PBL length, max RSS QP set size, max CQ handlers
    ibt_hca_flags(2)_t: WR Registration, RSS Algorithms, UD with SRQ flags
    ibt_qp_alloc_flags_t: RSS flag
    ibt_rss_attr_t: RSS context state
    ibt_wc_t: new R_Key field, new RSS fields & flags
    ibt_wr_flags_t: remote R_Key invalidation flag
    ibt_wr_li_t: Local Invalidation WR
    ibt_wr_rc_t: new WR choices for RC
    ibt_wr_reg_pmr_t: Registration WR
    ibt_wrc_opcode_t: Registration, Local Invalidate opcodes

All these changes are part of the v3 IBTF ABI first introduced in
PSARC/2008/630.

Copies of all modified/added man pages are in the materials directory
(see section 4.3 below).


A. Work Request Registration

In IB, "memory registration" adds information about memory buffers to
the HCA state. In the 1.2 IB spec, a new form of registration was
added, where the operation could be carried out through work requests
(WRs) on Queue Pairs (QPs), instead of using synchronous function
calls. The main idea is that performance of registration could be
improved by changing it into a WR, allowing asynchronous pipelined
processing through the QP.

In this style of memory registration, the process is effectively (a)
reserve space in the HCA memory tables, (b) map memory and put the
mapping information into WRs and (c) post the WRs on the
QP. Subsequent WRs can safely refer to the registered memory because
of IB rules on QP processing order.
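
The ordering argument behind steps (a) through (c) can be sketched
with a small self-contained model. Nothing below is IBTF code: the
toy_ types and functions are hypothetical stand-ins, and the only
property modeled is that a QP processes WRs in posting order, so a
send posted after a registration WR finds the key valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy model of in-order WR processing, not the IBTF API. */
enum toy_op { TOY_REG, TOY_SEND, TOY_LOCAL_INVAL };

struct toy_wr {
	enum toy_op	op;
	uint32_t	lkey;	/* key the WR registers, uses or invalidates */
};

#define	TOY_MAX_KEYS	8

struct toy_hca {
	bool	key_valid[TOY_MAX_KEYS];	/* slots reserved in step (a) */
};

/*
 * Process WRs strictly in posting order, as a QP does.  Returns the
 * number of sends that found their key valid when they were processed.
 */
size_t
toy_process(struct toy_hca *hca, const struct toy_wr *wrs, size_t n)
{
	size_t ok = 0;

	assert(hca != NULL && wrs != NULL);
	for (size_t i = 0; i < n; i++) {
		uint32_t k = wrs[i].lkey % TOY_MAX_KEYS;

		switch (wrs[i].op) {
		case TOY_REG:
			hca->key_valid[k] = true;
			break;
		case TOY_LOCAL_INVAL:
			hca->key_valid[k] = false;
			break;
		case TOY_SEND:
			if (hca->key_valid[k])
				ok++;
			break;
		}
	}
	return (ok);
}
```

Posting register, send, invalidate, send for the same key yields one
successful send in this model: the first send follows the
registration and the second follows the invalidation.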

Further, the 1.2 spec adds WRs to "invalidate" memory regions, which
disables access to those memory regions; so access can now also be
shut off locally through QP operations. The same invalidation can be
requested from the remote side through a new option to the existing
"Send" operation (if permission checks are passed).

So the full interface addition includes not only flags and attributes
describing the function, but also new WRs and supporting functions for
steps (a) and (b) above. Step (b) is accomplished through revisions to
ibc_map_mem_area(9E) and ibt_map_mem_area(9F) (originally introduced
in PSARC/2005/546) to better fit with this style of operation.

New flag for Work Request Registration support in ibt_hca_flags2_t:
    IBT_HCA2_MEM_MGT_EXT = 1 << 10, /* FMR-WR, send-inv, local-inv */

New field for the maximum Physical Buffer List (PBL) length in ibt_hca_attr_t:
    uint_t      hca_max_phys_buf_list_sz;
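
A client would presumably test the flag and the PBL limit before
choosing this registration style. The following is a minimal sketch
under stated assumptions: the struct is a hypothetical cut-down
stand-in for ibt_hca_attr_t, the field name hca_flags2 is assumed,
and sketch_can_use_wr_reg() is not an IBTF function. Only the flag
value and hca_max_phys_buf_list_sz come from this proposal.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Value taken from the proposal text. */
#define	IBT_HCA2_MEM_MGT_EXT	(1 << 10)

/* Hypothetical stand-in for ibt_hca_attr_t; only two fields are modeled. */
struct sketch_hca_attr {
	unsigned int	hca_flags2;			/* assumed field name */
	unsigned int	hca_max_phys_buf_list_sz;	/* from the proposal */
};

/*
 * Return true if the HCA advertises WR registration and a buffer of
 * npages pages fits in one Physical Buffer List.
 */
bool
sketch_can_use_wr_reg(const struct sketch_hca_attr *attr, unsigned int npages)
{
	assert(attr != NULL);
	if ((attr->hca_flags2 & IBT_HCA2_MEM_MGT_EXT) == 0)
		return (false);
	return (npages > 0 && npages <= attr->hca_max_phys_buf_list_sz);
}
```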

New Allocate L_Key operation to reserve space in the HCA memory tables:
    ibt_status_t ibt_alloc_lkey(ibt_hca_hdl_t hca_hdl,          /* TI */
        ibt_pd_hdl_t pd, ibt_lkey_flags_t flags,
        uint_t phys_buf_list_sz, ibt_mr_hdl_t *mr_p,
        ibt_pmr_desc_t *mem_desc_p);

    ibt_status_t prefix_ibc_alloc_lkey(ibc_hca_hdl_t hca_hdl,   /* CI */
        ibc_pd_hdl_t pd, ibt_lkey_flags_t flags,
        uint_t phys_buf_list_sz, ibc_mr_hdl_t *mr_p,
        ibt_pmr_desc_t *mem_desc_p); 

Revised operations for mapping memory (to better fit this style of operation):
    ibt_status_t ibt_map_mem_area(ibt_hca_hdl_t hca_hdl,        /* TI */
        ibt_va_attr_t *va_attrs, uint_t paddr_list_len,
        ibt_reg_req_t *reg_req, ibt_ma_hdl_t *ma_hdl_p);

    ibt_status_t prefix_ibc_map_mem_area(ibc_hca_hdl_t hca_hdl, /* CI */
        ibt_va_attr_t *va_attrs, void *ibtl_reserved,
        uint_t paddr_list_len, ibt_reg_req_t *reg_req,
        ibc_ma_hdl_t *ma_hdl_p);

    NOTE: the matching "unmap" operations are not altered and remain
    as originally defined in PSARC/2005/546.

New choices for Reliable Connected transport WRs (union for ibt_wr_rc_t):
    ibt_wr_reg_pmr_t    *reg_pmr;       /* WR Registration */
    ibt_wr_li_t         *li;            /* Local Invalidate */
    ibt_rkey_t          send_inval;     /* R_Key for Send w/invalidate */

New opcodes for new WR types (for ibt_wrc_opcode_t):
    #define     IBT_WRC_FAST_REG_PMR     9      /* Fast Register */
    #define     IBT_WRC_LOCAL_INVALIDATE 10     /* Invalidate Memory Region */

New flag for Remote Invalidate option for Send operation (in ibt_wr_flags_t):
    #define     IBT_SEND_REMOTE_INVAL   (1 << 4)

New Work Request type to register memory:
    typedef struct ibt_wr_reg_pmr_s {
        ib_vaddr_t      pmr_iova;       /* memory region virtual address */
        ib_memlen_t     pmr_len;        /* memory region length in bytes */
        ib_memlen_t     pmr_offset;     /* first byte offset in first page */
        ibt_mr_hdl_t    pmr_mr_hdl;     /* memory region handle */
        size_t          pmr_buf_sz;     /* page size */
        uint_t          pmr_num_buf;    /* list length */
        ibt_lkey_t      pmr_lkey;       /* L_Key for memory region */
        ibt_rkey_t      pmr_rkey;       /* R_Key for memory region */
        ibt_mr_flags_t  pmr_flags;      /* flags */
        uint8_t         pmr_key;        /* new low bits for L/R_Key */
    } ibt_wr_reg_pmr_t;
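
The geometry fields of this WR follow from the buffer address, its
length and the page size. The sketch below shows that arithmetic with
plain C types. The struct is a hypothetical subset of
ibt_wr_reg_pmr_t and both functions are illustrative helpers, not
IBTF interfaces. Reading pmr_key as a replacement for the low 8 bits
of the key is an assumption based on the field comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of ibt_wr_reg_pmr_t, using plain C types. */
struct sketch_reg_pmr {
	uint64_t	pmr_iova;	/* memory region virtual address */
	uint64_t	pmr_len;	/* memory region length in bytes */
	uint64_t	pmr_offset;	/* first byte offset in first page */
	size_t		pmr_buf_sz;	/* page size */
	unsigned int	pmr_num_buf;	/* list length */
};

/* Fill the geometry fields; pgsz must be a power of two. */
void
sketch_fill_geometry(struct sketch_reg_pmr *wr, uint64_t va, uint64_t len,
    size_t pgsz)
{
	assert(wr != NULL && len > 0);
	assert(pgsz != 0 && (pgsz & (pgsz - 1)) == 0);

	wr->pmr_iova = va;
	wr->pmr_len = len;
	wr->pmr_offset = va & (pgsz - 1);
	wr->pmr_buf_sz = pgsz;
	/* Pages touched by [va, va + len), rounding the tail up. */
	wr->pmr_num_buf =
	    (unsigned int)((wr->pmr_offset + len + pgsz - 1) / pgsz);
}

/* Assumed meaning of pmr_key: new low 8 bits for an L_Key or R_Key. */
uint32_t
sketch_rekey(uint32_t key, uint8_t new_low)
{
	return ((key & ~(uint32_t)0xFF) | new_low);
}
```

The resulting pmr_num_buf is the value that has to fit within
hca_max_phys_buf_list_sz.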

New Work Request type for local invalidation:
    typedef struct ibt_wr_li_s {
        ibt_mr_hdl_t    li_mr_hdl;      /* memory region handle */
        ibt_mw_hdl_t    li_mw_hdl;      /* for future mem window invalidates */
        ibt_lkey_t      li_lkey;        /* L_Key for memory region */
        ibt_rkey_t      li_rkey;        /* R_Key for memory region */
    } ibt_wr_li_t;

New field in the ibt_wc_t (completion) struct holding the R_Key that was
invalidated:
    ibt_rkey_t          wc_rkey;

Man page changes: ibci.9, ibti.9, ibc_alloc_lkey.9e,
ibc_map_mem_area.9e, ibt_alloc_lkey.9f, ibt_map_mem_area.9f,
ibc_operations_t.9s, ibt_hca_attr_t.9s, ibt_send_wr_t.9s, ibt_wc_t.9s,
ibt_wr_rc_t.9s


B. Receive Side Scaling

Another stateless offload for IPonIB usage is Receive Side Scaling
(RSS). The idea here is to change the regular IPonIB QP into an RSS
"context". Messages received at the RSS context have their headers
hashed and then, based on the hash values, the messages are
distributed to a set of QPs. (If there is no hash match, the message
is passed to the "default" QP.) The QPs in the set occupy a consecutive
range of QP numbers, starting from a "base" QPN. Performance can be
increased because messages on each of the QPs can be processed in
parallel (see also section C below).
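
The distribution step can be sketched as follows. This is a model
under assumptions, not HCA behavior taken from the spec: the struct
is a hypothetical subset of the RSS context state, and using the low
bits of the hash as the index into the QP set is an assumption (the
text above only says that messages are distributed based on the hash
values).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of the RSS context state, using plain C types. */
struct sketch_rss {
	unsigned int	rss_log2_table;	/* log2 of the QP set size */
	uint32_t	rss_base_qpn;	/* first QPN of the set */
	uint32_t	rss_def_qpn;	/* QPN used when nothing matches */
};

/*
 * Pick the destination QPN for one received message.  "matched" says
 * whether the headers matched one of the enabled hash types.
 */
uint32_t
sketch_rss_pick_qpn(const struct sketch_rss *rss, bool matched, uint32_t hash)
{
	uint32_t mask;

	assert(rss != NULL && rss->rss_log2_table < 24);
	if (!matched)
		return (rss->rss_def_qpn);
	/* Assumption: the low bits of the hash index the QP set. */
	mask = ((uint32_t)1 << rss->rss_log2_table) - 1;
	return (rss->rss_base_qpn + (hash & mask));
}
```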

New flags for support of RSS hash algorithms in ibt_hca_flags2_t:
    IBT_HCA2_RSS_TPL_ALG        = 1 << 6,       /* RSS: Toeplitz algorithm */
    IBT_HCA2_RSS_XOR_ALG        = 1 << 7,       /* RSS: XOR algorithm */

New field for max size (in log2) of an RSS QP set in ibt_hca_attr_t:
    uint8_t     hca_rss_max_log2_table; /* max RSS log2 table size */

New flags to allocate and modify an RSS context:
    IBT_QP_USES_RSS     = (1 << 3)      /* CI: in ibt_qp_alloc_flags_t */
    IBT_ACHAN_USES_RSS  = (1 << 4)      /* TI: in ibt_chan_alloc_flags_t */

    IBT_CEP_SET_RSS     = (1 << 24)     /* CI & TI in ibt_cep_modify_flags_t */

New structs to set/query the values of an RSS context:
    typedef enum ibt_rss_flags_e {
        IBT_RSS_ALG_TPL         = (1 << 0),     /* RSS: Toeplitz hash */
        IBT_RSS_ALG_XOR         = (1 << 1),     /* RSS: XOR hash */
        IBT_RSS_HASH_IPV4       = (1 << 2),     /* RSS: hash IPv4 headers */
        IBT_RSS_HASH_IPV6       = (1 << 3),     /* RSS: hash IPv6 headers */
        IBT_RSS_HASH_TCP_IPV4   = (1 << 4),     /* RSS: hash TCP/IPv4 hdrs */
        IBT_RSS_HASH_TCP_IPV6   = (1 << 5)      /* RSS: hash TCP/IPv6 hdrs */
    } ibt_rss_flags_t;

    typedef struct ibt_rss_attr_s {
        ibt_rss_flags_t rss_flags;              /* RSS: flags */
        uint_t          rss_log2_table;         /* RSS: log2 table size */
        ib_qpn_t        rss_base_qpn;           /* RSS: base QPN for range */
        ib_qpn_t        rss_def_qpn;            /* RSS: default QPN */
        uint8_t         rss_toe_key[40];        /* RSS: Toeplitz hash key */
    } ibt_rss_attr_t;

    The ibt_rss_attr_t struct appears in ibt_qp_ud_attr_t for CI QP
    query and modify operations. For the TI, the struct appears in
    ibt_ud_chan_alloc_args_t (for UD channel alloc),
    ibt_ud_chan_query_attr_t (for query) and in
    ibt_ud_chan_modify_attr_t (for modify).

New operation to allocate a range of UD QPs w/ consecutive aligned QP numbers:
    ibt_status_t ibt_alloc_ud_channel(ibt_hca_hdl_t hca_hdl,    /* TI */
        ibt_chan_alloc_flags_t flags, ibt_ud_chan_alloc_args_t *args,
        ibt_channel_hdl_t *ud_chan_p, ibt_chan_sizes_t *sizes);

    ibt_status_t prefix_ibc_alloc_qp_range(ibc_hca_hdl_t hca,   /* CI */
        uint_t log2, ibtl_qp_hdl_t *ibtl_qp_p, ibt_qp_type_t type,
        ibt_qp_alloc_attr_t *attr_p, ibt_chan_sizes_t *queue_sizes_p,
        ibc_cq_hdl_t *send_cq_p, ibc_cq_hdl_t *recv_cq_p,
        ib_qpn_t *qpn_p, ibc_qp_hdl_t *qp_p);
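
A caller or driver could sanity-check a requested range along these
lines. This is a sketch only: sketch_qp_range_ok() is a hypothetical
helper, and reading "aligned" as "the base QPN is a multiple of the
set size" is an assumption. The max_log2 argument corresponds to the
hca_rss_max_log2_table attribute above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Check a range of (1 << log2) consecutive QPNs starting at base_qpn
 * against the HCA limit and the assumed alignment rule.
 */
bool
sketch_qp_range_ok(uint32_t base_qpn, unsigned int log2,
    unsigned int max_log2)
{
	uint32_t size;

	assert(max_log2 < 24);	/* QPNs are 24 bits wide */
	if (log2 > max_log2)
		return (false);
	size = (uint32_t)1 << log2;
	/* Assumption: aligned means base_qpn is a multiple of the size. */
	return ((base_qpn & (size - 1)) == 0);
}
```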

New field in Work Completion for the RSS hash value and flags in ibt_wc_t:
    uint32_t            wc_res_hash;    /* RSS 32-bit hash value */

    #define IBT_WC_DETAIL_RSS_MATCH_MASK  (0x003F0000)  /* wc_detail flags */
    #define IBT_WC_DETAIL_RSS_TCP_IPV6    (1 << 18)
    #define IBT_WC_DETAIL_RSS_IPV6        (1 << 19)
    #define IBT_WC_DETAIL_RSS_TCP_IPV4    (1 << 20)
    #define IBT_WC_DETAIL_RSS_IPV4        (1 << 21)
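
A receive path might classify a completion from these flags as
sketched below. The #define values are copied from this proposal.
Everything else is hypothetical: the struct models only two ibt_wc_t
fields with plain C types, the enum and function are illustrative,
and checking the TCP flags before the plain IP flags is an assumed
priority in case more than one bit is set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the proposal text. */
#define	IBT_WC_DETAIL_RSS_MATCH_MASK	(0x003F0000)
#define	IBT_WC_DETAIL_RSS_TCP_IPV6	(1 << 18)
#define	IBT_WC_DETAIL_RSS_IPV6		(1 << 19)
#define	IBT_WC_DETAIL_RSS_TCP_IPV4	(1 << 20)
#define	IBT_WC_DETAIL_RSS_IPV4		(1 << 21)

/* Hypothetical two-field stand-in for ibt_wc_t. */
struct sketch_wc {
	uint32_t	wc_detail;	/* detail flags */
	uint32_t	wc_res_hash;	/* RSS 32-bit hash value */
};

enum sketch_rss_match {
	SKETCH_RSS_NONE,
	SKETCH_RSS_IPV4,
	SKETCH_RSS_TCP_IPV4,
	SKETCH_RSS_IPV6,
	SKETCH_RSS_TCP_IPV6
};

/* Report which header type, if any, the RSS hash was computed over. */
enum sketch_rss_match
sketch_rss_classify(const struct sketch_wc *wc)
{
	uint32_t m;

	assert(wc != NULL);
	m = wc->wc_detail & IBT_WC_DETAIL_RSS_MATCH_MASK;
	if (m & IBT_WC_DETAIL_RSS_TCP_IPV6)
		return (SKETCH_RSS_TCP_IPV6);
	if (m & IBT_WC_DETAIL_RSS_TCP_IPV4)
		return (SKETCH_RSS_TCP_IPV4);
	if (m & IBT_WC_DETAIL_RSS_IPV6)
		return (SKETCH_RSS_IPV6);
	if (m & IBT_WC_DETAIL_RSS_IPV4)
		return (SKETCH_RSS_IPV4);
	return (SKETCH_RSS_NONE);
}
```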

Changed man pages: ibci.9, ibti.9, ibc_alloc_qp.9e,
ibc_alloc_qp_range.9e, ibc_modify_qp.9e, ibt_alloc_ud_channel.9f,
ibt_alloc_ud_channel_range.9f, ibt_modify_ud_channel.9f,
ibt_query_ud_channel.9f, ibc_operations_t.9s, ibc_qp_info_t.9s,
ibt_hca_attr_t.9s, ibt_rss_attr_t.9s, ibt_wc_t.9s


C. Multiple Completion Handlers

Load spreading schemes like RSS are usually coupled with a method of
binding interrupts to multiple CPUs. In the IB 1.2 spec, this concept
is represented by the "multiple completion handlers" feature. In IB
terms, each Completion Queue (CQ) can be bound to a specified
completion handler to be used for completion notification. While the
spec does not require handlers to be distinct MSI-X vectors spread out
over the available CPUs, this is the most obvious mapping for most
platforms.

In the particular case of IB-RSS, each QP in the RSS set would be
bound to a different CQ (or perhaps to two different CQs, one each for
send and receive). Each CQ in turn could be bound to any of the
available completion handlers. The version of "multiple completion
handlers" implemented here goes slightly beyond the IB spec in that
the assignment of CQs to completion handlers can be adjusted
dynamically for load balancing purposes.

New field in ibt_hca_attr_t to show how many completion handlers are on an HCA:
    uint_t              hca_max_cq_handlers; /* zero = no multiple handlers */

New field (last arg) in query CQ to show the current handler:
    ibt_status_t ibt_query_cq(ibt_cq_hdl_t ibt_cq, uint_t *entries,  /* TI */
        uint_t *count_p, uint_t *usec_p, ibt_cq_handler_id_t *hid_p);

    ibt_status_t prefix_ibc_query_cq(ibc_hca_hdl_t hca, ibc_cq_hdl_t cq,
        uint_t *entries, uint_t *count_p, uint_t *usec_p,        /* CI */
        ibt_cq_handler_id_t *hid_p);

    Note: previously, this last parameter was reserved. A returned
    value of zero means that multiple completion handlers are not
    supported. The value IBT_CQ_HID_DEFAULT is the default handler
    given to CQs when initially allocated.

New field (last arg) to modify the CQ to completion handler assignment:
    ibt_status_t ibt_modify_cq(ibt_cq_hdl_t ibt_cq, uint_t count,  /* TI */
        uint_t usec, ibt_cq_handler_id_t hid);
 
    ibt_status_t prefix_ibc_modify_cq(ibc_hca_hdl_t hca, ibc_cq_hdl_t cq,
        uint_t count, uint_t usec, ibt_cq_handler_id_t hid);    /* CI */

    Note: previously, this last parameter was reserved. A value of zero
    means no change. The value IBT_CQ_HID_DEFAULT selects the default
    handler.
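
One simple policy for spreading the CQs of an RSS set over the
available handlers is round-robin, sketched below. This is an
illustration, not part of the proposal: sketch_assign_handlers() is
hypothetical, handler IDs are modeled as plain unsigned integers, and
the assumption that valid IDs run from 1 to hca_max_cq_handlers is
not stated in this document (zero is avoided because it means "no
change" to the modify operation).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Choose a handler ID for each of ncq CQs, round-robin.  Returns
 * false when the HCA reports no multiple-handler support
 * (max_handlers == 0), in which case hid_out is left untouched.
 */
bool
sketch_assign_handlers(unsigned int max_handlers, unsigned int *hid_out,
    size_t ncq)
{
	assert(hid_out != NULL || ncq == 0);
	if (max_handlers == 0)
		return (false);
	for (size_t i = 0; i < ncq; i++) {
		/* Assumed ID range: 1 .. max_handlers. */
		hid_out[i] = 1 + (unsigned int)(i % max_handlers);
	}
	return (true);
}
```

Each chosen ID would then be passed as the last argument of the
modify CQ operation for the corresponding CQ. Rebalancing later is a
matter of computing a new assignment and issuing the modify calls
again.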

Changed man pages: ibc_modify_cq.9e, ibc_query_cq.9e,
ibt_modify_cq.9f, ibt_query_cq.9f, ibt_hca_attr_t.9s.


D. SRQ with UD flag

Shared Receive Queues (SRQs) were originally introduced in
PSARC/2004/611 and then in uDAPL (PSARC/2004/737). While the IB spec
[1] says that when SRQ is available it should be supported on both the
Reliable Connected (RC) and Unreliable Datagram (UD) services, we have
found an adapter which has SRQ but omits support for UD. (Note that
the usage in uDAPL is with RC.) So we are adding a flag that lets UD
applications which may want to use SRQ determine whether the HCA
supports SRQ with UD. The original SRQ flag now means only that SRQ is
supported with RC.

New flags added to ibt_hca_flags_t:
        IBT_HCA_RC_SRQ          = IBT_HCA_SRQ,  /* RC with SRQ */
        IBT_HCA_UD_SRQ          = 1 << 19       /* UD with SRQ */
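
A per-transport check against these flags might look like the sketch
below. The value of IBT_HCA_UD_SRQ is taken from this proposal. The
value of the existing IBT_HCA_SRQ flag is not given here, so the
sketch uses a clearly named placeholder bit instead of guessing it.
The enum and function are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value taken from the proposal text. */
#define	IBT_HCA_UD_SRQ		(1 << 19)

/*
 * Placeholder for IBT_HCA_RC_SRQ (== IBT_HCA_SRQ).  The real bit is
 * not stated in this document; this value is arbitrary.
 */
#define	SKETCH_HCA_RC_SRQ	(1 << 0)

enum sketch_transport { SKETCH_RC, SKETCH_UD };

/* Return true if the HCA flags advertise SRQ for the given transport. */
bool
sketch_srq_supported(uint32_t hca_flags, enum sketch_transport t)
{
	assert(t == SKETCH_RC || t == SKETCH_UD);
	if (t == SKETCH_RC)
		return ((hca_flags & SKETCH_HCA_RC_SRQ) != 0);
	return ((hca_flags & IBT_HCA_UD_SRQ) != 0);
}
```

An adapter like the one described above would set only the RC flag,
so a UD client sees the lack of support and falls back to per-QP
receive queues.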

Changed man pages: ibt_hca_attr_t.9s


4.3 Summary of changes by man page

Man Page                        Disposition     Reasons for change
------------------------------------------------------------------
ibci.9                          changed         A, B
ibti.9                          changed         A, B

ibc_alloc_lkey.9e               new             A
ibc_alloc_qp.9e                 changed         B
ibc_alloc_qp_range.9e           new             B
ibc_modify_cq.9e                changed         C
ibc_modify_qp.9e                changed         B
ibc_map_mem_area.9e             changed         A
ibc_query_cq.9e                 changed         C

ibt_alloc_lkey.9f               new             A
ibt_alloc_ud_channel.9f         changed         B
ibt_alloc_ud_channel_range.9f   new             B
ibt_map_mem_area.9f             changed         A
ibt_modify_cq.9f                changed         C
ibt_modify_ud_channel.9f        changed         B
ibt_query_cq.9f                 changed         C
ibt_query_ud_channel.9f         changed         B

ibc_operations_t.9s             changed         A, B
ibc_qp_info_t.9s                changed         B
ibt_hca_attr_t.9s               changed         A, B, C, D
ibt_rss_attr_t.9s               new             B
ibt_send_wr_t.9s                changed         A
ibt_wc_t.9s                     changed         A, B
ibt_wr_rc_t.9s                  changed         A


4.4 References

[1] InfiniBand Architecture Specification Volume 1, Release
    1.2.1. InfiniBand Trade Association, 2007.

    http://www.infinibandta.org/members/spec/V1r1_2_1.Release_12062007.zip
    (requires IBTA member login)


6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

