This information is Copyright 2008 Sun Microsystems
1. Introduction
    1.1. Project/Component Working Name:
         New HCA Capabilities
    1.2. Name of Document Author/Supplier:
         Author:  Bill Taylor
    1.3. Date of This Document:
         21 November, 2008
4. Technical Description

New HCA Attributes & Capabilities
---------------------------------

4.1. Background

Over time, InfiniBand (IB) Host Channel Adapters (HCAs) have
accumulated a number of new features. Further, IB software has placed
new demands on HCAs. In order to accommodate these developments, this
proposal defines a series of new interfaces for the InfiniBand
Transport Framework (IBTF, PSARC/2002/132 and follow-on cases). The
items in this proposal are: firmware version information, SQD flag,
inline data, Reserved L_Key, LSO, CQ Moderation, IP classification
flags and detailed WR sizes.

4.2. Proposal

The proposal is to make additions to the IBTF Channel Interface (CI)
and Transport Interface (TI). The functionality added to the TI is for
use by IB clients. CI changes affect the framework interface to HCA
drivers.

All interface additions and changes in this proposal have a
micro/patch binding.

Transport Interface (ON Consolidation Private)

    ibt_map_mem_iov(): new map IOV function
    ibt_unmap_mem_iov(): new unmap IOV function
    ibt_modify_cq(): new CQ moderation call
    ibt_query_cq(): changed function signature
    ibt_attr_flags_t: Reserved L_Key permission flag, LSO flag
    ibt_chan_sizes_t: inline size
    ibt_hca_attr_t: firmware version fields, SQD flag, inline data size,
        Reserved L_Key flag & value, LSO max payload & hdr size, 
        CQ moderation max params, IP classification support flag,
        "detailed" WQE size info
    ibt_send_wr_t, ibt_wr_lso_t: LSO work request format
    ibt_status_t: WQE too small error
    ibt_wc_t: IP classification flags
    ibt_wr_flags_t: inline flag
    ibt_wrc_opcode_t: LSO opcode

Channel Interface (ON Consolidation Private)
 
    ibc_operations_t: ibc_modify_cq entry point, ibc_map_mem_iov entry point,
        changed ibc_query_cq signature, ibc_unmap_mem_iov entry point
    ibt_attr_flags_t: Reserved L_Key permission flag, LSO flag
    ibt_chan_sizes_t: inline size
    ibt_hca_attr_t: firmware version fields, SQD flag, inline data size,
        Reserved L_Key flag & value, LSO max payload & hdr size, 
        CQ moderation max params, IP classification support flag,
        "detailed" WQE size info
    ibt_send_wr_t, ibt_wr_lso_t: LSO work request format
    ibt_status_t: WQE too small error 
    ibt_wc_t: IP classification flags
    ibt_wr_flags_t: inline flag
    ibt_wrc_opcode_t: LSO opcode

All these changes are part of the v3 IBTF ABI first introduced in
PSARC/2008/630.

Copies of all modified/added man pages are in the materials directory
(see section 4.3 below).

A. Firmware Version

Occasionally, some IB software wants to know the firmware version of
the HCA. Up to now, this information was only known to the HCA driver
and firmware flash tool. The firmware version information takes the
form of three new fields added to the ibt_hca_attr_t struct:

    uint32_t    hca_fw_major_version;
    uint16_t    hca_fw_minor_version;
    uint16_t    hca_fw_micro_version;
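
A client that requires a minimum firmware level could compare the
three fields in order of significance. The following is only a
sketch: fw_version_t and fw_version_at_least() are hypothetical names
used for illustration, not part of the IBTF interface. A real client
would read the fields from ibt_hca_attr_t.

```c
#include <stdint.h>

/* Hypothetical stand-in holding the three new ibt_hca_attr_t fields. */
typedef struct fw_version {
	uint32_t	hca_fw_major_version;
	uint16_t	hca_fw_minor_version;
	uint16_t	hca_fw_micro_version;
} fw_version_t;

/* Return nonzero if 'have' is at least major.minor.micro. */
static int
fw_version_at_least(const fw_version_t *have, uint32_t major,
    uint16_t minor, uint16_t micro)
{
	if (have->hca_fw_major_version != major)
		return (have->hca_fw_major_version > major);
	if (have->hca_fw_minor_version != minor)
		return (have->hca_fw_minor_version > minor);
	return (have->hca_fw_micro_version >= micro);
}
```

The comparison stops at the first field that differs, so a higher
major version always wins regardless of the minor and micro values.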

Man page changes: ibt_hca_attr_t.9s

B. SQD flag

The IBTF baseline for HCA functionality is version 1.1 of the IB
specification. Unfortunately, the ConnectX device (serviced by the
Hermon HCA driver, PSARC/2008/497) currently does not implement some
parts of the 1.1 standard: most notably the Send Queue Drain (SQD)
Queue Pair (QP) state (see section 10.3.1.5 in the IB spec [1]). So
IBTF will now explicitly show when the SQD state is available.

To show when SQD is implemented, this flag is added to ibt_hca_flags_t:

    IBT_HCA_SQD_STATE = 1 << 30,        /* SQD QP state */

If SQD is not supported, an attempt to transition to that state will
result in an IBT_CHAN_STATE_INVALID (TI) or IBT_QP_STATE_INVALID (CI)
error (already defined in the interface).
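
A client can avoid the error by testing the flag before it attempts
the transition. The sketch below assumes the HCA flags are available
as a 32-bit word; drain_method_t and choose_drain_method() are
hypothetical names, and the fallback strategy itself is left to the
client.

```c
#include <stdint.h>

/* Flag value from this proposal (unsigned here to keep the mask math clean). */
#define	IBT_HCA_SQD_STATE	(1u << 30)

/* Hypothetical: how a client might decide to quiesce a send queue. */
typedef enum drain_method {
	DRAIN_VIA_SQD,		/* HCA implements the SQD QP state */
	DRAIN_VIA_FALLBACK	/* client must use some other strategy */
} drain_method_t;

static drain_method_t
choose_drain_method(uint32_t hca_flags)
{
	if (hca_flags & IBT_HCA_SQD_STATE)
		return (DRAIN_VIA_SQD);
	return (DRAIN_VIA_FALLBACK);
}
```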

Man page changes: ibt_hca_attr_t.9s

C. Inline Data

IB work requests (WRs) include a scatter-gather list (SGL) of data
segments which reference the message payload. After posting, the WR is
stored in a device specific Work Queue Entry (WQE) format until
processing. Some HCAs support an "inline data" WQE format where the
SGL portion of the WQE is replaced by the actual payload data.

Using inline data can improve performance, since fetching the WQE also
retrieves the payload, eliminating subsequent DMAs to reference
through the WQE SGL. Further, data stored inline does not have to be
in registered memory, possibly removing a memory registration step.

There are limitations on this technique. Typically a WQE cannot be
very large, so this only works for small messages. Also the data must
be available at posting time, so there cannot be a data dependency
between WRs on a QP. Finally, this technique is limited to SEND and
RDMA-WRITE operations.

New field for HCA maximum size of inline data in ibt_hca_attr_t:
    uint_t      hca_max_inline_size;    /* in bytes */

    A value of zero indicates there is no inline data support.

New field for QP-specific size of inline data in ibt_chan_sizes_t:
    uint_t cs_inline;

    This field is used during both channel allocation/modification (to
    size the WQE) and during channel query (to show the current
    supported inline size).

New WR flag in ibt_wr_flags_t:
    #define IBT_WR_SEND_INLINE  (1 << 6)

    Ignore L_Keys in SGL and copy data inline when this WR is posted.
    Valid only for SEND and RDMA-WRITE WRs, ignored for other WRs.

New ibt_status_t value:
    IBT_CHAN_WQE_SZ_INSUFF = 417,       /* TI */
    #define IBT_QP_WQE_SZ_INSUFF IBT_CHAN_WQE_SZ_INSUFF    /* CI */

    Indicates that the specified inline data payload was too large.
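
The restrictions above suggest a simple pre-posting check: use inline
data only for SEND or RDMA-WRITE, and only when the total payload fits
in the channel's inline size. The sketch below is illustrative only.
wr_op_t, sge_t and inline_flag_for() are hypothetical; only the
IBT_WR_SEND_INLINE value and the cs_inline name come from this
proposal.

```c
#include <stddef.h>
#include <stdint.h>

#define	IBT_WR_SEND_INLINE	(1 << 6)	/* value from this proposal */

/* Hypothetical opcode and SGL entry types, reduced to what the check needs. */
typedef enum wr_op { OP_SEND, OP_RDMA_WRITE, OP_RDMA_READ } wr_op_t;
typedef struct sge { uint32_t len; } sge_t;

/*
 * Return IBT_WR_SEND_INLINE if the WR may be posted inline, else 0.
 * cs_inline is the channel's inline size in bytes (0 = unsupported).
 */
static unsigned int
inline_flag_for(wr_op_t op, const sge_t *sgl, size_t nds,
    unsigned int cs_inline)
{
	uint64_t total = 0;
	size_t i;

	if (cs_inline == 0 || (op != OP_SEND && op != OP_RDMA_WRITE))
		return (0);
	for (i = 0; i < nds; i++)
		total += sgl[i].len;
	return (total <= cs_inline ? IBT_WR_SEND_INLINE : 0);
}
```

Checking the size up front lets the caller fall back to a normal SGL
post instead of handling IBT_CHAN_WQE_SZ_INSUFF after the fact.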

Man page changes: ibc_post_send.9e, ibt_post_send.9f,
ibt_chan_sizes_t.9s, ibt_hca_attr_t.9s and ibt_send_wr_t.9s

D. Reserved L_Key

The Reserved L_Key concept was introduced in the 1.2 IB spec (see
10.6.4.3.2 in [1]). Basically, the idea is to provide a pre-defined
memory L_Key which references the whole I/O bus physical address space
of the HCA. This allows physical bus addresses to be used in WRs,
typically after doing a virtual to physical address mapping using
ibt_map_mem_iov(). This method is faster than memory registration,
because the HCA memory table state does not need to be updated. For
security reasons, it should be noted this is an L_Key (not an R_Key),
and therefore it cannot be used as the target of a remote RDMA. Also,
this is a privileged feature, not typically enabled in a user process
QP.

New Reserved L_Key support flag in ibt_hca_flags2_t:
    IBT_HCA2_RES_LKEY = 1 << 3,         /* Reserved L_Key */

New Reserved L_Key value field in ibt_hca_attr_t:
    ibt_lkey_t hca_reserved_lkey;       /* Reserved L_Key value */

New QP permission flag in ibt_attr_flags_t:
    IBT_FAST_REG_RES_LKEY       = (1 << 1)

    This flag is used on channel/QP allocation to enable Reserved
    L_Key. Query channel/QP operations read the value. Per the IB
    spec (see 11.2.4.1 in [1]), this permission also controls the Fast
    Memory Registration feature (which is not implemented yet).

New map memory IOV operation to fill out SGL list for Reserved L_Key:
    ibt_status_t ibt_map_mem_iov(ibt_hca_hdl_t hca_hdl,         /* TI */
        ibt_iov_attr_t *iov_attr, ibt_all_wr_t *wr,
        ibt_mi_hdl_t *mi_hdl);  

    ibc_status_t prefix_ibc_map_mem_iov(ibc_hca_hdl_t hca_hdl,  /* CI */
        ibt_iov_attr_t *iov_attr, ibt_all_wr_t *wr,
        ibc_mi_hdl_t *mi_hdl); 

New unmap memory IOV operations:
    ibt_status_t ibt_unmap_mem_iov(ibt_hca_hdl_t hca_hdl,       /* TI */
        ibt_mi_hdl_t mi_hdl);

    ibc_status_t prefix_ibc_unmap_mem_iov(ibc_hca_hdl_t hca_hdl, /* CI */
        ibc_mi_hdl_t mi_hdl);
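
To illustrate what an SGL entry looks like once the Reserved L_Key is
in use, the sketch below builds one from a bus address. It is not the
ibt_map_mem_iov() implementation. hca_info_t, sge_t and
sge_from_bus_addr() are hypothetical, and the 32-bit width chosen for
the L_Key stand-in is an assumption.

```c
#include <stdint.h>

#define	IBT_HCA2_RES_LKEY	(1 << 3)	/* value from this proposal */

typedef uint32_t lkey_t;	/* assumed width; stands in for ibt_lkey_t */

/* Hypothetical subset of the HCA attributes relevant here. */
typedef struct hca_info {
	uint32_t	flags2;			/* ibt_hca_flags2_t bits */
	lkey_t		hca_reserved_lkey;	/* Reserved L_Key value */
} hca_info_t;

/* Hypothetical SGL entry. */
typedef struct sge {
	uint64_t	addr;
	uint32_t	len;
	lkey_t		lkey;
} sge_t;

/* Fill 'sge' for a physical bus address; return -1 if unsupported. */
static int
sge_from_bus_addr(const hca_info_t *hca, uint64_t bus_addr, uint32_t len,
    sge_t *sge)
{
	if ((hca->flags2 & IBT_HCA2_RES_LKEY) == 0)
		return (-1);
	sge->addr = bus_addr;
	sge->len = len;
	sge->lkey = hca->hca_reserved_lkey;
	return (0);
}
```

The point of the sketch is that no per-buffer registration state is
involved: every entry carries the same pre-defined key.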

Man page changes: ibci.9, ibti.9, ibc_alloc_qp.9e, ibc_map_mem_iov.9e,
ibt_alloc_rc_chan.9f, ibt_map_mem_iov.9f, ibt_alloc_ud_chan.9f and
ibt_hca_attr_t.9s

E. Large Send Offload (LSO)

HCAs have implemented stateless offloads similar to those in Ethernet
NICs. In LSO, a large IP payload is sent down to the HCA along with a
"template" for the header. The header template is based on the IPoIB
encapsulation format for TCP/IP (IPv4 or v6). The HCA chops up the big
payload into pieces that fit with the header in MTU size packets (IB
MTU is typically 2 or 4 KB). The header in each packet is generated by
the HCA based on the template provided.

New field for HCA maximum LSO payload & header size in ibt_hca_attr_t:
    uint_t      hca_max_lso_size;       /* in bytes */
    uint_t      hca_max_lso_hdr_size;   /* in bytes */

    Zero values mean the HCA does not support LSO.

New flag for LSO usage at QP/channel allocation time in ibt_attr_flags_t:
    IBT_USES_LSO = (1 << 2)

New LSO work request/completion opcode value in ibt_wrc_opcode_t:
    #define IBT_WRC_SEND_LSO    11

New WR format for LSO in ibt_send_wr_t (new "wr" field union variant):
    typedef struct ibt_wr_lso_s {
        ibt_ud_dest_hdl_t       lso_ud_dest;    /* address handle */
        uint8_t                 *lso_hdr;       /* header template pointer */
        ib_msglen_t             lso_hdr_sz;     /* size of header */
        ib_msglen_t             lso_mss;        /* segment payload size */
    } ibt_wr_lso_t;

    Variant to the "wr" part of the send WR struct. The payload is
    segmented into lso_mss size pieces with the header generated from
    the template pointed to by lso_hdr (of size lso_hdr_sz bytes).
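
As a worked example of the segmentation arithmetic, the number of
packets the HCA emits is the payload length divided by lso_mss,
rounded up. The sketch below also checks a request against the
advertised HCA limits. lso_req_t, lso_packet_count() and
lso_request_fits() are hypothetical helpers, and treating either
limit being zero as "no LSO" is a conservative assumption.

```c
#include <stdint.h>

/* Hypothetical summary of an LSO request; not an IBTF type. */
typedef struct lso_req {
	uint32_t	payload_len;	/* total payload bytes */
	uint32_t	hdr_sz;		/* header template size (lso_hdr_sz) */
	uint32_t	mss;		/* payload bytes per packet (lso_mss) */
} lso_req_t;

/* Packets generated: ceil(payload_len / mss); 0 if mss is 0. */
static uint32_t
lso_packet_count(const lso_req_t *req)
{
	if (req->mss == 0)
		return (0);
	return (req->payload_len / req->mss +
	    (req->payload_len % req->mss != 0));
}

/* Nonzero if the request is within the limits the HCA advertises. */
static int
lso_request_fits(const lso_req_t *req, uint32_t hca_max_lso_size,
    uint32_t hca_max_lso_hdr_size)
{
	if (hca_max_lso_size == 0 || hca_max_lso_hdr_size == 0)
		return (0);	/* zero values: no LSO support */
	return (req->mss != 0 &&
	    req->payload_len <= hca_max_lso_size &&
	    req->hdr_sz <= hca_max_lso_hdr_size);
}
```

For instance, a 65536-byte payload with an lso_mss of 1992 yields 33
packets: 32 full segments (63744 bytes) plus one of 1792 bytes.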

New ibt_status_t error:
    IBT_CHAN_WQE_SZ_INSUFF = 417,       /* TI */
    #define IBT_QP_WQE_SZ_INSUFF IBT_CHAN_WQE_SZ_INSUFF    /* CI */

    Indicates that the specified LSO header was too large.

Man page changes: ibc_alloc_qp.9e, ibc_post_send.9e,
ibt_alloc_ud_chan.9f, ibt_post_send.9f, ibt_hca_attr_t.9s,
ibt_wc_t.9s, ibt_send_wr_t.9s, ibt_wr_lso_t.9s and ibt_wr_ud_t.9s

F. CQ Moderation

CQ Moderation is an IB version of interrupt moderation. Instead of
having completion notification (i.e. an interrupt) occur when the
first completion is added to a completion queue (CQ), it's now
adjustable based on the number of completions or a timeout.

New fields for HCA maximum CQ moderation values in ibt_hca_attr_t:
    uint_t hca_max_cq_mod_count;
    uint_t hca_max_cq_mod_usec;

    The maximum values for CQ moderation in terms of completions
    (hca_max_cq_mod_count) or timeout in microseconds
    (hca_max_cq_mod_usec). Zero values for both indicate the HCA does
    not support this feature. Note that a count of 1 means the same
    thing as zero, but zero is much easier to test for in code.

New Modify CQ HCA operation:
    ibt_status_t ibt_modify_cq(ibt_cq_hdl_t cq, uint_t count, 
        uint_t usec, uint_t reserved);  /* TI */

    ibt_status_t (*ibc_modify_cq)(ibc_hca_hdl_t hca, ibc_cq_hdl_t cq,
        uint_t count, uint_t usec, uint_t reserved);    /* CI */

    Set the CQ moderation parameters on a CQ. An unsupported value
    causes the IBT_INVALID_PARAM error. A value of zero for either
    count or usec disables that aspect of CQ moderation.
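
Because an out-of-range value is rejected, a client may want to clamp
its preferred settings to the HCA maximums before calling
ibt_modify_cq(). The sketch below shows one way to do that. cq_mod_t,
cq_mod_supported() and cq_mod_clamp() are hypothetical helpers, not
part of the interface.

```c
/* Hypothetical pair of CQ moderation settings; not an IBTF type. */
typedef struct cq_mod {
	unsigned int	count;	/* completions before notification */
	unsigned int	usec;	/* timeout in microseconds */
} cq_mod_t;

/* The HCA lacks CQ moderation only when both maximums are zero. */
static int
cq_mod_supported(unsigned int max_count, unsigned int max_usec)
{
	return (max_count != 0 || max_usec != 0);
}

/* Limit the requested settings to hca_max_cq_mod_count/usec. */
static cq_mod_t
cq_mod_clamp(cq_mod_t want, unsigned int max_count, unsigned int max_usec)
{
	if (want.count > max_count)
		want.count = max_count;
	if (want.usec > max_usec)
		want.usec = max_usec;
	return (want);
}
```

A maximum of zero clamps the corresponding setting to zero, which
matches the rule that zero disables that aspect of moderation.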

Modified Query CQ HCA operation:
    ibt_status_t ibt_query_cq(ibt_cq_hdl_t ibt_cq, uint_t *entries, 
        uint_t *count_p, uint_t *usec_p, uint_t *res_p);   /* TI */

                                        /* CI */
    ibt_status_t (*ibc_query_cq)(ibc_hca_hdl_t hca, ibc_cq_hdl_t cq,
        uint_t *entries, uint_t *count_p, uint_t *usec_p, uint_t *res_p);

    The last three arguments of each are new and show the current
    settings on a CQ. A value of zero for either count or usec shows
    that aspect of CQ moderation is disabled.

Man pages changed: ibci.9, ibti.9, ibc_modify_cq.9e, ibc_query_cq.9e,
ibt_modify_cq.9f, ibt_query_cq.9f, ibc_operations_t.9s and
ibt_hca_attr_t.9s

G. IP Classification Flags

If this HCA feature is supported, then when an IP packet is detected
on the UD transport type (the type used by IPoIB), a number of flags
are set in the work completion (WC) record to show what was found by
the HCA hardware.

New IP classification support flag in ibt_hca_flags2_t:
    IBT_HCA2_IP_CLASS = 1 << 5  /* has IP classification flags */

New flags field in the ibt_wc_t:
    uint32_t    wc_detail;      /* UD: IPoIB flags */

New flag definitions for the wc_detail field:
    /* IPoIB flags for wc_detail field */
    #define IBT_WC_DETAIL_ALL_FLAGS_MASK    (0x0FC00000)
    #define IBT_WC_DETAIL_IPV4              (1 << 22)   /* IPv4 header */
    #define IBT_WC_DETAIL_IPV4_FRAG         (1 << 23)   /* IPv4 fragment */
    #define IBT_WC_DETAIL_IPV6              (1 << 24)   /* IPv6 header */
    #define IBT_WC_DETAIL_IPV4_OPT          (1 << 25)   /* IPv4 option hdr */
    #define IBT_WC_DETAIL_TCP               (1 << 26)   /* TCP header */
    #define IBT_WC_DETAIL_UDP               (1 << 27)   /* UDP header */
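
A receive path could use these flags to recognize common packet
shapes without parsing headers. The sketch below tests for a packet
that the HCA classified as IPv4 plus TCP with no fragment or option
flags set. wc_is_simple_tcp4() is a hypothetical helper, and how a
client acts on the result is outside this proposal.

```c
#include <stdint.h>

/* Flag definitions from this proposal. */
#define	IBT_WC_DETAIL_ALL_FLAGS_MASK	(0x0FC00000)
#define	IBT_WC_DETAIL_IPV4		(1 << 22)
#define	IBT_WC_DETAIL_IPV4_FRAG		(1 << 23)
#define	IBT_WC_DETAIL_IPV6		(1 << 24)
#define	IBT_WC_DETAIL_IPV4_OPT		(1 << 25)
#define	IBT_WC_DETAIL_TCP		(1 << 26)
#define	IBT_WC_DETAIL_UDP		(1 << 27)

/*
 * Nonzero if exactly the IPv4 and TCP classification bits are set,
 * i.e. no fragment, option, IPv6 or UDP bits. Bits outside the
 * classification mask are ignored.
 */
static int
wc_is_simple_tcp4(uint32_t wc_detail)
{
	uint32_t flags = wc_detail & IBT_WC_DETAIL_ALL_FLAGS_MASK;

	return (flags == (IBT_WC_DETAIL_IPV4 | IBT_WC_DETAIL_TCP));
}
```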

Man page changes: ibt_hca_attr_t.9s, ibt_wc_t.9s

H. Detailed WQE Sizes

The reality of HCA design often means that giving a single number for
SGL size per WR is not quite right. In some cases, like inline data,
and using ibt_map_mem_iov (with Reserved L_Key), ULPs will want to
know more detailed information about how much space there really is in
the WQE. 

New flag for support of "detailed" WQE sizes added to ibt_hca_flags_t:
    IBT_HCA_WQE_SIZE_INFO = 1 << 29

New fields giving exact sizes for certain transport/work-request
combinations: 
    /* Inline data sizes in bytes, valid only if inline data is supported */
    uint_t      hca_ud_send_inline_sz;          /* UD Send */
    uint_t      hca_conn_send_inline_sz;        /* RC Send */
    uint_t      hca_conn_rdmaw_inline_overhead; /* RDMA-W overhead */
    /* SGL lengths */
    uint_t      hca_recv_sgl_sz;                /* Receive */
    uint_t      hca_ud_send_sgl_sz;             /* UD Send */
    uint_t      hca_conn_send_sgl_sz;           /* RC Send */
    uint_t      hca_conn_rdma_sgl_overhead;     /* RDMA-W/R overhead */

Man page changes: ibt_hca_attr_t.9s

4.3. Summary of changes by man page

Man Page                Disposition     Reasons for change
-----------------------------------------------------------
ibci.9                  changed         D, F
ibti.9                  changed         D, F

ibc_alloc_qp.9e         changed         D, E
ibc_map_mem_iov.9e      new             D
ibc_modify_cq.9e        new             F
ibc_query_cq.9e         changed         F
ibc_post_send.9e        changed         C, E

ibt_alloc_rc_chan.9f    changed         D
ibt_alloc_ud_chan.9f    changed         D, E
ibt_map_mem_iov.9f      new             D
ibt_modify_cq.9f        new             F
ibt_query_cq.9f         changed         F
ibt_post_send.9f        changed         C, E

ibc_operations_t.9s     changed         F
ibt_chan_sizes_t.9s     changed         C
ibt_hca_attr_t.9s       changed         A, B, C, D, E, F, G, H
ibt_send_wr_t.9s        changed         C, E
ibt_wc_t.9s             changed         E, G
ibt_wr_lso_t.9s         new             E
ibt_wr_ud_t.9s          changed         E

4.4. References

[1] InfiniBand Architecture Specification Volume 1, Release
1.2.1. InfiniBand Trade Association, 2007.

http://www.infinibandta.org/members/spec/V1r1_2_1.Release_12062007.zip
(requires IBTA member login)

6. Resources and Schedule
    6.4. Steering Committee requested information
        6.4.1. Consolidation C-team Name:
                ON
    6.5. ARC review type: FastTrack
    6.6. ARC Exposure: open

