On Tue, 2015-05-26 at 14:03 +0000, Wan, Kaike wrote:
> I. Introduction
>
> After posting our design to the mailing list, we received comments concerning
> various aspects of the
> design from Sean Hefty, Ira Weiny, Jason Gunthorpe, and Doug Ledford. Thank
> you all for the help.
>
> The main issues are listed below:
> 1. Extensibility: the design should be flexible and readily extended to other
> applications;
> 2. Multiple data records: a query can return multiple data records (e.g.,
> multiple pathrecords);
> 3. Existing code: the design should use existing code as much as possible;
> 4. Various query points in the kernel: what are the requirements (parameters,
> expected results) for
> various queries that may exist in the kernel (IPoIB, RDMA CM, etc).
>
> As our subject title indicates, we are trying to design for the kernel to
> query a local user-space
> service, more specifically, for the ib_sa module to send a pathrecord query
> to a local user-space SA cache.
> If anyone has information or requirements for other kernel query points, we
> will be happy to know.
>
> In our previous design, we created a data header to contain various
> information about the query and
> response:
>
> struct ib_nl_data_hdr {
> __u8 version;
> __u8 opcode;
> __u16 status;
> __u16 type;
> __u16 reserved;
> __u32 flags;
> __u32 length;
> };
>
> This was modeled after the ibacm messages and the message layout is
> diagrammed below:
>
> +----------------+
> | netlink header |
> +----------------+
> | Data header |
> +----------------+
> | Data |
> +----------------+
>
> The design was extensible, but suffered from the fact that it did not make
> full use of the netlink
> message header.
>
> In this version of the design, we will make full use of the netlink header
> and the existing attribute
> interface, as detailed below.
>
> II. Message layout
>
> The general message layout is shown here:
>
>
> +----------------+
> | netlink header |
> +----------------+
> | Attribute 1 |
> +----------------+
> | Attribute 2 |
> +----------------+
> | ... |
> +----------------+
> | Attribute N |
> +----------------+
>
> The number of attributes present in the request/response varies. As shown,
> there is no new data
> header to describe either the request or the response. The netlink header
> and various attributes
> will be described later.
>
> III. Netlink protocol, multicast group, and kernel client
>
> This design is targeted to the NETLINK_RDMA protocol, and a new multicast
> group RDMA_NL_GROUP_LS is
> added for the local service:
>
> enum {
> RDMA_NL_GROUP_CM = 1,
> RDMA_NL_GROUP_IWPM,
> RDMA_NL_GROUP_LS,
> RDMA_NL_NUM_GROUPS
> };
>
> In addition, each kernel client should define a client index so that the
> common rdma code could
> route the response to the right client. For this purpose, we define the
> RDMA_NL_SA client for the
> ib_sa module:
>
> enum {
> RDMA_NL_RDMA_CM = 1,
> RDMA_NL_NES,
> RDMA_NL_C4IW,
> RDMA_NL_SA,
> RDMA_NL_NUM_CLIENTS
> };
>
> As mentioned previously, each query point in the kernel should have its own
> client index.
>
> IV. Netlink message header
>
> The netlink header is copied here:
>
> struct nlmsghdr {
> __u32 nlmsg_len; /* Length of message including header */
> __u16 nlmsg_type; /* Message content */
> __u16 nlmsg_flags; /* Additional flags */
> __u32 nlmsg_seq; /* Sequence number */
> __u32 nlmsg_pid; /* Sending process port ID */
> };
>
> The message type for rdma clients is also copied below:
>
> #define RDMA_NL_GET_TYPE(client, op) ((client << 10) + op)
>
> More clearly:
>
> Bits Description
> --------------------------
> 15-10 Client index
> 09-00 Opcode
>
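
To make the bit layout above concrete, here is a minimal sketch of packing and unpacking nlmsg_type. RDMA_NL_GET_TYPE mirrors the quoted macro (with argument parentheses added); ls_make_type, ls_type_client, ls_type_op and the LS_TYPE_* constants are hypothetical names used only for illustration, not part of the proposal.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the macro quoted above, with parentheses around the arguments. */
#define RDMA_NL_GET_TYPE(client, op) (((client) << 10) + (op))

/* Hypothetical constants for this sketch: opcode in bits 9-0, client in 15-10. */
#define LS_TYPE_OP_BITS 10
#define LS_TYPE_OP_MASK ((1u << LS_TYPE_OP_BITS) - 1)

enum { RDMA_NL_RDMA_CM = 1, RDMA_NL_NES, RDMA_NL_C4IW, RDMA_NL_SA,
       RDMA_NL_NUM_CLIENTS };
enum { RDMA_NL_LS_OP_RESOLVE = 0, RDMA_NL_LS_OP_SET_TIMEOUT,
       RDMA_NL_LS_NUM_OPS };

/* Pack a client index and an opcode into a 16-bit nlmsg_type value. */
static uint16_t ls_make_type(unsigned client, unsigned op)
{
    assert(client < (1u << 6));     /* the client index has 6 bits */
    assert(op <= LS_TYPE_OP_MASK);  /* the opcode has 10 bits */
    return (uint16_t)RDMA_NL_GET_TYPE(client, op);
}

/* Recover the client index (bits 15-10). */
static unsigned ls_type_client(uint16_t type)
{
    return (unsigned)type >> LS_TYPE_OP_BITS;
}

/* Recover the opcode (bits 9-0). */
static unsigned ls_type_op(uint16_t type)
{
    return type & LS_TYPE_OP_MASK;
}
```

With the enums as quoted, ls_make_type(RDMA_NL_SA, RDMA_NL_LS_OP_SET_TIMEOUT) is (4 << 10) + 1, and the two accessors return 4 and 1 for that value.
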
> As described previously, a netlink message is routed by protocol
> (NETLINK_RDMA), multicast group
> (RDMA_NL_GROUP_LS), and client (encoded in the nlmsg_type field for rdma messages).
> Therefore, the
> opcode (encoded in nlmsg_type), the sequence number (nlmsg_seq) and additional
> flags (nlmsg_flags)
> are all local to the client. This is important when we define these fields as
> they can overlap for
> different clients.
>
> (1) Opcode
>
> The opcode for local service SA client is defined below:
>
> enum {
> RDMA_NL_LS_OP_RESOLVE = 0,
> RDMA_NL_LS_OP_SET_TIMEOUT,
> RDMA_NL_LS_NUM_OPS
> };
>
> The RESOLVE opcode is used by ib_sa to send a pathrecord query to the
> user-space application
> while the SET_TIMEOUT opcode can be used by the user-space application to set
> the netlink timeout
> value for the kernel client. Additional opcodes can be added if necessary.
>
> It should be emphasized that the opcode is client specific, so the same
> opcode values can be reused by
> different clients. Therefore, the 10 bits should be large enough for various
> requests.
>
> (2) nlmsg_flags
>
> The flags field is again client specific. But the lower byte (bits 7-0) is
> generally reserved
> and the upper bits can be used to define request specific flags:
>
> #define RDMA_NL_LS_F_OK 0x0100 /* Success response */
> #define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */
>
> These two bits can be used to indicate whether a message is a response. If
> the status is ERR, an
> error code can be contained in a status attribute, as described below.
>
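
A minimal sketch of how a receiver might interpret these two bits. The ls_msg_kind enum and ls_classify function are hypothetical names for illustration. Treating "neither bit set" as a request follows from the description above; treating "both bits set" as malformed is my assumption, since the proposal does not say.

```c
#include <stdint.h>

#define RDMA_NL_LS_F_OK  0x0100 /* Success response */
#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */

/* Hypothetical classification of a message by its response bits. */
enum ls_msg_kind {
    LS_MSG_REQUEST,   /* neither bit set: not a response */
    LS_MSG_RESP_OK,   /* only RDMA_NL_LS_F_OK set */
    LS_MSG_RESP_ERR,  /* only RDMA_NL_LS_F_ERR set; expect a status attribute */
    LS_MSG_INVALID    /* both bits set (assumed to be malformed) */
};

static enum ls_msg_kind ls_classify(uint16_t nlmsg_flags)
{
    /* Ignore the lower byte and any other bits; look only at OK/ERR. */
    uint16_t resp = nlmsg_flags & (RDMA_NL_LS_F_OK | RDMA_NL_LS_F_ERR);

    if (resp == 0)
        return LS_MSG_REQUEST;
    if (resp == RDMA_NL_LS_F_OK)
        return LS_MSG_RESP_OK;
    if (resp == RDMA_NL_LS_F_ERR)
        return LS_MSG_RESP_ERR;
    return LS_MSG_INVALID;
}
```
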
> (3) Attribute type
>
> Request parameters and response data records can be embedded in attributes.
>
> The attribute header is copied here:
>
> struct nlattr {
> __u16 nla_len;
> __u16 nla_type;
> };
>
> Each attribute is preceded by the attribute header and followed by attribute
> specific data.
>
> Note that the attribute type is request (opcode) specific and
> therefore could be
> overloaded for different requests if needed.
>
> For the ib_sa RESOLVE query, the following attribute types are defined:
>
> enum {
> LS_NLA_TYPE_STATUS = 0,
> LS_NLA_TYPE_ADDRESS,
> LS_NLA_TYPE_PATH_RECORD,
> LS_NLA_TYPE_MAX
> };
>
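
Since a response may carry several attributes (for example multiple pathrecords), here is a minimal user-space sketch of appending and walking them in a flat buffer. struct nlattr is redeclared locally so the sketch is self-contained (on Linux it comes from <linux/netlink.h>); LS_NLA_ALIGN, LS_NLA_HDRLEN, ls_put_attr and ls_count_attrs are hypothetical names, and the 4-byte padding follows the usual netlink attribute alignment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the attribute header quoted above. */
struct nlattr {
    uint16_t nla_len;   /* header + payload length, excluding padding */
    uint16_t nla_type;
};

/* Hypothetical helpers: netlink attributes are padded to 4 bytes. */
#define LS_NLA_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)
#define LS_NLA_HDRLEN     LS_NLA_ALIGN(sizeof(struct nlattr))

enum { LS_NLA_TYPE_STATUS = 0, LS_NLA_TYPE_ADDRESS,
       LS_NLA_TYPE_PATH_RECORD, LS_NLA_TYPE_MAX };

/* Append one attribute at offset off; returns the new offset, or 0 on error. */
static size_t ls_put_attr(unsigned char *buf, size_t cap, size_t off,
                          uint16_t type, const void *data, size_t dlen)
{
    size_t total = LS_NLA_HDRLEN + dlen;
    size_t padded = LS_NLA_ALIGN(total);
    struct nlattr hdr;

    if (total > UINT16_MAX || off > cap || padded > cap - off)
        return 0;
    hdr.nla_len = (uint16_t)total;
    hdr.nla_type = type;
    memset(buf + off, 0, padded);           /* zero the padding bytes too */
    memcpy(buf + off, &hdr, sizeof(hdr));
    if (dlen)
        memcpy(buf + off + LS_NLA_HDRLEN, data, dlen);
    return off + padded;
}

/* Count attributes of one type; returns -1 if the buffer is malformed. */
static int ls_count_attrs(const unsigned char *buf, size_t len, uint16_t type)
{
    int count = 0;

    while (len >= LS_NLA_HDRLEN) {
        struct nlattr hdr;
        size_t step;

        memcpy(&hdr, buf, sizeof(hdr));     /* avoid unaligned access */
        if (hdr.nla_len < LS_NLA_HDRLEN || hdr.nla_len > len)
            return -1;
        if (hdr.nla_type == type)
            count++;
        step = LS_NLA_ALIGN(hdr.nla_len);
        if (step > len)                     /* last attribute may be unpadded */
            step = len;
        buf += step;
        len -= step;
    }
    return len == 0 ? count : -1;
}
```
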
> (4) Status attribute
>
> The status attribute is mostly used to carry an error code if the
> RDMA_NL_LS_F_ERR bit in the nlmsg_flags
> field of the netlink message header is set. If the response is success, there
> is no need to include
> this attribute in the response data (it's not an error, either).
>
> enum {
> LS_NLA_STATUS_SUCCESS = 0,
> LS_NLA_STATUS_INVAL,
> LS_NLA_STATUS_ENODATA,
> LS_NLA_STATUS_MAX
> };
>
> struct rdma_nla_ls_status {
> __u32 status;
> };
>
> (5) Address attribute
>
> This attribute is normally included in the RESOLVE request.
>
> enum {
> LS_NLA_ADDR_F_SRC = 1,
> LS_NLA_ADDR_F_DST = (1<<1),
> LS_NLA_ADDR_F_HOSTNAME = (1<<2),
> LS_NLA_ADDR_F_IPV4 = (1<<3),
> LS_NLA_ADDR_F_IPV6 = (1<<4)
> };
>
> struct rdma_nla_ls_addr {
> __u32 flags;
> __u32 addr[0];
> };
>
> The address can be a hostname (string), an IPv4 address, or an IPv6
> address. The source and destination flags are
> also defined.
>
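
A minimal sketch of how a receiver could sanity-check the address flags before reading addr[]. The proposal does not state which flag combinations are legal, so the rule used here (exactly one of SRC/DST and exactly one of HOSTNAME/IPV4/IPV6) is my assumption, and ls_addr_len is a hypothetical helper name.

```c
#include <stdint.h>

enum {
    LS_NLA_ADDR_F_SRC      = 1,
    LS_NLA_ADDR_F_DST      = (1 << 1),
    LS_NLA_ADDR_F_HOSTNAME = (1 << 2),
    LS_NLA_ADDR_F_IPV4     = (1 << 3),
    LS_NLA_ADDR_F_IPV6     = (1 << 4)
};

/*
 * Expected size in bytes of the address that follows the flags word:
 * 4 for IPv4, 16 for IPv6, 0 for a hostname (variable length string),
 * or -1 if the flag combination is rejected under the assumed rule.
 */
static int ls_addr_len(uint32_t flags)
{
    uint32_t dir = flags & (LS_NLA_ADDR_F_SRC | LS_NLA_ADDR_F_DST);
    uint32_t kind = flags & (LS_NLA_ADDR_F_HOSTNAME | LS_NLA_ADDR_F_IPV4 |
                             LS_NLA_ADDR_F_IPV6);

    /* Assumption: an address is either the source or the destination. */
    if (dir != LS_NLA_ADDR_F_SRC && dir != LS_NLA_ADDR_F_DST)
        return -1;

    /* Assumption: exactly one address kind is indicated. */
    switch (kind) {
    case LS_NLA_ADDR_F_IPV4:
        return 4;
    case LS_NLA_ADDR_F_IPV6:
        return 16;
    case LS_NLA_ADDR_F_HOSTNAME:
        return 0;
    default:
        return -1;
    }
}
```
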
> (6) Pathrecord attribute
>
> This attribute can be included in both the RESOLVE request and response.
>
> enum {
> LS_NLA_PATH_F_GMP = 1,
> LS_NLA_PATH_F_PRIMARY = (1<<1),
> LS_NLA_PATH_F_ALTERNATE = (1<<2),
> LS_NLA_PATH_F_OUTBOUND = (1<<3),
> LS_NLA_PATH_F_INBOUND = (1<<4),
> LS_NLA_PATH_F_INBOUND_REVERSE = (1<<5),
> LS_NLA_PATH_F_BIDIRECTIONAL = LS_NLA_PATH_F_OUTBOUND |
> LS_NLA_PATH_F_INBOUND_REVERSE,
> LS_NLA_PATH_F_USER = (1<<6)
> };
>
> struct rdma_nla_ls_path_rec {
> __u32 flags;
> __u32 path_rec[0];
> };
>
> The format of the pathrecord can be indicated by the flags and the data is
> contained in path_rec[].
> For example, when LS_NLA_PATH_F_USER is set, the format is struct
> ib_user_path_rec.
>
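
A minimal sketch of testing the pathrecord flags. It assumes the bidirectional flag is meant to combine the LS_NLA_PATH_F_OUTBOUND and LS_NLA_PATH_F_INBOUND_REVERSE bits and that the user-format flag is bit 6; ls_path_is_bidirectional and ls_path_is_user_format are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    LS_NLA_PATH_F_GMP             = 1,
    LS_NLA_PATH_F_PRIMARY         = (1 << 1),
    LS_NLA_PATH_F_ALTERNATE       = (1 << 2),
    LS_NLA_PATH_F_OUTBOUND        = (1 << 3),
    LS_NLA_PATH_F_INBOUND         = (1 << 4),
    LS_NLA_PATH_F_INBOUND_REVERSE = (1 << 5),
    /* Assumed to be the union of the two directional bits. */
    LS_NLA_PATH_F_BIDIRECTIONAL   = LS_NLA_PATH_F_OUTBOUND |
                                    LS_NLA_PATH_F_INBOUND_REVERSE,
    LS_NLA_PATH_F_USER            = (1 << 6)
};

/* A multi-bit flag matches only if every one of its bits is set. */
static bool ls_path_is_bidirectional(uint32_t flags)
{
    return (flags & LS_NLA_PATH_F_BIDIRECTIONAL) ==
           LS_NLA_PATH_F_BIDIRECTIONAL;
}

/* When set, path_rec[] is expected to hold a struct ib_user_path_rec. */
static bool ls_path_is_user_format(uint32_t flags)
{
    return (flags & LS_NLA_PATH_F_USER) != 0;
}
```

Note that a plain (flags & LS_NLA_PATH_F_BIDIRECTIONAL) test would also be true for an outbound-only path, which is why the helper compares against the full mask.
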
> V. Summary
>
> It's clear that this design is flexible, extensible, and can be easily
> enhanced to address various
> kernel query points. It uses the existing netlink message header and
> attribute interface, and can
> contain multiple attribute records.
>
>
>
> Change since v1:
> -- Completely revised the design to use netlink header and attribute
> interface.

On the face of it, this is a much improved design.

-- 
Doug Ledford <[email protected]>
              GPG KeyID: 0E572FDD
