I. Introduction
After posting our design to the mailing list, we received comments concerning
various aspects of the design from Sean Hefty, Ira Weiny, Jason Gunthorpe, and
Doug Ledford. Thank you all for the help.
The main issues are listed below:
1. Extensibility: the design should be flexible and readily extended to other
applications;
2. Multiple data records: a query can return multiple data records (e.g.,
multiple pathrecords);
3. Existing code: the design should use existing code as much as possible;
4. Various query points in the kernel: what are the requirements (parameters,
expected results) for the various queries that may exist in the kernel (IPoIB,
RDMA CM, etc.)?
As our subject title indicates, we are trying to design for the kernel to query
a local user-space service, more specifically, for the ib_sa module to send a
pathrecord query to a local user-space SA cache.
If anyone has information or requirements for other kernel query points, we
would be glad to hear about them.
In our previous design, we created a data header to contain various information
about the query and
response:
struct ib_nl_data_hdr {
__u8 version;
__u8 opcode;
__u16 status;
__u16 type;
__u16 reserved;
__u32 flags;
__u32 length;
};
This was modeled after the ibacm messages and the message layout is diagrammed
below:
+----------------+
| netlink header |
+----------------+
| Data header    |
+----------------+
| Data           |
+----------------+
The design was extensible, but suffered from the fact that it did not make full
use of the netlink message header.
In this version of the design, we will make full use of the netlink header and
the existing attribute
interface, as detailed below.
II. Message layout
The general message layout is shown here:
+----------------+
| netlink header |
+----------------+
| Attribute 1    |
+----------------+
| Attribute 2    |
+----------------+
| ...            |
+----------------+
| Attribute N    |
+----------------+
The number of attributes present in the request/response varies. As shown,
there is no new data header to describe either the request or the response.
The netlink header and the various attributes are described in the sections
below.
III. Netlink protocol, multicast group, and kernel client
This design is targeted to the NETLINK_RDMA protocol, and a new multicast group
RDMA_NL_GROUP_LS is
added for the local service:
enum {
RDMA_NL_GROUP_CM = 1,
RDMA_NL_GROUP_IWPM,
RDMA_NL_GROUP_LS,
RDMA_NL_NUM_GROUPS
};
In addition, each kernel client should define a client index so that the common
rdma code can route the response to the right client. For this purpose, we
define the RDMA_NL_SA client for the ib_sa module:
enum {
RDMA_NL_RDMA_CM = 1,
RDMA_NL_NES,
RDMA_NL_C4IW,
RDMA_NL_SA,
RDMA_NL_NUM_CLIENTS
};
As mentioned previously, each query point in the kernel should have its own
client index.
IV. Netlink message header
The netlink header is copied here:
struct nlmsghdr {
__u32 nlmsg_len; /* Length of message including header */
__u16 nlmsg_type; /* Message content */
__u16 nlmsg_flags; /* Additional flags */
__u32 nlmsg_seq; /* Sequence number */
__u32 nlmsg_pid; /* Sending process port ID */
};
The message type for rdma clients is also copied below:
#define RDMA_NL_GET_TYPE(client, op) ((client << 10) + op)
More clearly:
Bits Description
--------------------------
15-10 Client index
09-00 Opcode
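As a minimal sketch of this bit layout, the helpers below pack and unpack
nlmsg_type. The helper and mask names (ls_make_type, ls_type_client,
ls_type_op, LS_*) are made up for illustration and are not part of the
proposal; only the 6-bit/10-bit split comes from the table above.

```c
#include <assert.h>
#include <stdint.h>

#define LS_OP_BITS     10                       /* bits 09-00: opcode */
#define LS_OP_MASK     ((1u << LS_OP_BITS) - 1) /* 0x03ff */
#define LS_CLIENT_MASK 0x3fu                    /* bits 15-10: client index */

/* Hypothetical helper: combine client index and opcode into nlmsg_type. */
static uint16_t ls_make_type(unsigned int client, unsigned int op)
{
	assert(client <= LS_CLIENT_MASK && op <= LS_OP_MASK);
	return (uint16_t)((client << LS_OP_BITS) | op);
}

/* Hypothetical helper: extract the client index (bits 15-10). */
static unsigned int ls_type_client(uint16_t type)
{
	return (type >> LS_OP_BITS) & LS_CLIENT_MASK;
}

/* Hypothetical helper: extract the opcode (bits 09-00). */
static unsigned int ls_type_op(uint16_t type)
{
	return type & LS_OP_MASK;
}
```

With the client enum above, RDMA_NL_SA is 4, so client 4 with opcode 1 gives
a message type of 0x1001.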
As described previously, a netlink message is routed by protocol
(NETLINK_RDMA), multicast group (RDMA_NL_GROUP_LS), and client (encoded in the
nlmsg_type field for rdma messages). Therefore, the opcode (encoded in
nlmsg_type), the sequence number (nlmsg_seq) and the additional flags
(nlmsg_flags) are all local to the client. This is important when we define
these fields, as their values can overlap between different clients.
(1) Opcode
The opcodes for the local service SA client are defined below:
enum {
RDMA_NL_LS_OP_RESOLVE = 0,
RDMA_NL_LS_OP_SET_TIMEOUT,
RDMA_NL_LS_NUM_OPS
};
The RESOLVE opcode is used by ib_sa to send a pathrecord query to the
user-space application, while the SET_TIMEOUT opcode can be used by the
user-space application to set the netlink timeout value for the kernel client.
Additional opcodes can be added if necessary.
It should be emphasized that the opcode is client specific, so the same opcode
values can be reused by different clients. The 10 opcode bits should therefore
be large enough for the requests of any single client.
(2) nlmsg_flags
The flags field is again client specific, but the lower byte (bits 7-0) is
generally reserved, and the upper bits can be used to define request-specific
flags:
#define RDMA_NL_LS_F_OK 0x0100 /* Success response */
#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */
These two bits can be used to indicate whether a message is a response. If the
status is ERR, an error code can be contained in a status attribute, as
described below.
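A rough sketch of how a receiver might classify a message from these two bits
follows. The enum ls_msg_kind and the function ls_classify are hypothetical
names, and treating "both bits set" as malformed is our assumption; the design
does not say what that combination means.

```c
#include <assert.h>
#include <stdint.h>

#define RDMA_NL_LS_F_OK  0x0100 /* Success response */
#define RDMA_NL_LS_F_ERR 0x0200 /* Failed response */

/* Both flags must stay clear of the reserved low byte (bits 7-0). */
static_assert(((RDMA_NL_LS_F_OK | RDMA_NL_LS_F_ERR) & 0x00ff) == 0,
	      "response flags overlap the reserved byte");

/* Hypothetical classification of a message by its nlmsg_flags. */
enum ls_msg_kind {
	LS_MSG_REQUEST,
	LS_MSG_RESP_OK,
	LS_MSG_RESP_ERR,
	LS_MSG_INVALID
};

static enum ls_msg_kind ls_classify(uint16_t nlmsg_flags)
{
	uint16_t resp = nlmsg_flags & (RDMA_NL_LS_F_OK | RDMA_NL_LS_F_ERR);

	if (resp == 0)
		return LS_MSG_REQUEST;   /* neither bit set: not a response */
	if (resp == RDMA_NL_LS_F_OK)
		return LS_MSG_RESP_OK;
	if (resp == RDMA_NL_LS_F_ERR)
		return LS_MSG_RESP_ERR;  /* caller should look for a status attribute */
	return LS_MSG_INVALID;           /* both bits set: malformed (assumption) */
}
```

Bits in the reserved low byte are ignored by the classification, so a request
that carries standard netlink flags is still reported as a request.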
(3) Attribute type
Request parameters and response data records can be embedded in attributes.
The attribute header is copied here:
struct nlattr {
__u16 nla_len;
__u16 nla_type;
};
Each attribute consists of the attribute header followed by attribute-specific
data.
Note that the attribute type is request (opcode) specific and can therefore be
overloaded for different requests if needed.
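To illustrate how several attributes can be packed back to back, here is a
small user-space sketch that appends attributes to a buffer and then walks
them. It mirrors struct nlattr locally (struct ls_nlattr) so that it builds
without kernel headers; ls_put_attr, ls_count_attr and the LS_NLA_* macros are
hypothetical names, and the 4-byte alignment follows the usual netlink
attribute convention.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct nlattr so the sketch builds without kernel headers. */
struct ls_nlattr {
	uint16_t nla_len;   /* header plus payload, excluding padding */
	uint16_t nla_type;
};

#define LS_NLA_ALIGN(len) (((size_t)(len) + 3) & ~(size_t)3)
#define LS_NLA_HDRLEN     LS_NLA_ALIGN(sizeof(struct ls_nlattr))

/* Append one attribute at *off. Returns 0 on success, -1 if it does not fit. */
static int ls_put_attr(uint8_t *buf, size_t cap, size_t *off,
		       uint16_t type, const void *data, size_t len)
{
	size_t total = LS_NLA_HDRLEN + len;
	size_t padded = LS_NLA_ALIGN(total);
	struct ls_nlattr hdr;

	assert(buf != NULL && off != NULL);
	if (total > UINT16_MAX || *off > cap || padded > cap - *off)
		return -1;
	hdr.nla_len = (uint16_t)total;
	hdr.nla_type = type;
	memset(buf + *off, 0, padded);            /* zero the padding bytes too */
	memcpy(buf + *off, &hdr, sizeof(hdr));
	if (len)
		memcpy(buf + *off + LS_NLA_HDRLEN, data, len);
	*off += padded;
	return 0;
}

/* Count the attributes of a given type in the first 'len' bytes of buf. */
static int ls_count_attr(const uint8_t *buf, size_t len, uint16_t type)
{
	size_t off = 0;
	int n = 0;

	while (off <= len && len - off >= LS_NLA_HDRLEN) {
		struct ls_nlattr hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.nla_len < LS_NLA_HDRLEN || hdr.nla_len > len - off)
			break;                    /* malformed or truncated */
		if (hdr.nla_type == type)
			n++;
		off += LS_NLA_ALIGN(hdr.nla_len); /* next attribute starts aligned */
	}
	return n;
}
```

A response carrying multiple pathrecords would simply call ls_put_attr once per
record with the same attribute type, and the receiver would find all of them
while walking the buffer.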
For the ib_sa RESOLVE query, the following attribute types are defined:
enum {
LS_NLA_TYPE_STATUS = 0,
LS_NLA_TYPE_ADDRESS,
LS_NLA_TYPE_PATH_RECORD,
LS_NLA_TYPE_MAX
};
(4) Status attribute
The status attribute is mostly used to carry an error code when the
RDMA_NL_LS_F_ERR bit in the nlmsg_flags field of the netlink message header is
set. If the response is a success, there is no need to include this attribute
in the response data (although including it is not an error, either).
enum {
LS_NLA_STATUS_SUCCESS = 0,
LS_NLA_STATUS_INVAL,
LS_NLA_STATUS_ENODATA,
LS_NLA_STATUS_MAX
};
struct rdma_nla_ls_status {
__u32 status;
};
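As a sketch of how a receiver might read this attribute, the helper below
copies the status out of an attribute payload and rejects short or unknown
values. The function name ls_read_status and its return convention are
hypothetical; uint32_t stands in for __u32 so the example builds in user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	LS_NLA_STATUS_SUCCESS = 0,
	LS_NLA_STATUS_INVAL,
	LS_NLA_STATUS_ENODATA,
	LS_NLA_STATUS_MAX
};

struct rdma_nla_ls_status {
	uint32_t status;
};

/*
 * Hypothetical helper: copy the status out of an attribute payload.
 * Returns 0 on success, -1 if the payload is too short or the code is unknown.
 */
static int ls_read_status(const void *payload, size_t len, uint32_t *status)
{
	struct rdma_nla_ls_status st;

	assert(payload != NULL && status != NULL);
	if (len < sizeof(st))
		return -1;
	memcpy(&st, payload, sizeof(st));   /* payload may be unaligned */
	if (st.status >= LS_NLA_STATUS_MAX)
		return -1;
	*status = st.status;
	return 0;
}
```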
(5) Address attribute
This attribute is normally included in the RESOLVE request.
enum {
LS_NLA_ADDR_F_SRC = 1,
LS_NLA_ADDR_F_DST = (1<<1),
LS_NLA_ADDR_F_HOSTNAME = (1<<2),
LS_NLA_ADDR_F_IPV4 = (1<<3),
LS_NLA_ADDR_F_IPV6 = (1<<4)
};
struct rdma_nla_ls_addr {
__u32 flags;
__u32 addr[0];
};
The address can be a hostname (string), an IPv4 address, or an IPv6 address.
Flags marking the address as source or destination are also defined.
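For illustration, the sketch below fills an address attribute payload for an
IPv4 address. The helper name ls_fill_ipv4_addr is hypothetical, and the
encoding it uses (flags in host byte order, followed by the four address bytes
in network order) is an assumption, since the byte layout of addr[] is not
spelled out above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
	LS_NLA_ADDR_F_SRC      = 1,
	LS_NLA_ADDR_F_DST      = (1 << 1),
	LS_NLA_ADDR_F_HOSTNAME = (1 << 2),
	LS_NLA_ADDR_F_IPV4     = (1 << 3),
	LS_NLA_ADDR_F_IPV6     = (1 << 4)
};

/*
 * Hypothetical helper: write the flags word followed by a 4-byte IPv4 address.
 * Returns the payload length, or 0 if the buffer is too small.
 */
static size_t ls_fill_ipv4_addr(uint8_t *out, size_t cap, int is_dst,
				const uint8_t ip[4])
{
	uint32_t flags = LS_NLA_ADDR_F_IPV4 |
			 (is_dst ? LS_NLA_ADDR_F_DST : LS_NLA_ADDR_F_SRC);

	assert(out != NULL && ip != NULL);
	if (cap < sizeof(flags) + 4)
		return 0;
	memcpy(out, &flags, sizeof(flags));  /* host byte order (assumption) */
	memcpy(out + sizeof(flags), ip, 4);  /* network order bytes (assumption) */
	return sizeof(flags) + 4;
}
```

The returned length is what would go into the attribute as payload size; a
RESOLVE request would typically carry one such attribute for the source and
one for the destination.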
(6) Pathrecord attribute
This attribute can be included in both the RESOLVE request and response.
enum {
LS_NLA_PATH_F_GMP = 1,
LS_NLA_PATH_F_PRIMARY = (1<<1),
LS_NLA_PATH_F_ALTERNATE = (1<<2),
LS_NLA_PATH_F_OUTBOUND = (1<<3),
LS_NLA_PATH_F_INBOUND = (1<<4),
LS_NLA_PATH_F_INBOUND_REVERSE = (1<<5),
LS_NLA_PATH_F_BIDIRECTIONAL = LS_NLA_PATH_F_OUTBOUND |
LS_NLA_PATH_F_INBOUND_REVERSE,
LS_NLA_PATH_F_USER = (1<<6)
};
struct rdma_nla_ls_path_rec {
__u32 flags;
__u32 path_rec[0];
};
The format of the pathrecord can be indicated by the flags and the data is
contained in path_rec[].
For example, when LS_NLA_PATH_F_USER is set, the format is struct
ib_user_path_rec.
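A small sketch of how a receiver might test these flags is shown below. The
two helper functions are hypothetical, and the enum here expresses
BIDIRECTIONAL in terms of the LS_NLA_PATH_F_* names (our assumption) so that
the example is self-contained.

```c
#include <assert.h>
#include <stdint.h>

enum {
	LS_NLA_PATH_F_GMP             = 1,
	LS_NLA_PATH_F_PRIMARY         = (1 << 1),
	LS_NLA_PATH_F_ALTERNATE       = (1 << 2),
	LS_NLA_PATH_F_OUTBOUND        = (1 << 3),
	LS_NLA_PATH_F_INBOUND         = (1 << 4),
	LS_NLA_PATH_F_INBOUND_REVERSE = (1 << 5),
	LS_NLA_PATH_F_BIDIRECTIONAL   = LS_NLA_PATH_F_OUTBOUND |
					LS_NLA_PATH_F_INBOUND_REVERSE,
	LS_NLA_PATH_F_USER            = (1 << 6)
};

/* Sanity check: every single-bit flag is distinct from the USER bit. */
static_assert(LS_NLA_PATH_F_USER == 0x40 &&
	      LS_NLA_PATH_F_BIDIRECTIONAL == 0x28,
	      "unexpected path flag values");

/* Hypothetical helper: true only if both direction bits are present. */
static int ls_path_is_bidirectional(uint32_t flags)
{
	return (flags & LS_NLA_PATH_F_BIDIRECTIONAL) ==
	       LS_NLA_PATH_F_BIDIRECTIONAL;
}

/* Hypothetical helper: true if path_rec[] holds a struct ib_user_path_rec. */
static int ls_path_is_user_format(uint32_t flags)
{
	return (flags & LS_NLA_PATH_F_USER) != 0;
}
```

Because BIDIRECTIONAL is a two-bit mask, it has to be compared against the
full mask; a plain non-zero test would also match a path that is only
OUTBOUND.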
V. Summary
This design is flexible and extensible, and can readily be enhanced to address
other kernel query points. It uses the existing netlink message header and
attribute interface, and a single message can contain multiple attribute
records.
Changes since v1:
-- Completely revised the design to use netlink header and attribute interface.