On 09/17/2017 05:32 AM, Edan David wrote: > I removed the memset from virNetDevPutExtraHeader and tested on a card > supporting switchdev. > It looks fine to me :) > Great review guys, thanks! >
Thanks for checking... This is now pushed. Congrats on your first libvirt patch, John > > -----Original Message----- > From: John Ferlan [mailto:[email protected]] > Sent: Saturday, September 16, 2017 2:04 PM > To: Laine Stump <[email protected]>; [email protected] > Cc: Edan David <[email protected]> > Subject: Re: [libvirt] [PATCHv2] nodedev: add switchdev to NIC capabilities > > > > On 09/14/2017 11:23 AM, Laine Stump wrote: >> (Almost all of my comments result in "ok, no action needed". There are >> just three items I would like to see changed (2 trivial, 1 also small >> but Edan or John may think it prudent to re-test with the change >> before >> pushing) - I marked those comments with [**]. >> >> On 08/21/2017 05:19 AM, Edan David wrote: >>> Adding functionality to libvirt that will allow querying the >>> interface for the availability of switchdev Offloading NIC capabilities. >>> The switchdev mode was introduced in kernel 4.8, the iproute2-devlink >>> command to retrieve the swtichdev NIC feature, Command example: >>> devlink dev eswitch show pci/0000:03:00.0 This feature is needed for >>> Openstack so we can do a scheduling decision if the NIC is in >>> Hardware Offload (switchdev) or regular SR-IOV (legacy) mode. >>> And select the appropriate hypervisors with the requested capability see >>> [1]. >>> >>> [1] - >>> https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsp >>> ecs.openstack.org%2Fopenstack%2Fnova-specs%2Fspecs%2Fpike%2Fapproved% >>> 2Fenable-sriov-nic-features.html&data=02%7C01%7Cedand%40mellanox.com% >>> 7C1e30890a5c7a4f9f045f08d4fcf2b479%7Ca652971c7d2e4d9ba6a4d149256f461b >>> %7C0%7C0%7C636411566714987559&sdata=Ri340H6MLap3HpOMDrb%2FFk6D9agQjEh >>> C9cvR7%2FbV1Ls%3D&reserved=0 >>> --- >>> configure.ac | 13 ++ >>> docs/formatnode.html.in | 1 + >>> src/util/virnetdev.c | 187 >>> +++++++++++++++++++++- >>> src/util/virnetdev.h | 1 + >>> tests/nodedevschemadata/net_00_13_02_b9_f9_d3.xml | 1 + >>> tests/nodedevschemadata/net_00_15_58_2f_e9_55.xml | 1 + >>> 6 files changed, 203 insertions(+), 1 deletion(-) >>> >>> diff --git a/configure.ac b/configure.ac index b12b7fa..c089798 >>> 100644 >>> --- a/configure.ac >>> +++ b/configure.ac >>> @@ -627,6 +627,19 @@ if test "$with_linux" = "yes"; then >>> AC_CHECK_HEADERS([linux/btrfs.h]) fi >>> >>> +dnl >>> +dnl check for kernel headers required by devlink dnl if test >>> +"$with_linux" = "yes"; then >>> + AC_CHECK_HEADERS([linux/devlink.h]) >>> + AC_CHECK_DECLS([DEVLINK_GENL_VERSION, DEVLINK_GENL_NAME, >>> +DEVLINK_ATTR_MAX, DEVLINK_CMD_ESWITCH_GET, DEVLINK_ATTR_BUS_NAME, >>> +DEVLINK_ATTR_DEV_NAME, DEVLINK_ATTR_ESWITCH_MODE, >>> +DEVLINK_ESWITCH_MODE_SWITCHDEV], >> >> That's very ..... thorough, and potentially misleading if someone ever >> wanted to use devlink to check for something other than switchdev (e.g. >> [mythical feature X]) on some system that didn't have the proper stuff >> defined for switchdev, but did have it for [other mythical feature X]. >> But that's all just hypothetical, so this is fine with me. >> >> >>> + [AC_DEFINE([HAVE_DECL_DEVLINK], >>> + [1], >>> + [whether devlink declarations is >>> + available])], >> >> [**] >> s/is/are/ >> > > Yep - altered... > >>> + [], >>> + [[#include <linux/devlink.h>]]) fi >>> + >>> dnl Allow perl/python overrides >>> AC_PATH_PROGS([PYTHON], [python2 python]) if test -z "$PYTHON"; >>> then diff --git a/docs/formatnode.html.in b/docs/formatnode.html.in >>> index 4d935b5..29244a8 100644 >>> --- a/docs/formatnode.html.in >>> +++ b/docs/formatnode.html.in >>> @@ -227,6 +227,7 @@ >>> <dt><code>rxhash</code></dt><dd>receive-hashing</dd> >>> >>> <dt><code>rdma</code></dt><dd>remote-direct-memory-access</dd> >>> >>> <dt><code>txudptnl</code></dt><dd>tx-udp-tunnel-segmentation</dd> >>> + >>> + <dt><code>switchdev</code></dt><dd>kernel-forward-plane-offload</dd >>> + > >> >> pretty odd abbreviation. But it is what it is :-) >>>> </dl> >>> </dd> >>> <dt><code>capability</code></dt> diff --git >>> a/src/util/virnetdev.c b/src/util/virnetdev.c index 51a6e42..fc7c961 >>> 100644 >>> --- a/src/util/virnetdev.c >>> +++ b/src/util/virnetdev.c >>> @@ -59,6 +59,10 @@ >>> # include <net/if_dl.h> >>> #endif >>> >>> +#if HAVE_DECL_DEVLINK >>> +# include <linux/devlink.h> >>> +#endif >>> + >>> #ifndef IFNAMSIZ >>> # define IFNAMSIZ 16 >>> #endif >>> @@ -2481,7 +2485,8 @@ VIR_ENUM_IMPL(virNetDevFeature, >>> "ntuple", >>> "rxhash", >>> "rdma", >>> - "txudptnl") >>> + "txudptnl", >>> + "switchdev") >>> >>> #ifdef __linux__ >>> int >>> @@ -2936,6 +2941,7 @@ int virNetDevGetRxFilter(const char *ifname, >>> return ret; >>> } >>> >>> + >> >> [**] >> Added a spurious extra line. >> > > Removed. > >>> #if defined(SIOCETHTOOL) && defined(HAVE_STRUCT_IFREQ) >>> >>> /** >>> @@ -3115,6 +3121,182 @@ virNetDevGetEthtoolFeatures(virBitmapPtr >>> bitmap, } >>> >>> >>> +#if HAVE_DECL_DEVLINK >>> +/** >>> + * virNetDevPutExtraHeader >>> + * reserve and prepare room for an extra header >>> + * This function sets to zero the room that is required to put the >>> +extra >>> + * header after the initial Netlink header. This function also >>> +increases >>> + * the nlmsg_len field. >>> + * >>> + * @nlh: pointer to Netlink header >>> + * @size: size of the extra header that we want to put >>> + * >>> + * Returns pointer to the start of the extended header */ static >>> +void * virNetDevPutExtraHeader(struct nlmsghdr *nlh, >>> + size_t size) { >>> + char *ptr = (char *)nlh + nlh->nlmsg_len; >>> + size_t len = NLMSG_ALIGN(size); >>> + nlh->nlmsg_len += len; >>> + memset(ptr, 0, len); >> >> [**] >> I'm fairly confident this memset() is unnecessary. The buffer the >> "extra header" is being added to is allocated with >> nlmsg_alloc_simple(); I looked at the source for nlmsg_alloc_simple(), >> and it ends up calling calloc(), which will initialize everything to 0 >> anyway (and any important fields are set to some other value >> immediately after return from virNetDevPutExtraHeader()). The reason >> the extra memset() bothers me is that 1) we don't set memory allocated >> for netlink messages to 0 anywhere else in our netlink-related code so >> it's inconsistent, and 2) having an extra memset(0) when it's not >> needed could end up becoming the start of a "cargo cult" practice of >> calling memset(0) all over the place just because we're paranoid - I'd >> rather avoid extra initialization if it's redundant/superfluous. >> >> I of course wouldn't want this to be pushed with such a change without >> having it tested. I did make this change locally and ran it under gdb >> with a breakpoint on virNetDevPutExtraHeader(), and saw that the >> entire >> 4096 bytes of *nl_msg is initialized to 0; it's up to you whether or >> not you think it's necessary to re-test on a system that has a card >> that support switchdev. >> >> (alternately, John could push it as is, and I could send a separate >> patch removing the memset(), which could be ACKed by Edan after he >> tries it (or NACKed in the event that I'm completely wrong about it >> :-) >> > > I saw this on my initial review and thought perhaps it was superfluous. > Tracing through various code to find the original calloc is impressive! > I didn't originally do anything with it because I didn't think it would hurt, > but I agree with you. > > So I've removed it from my local copy of the patch that I'll use to push on > Monday. Perhaps by then Edan will see these messages and be able to indicate > one way or another whether it should be kept from a "retest" > POV. Besides, if anything breaks in one of the many CI builds, I don't want > to spent weekend hours trying to clean up after it. > > > John > >>> + return ptr; >>> +} >> >> I *was* going to say this doesn't need to be inside HAVE_DECL_DEVLINK, >> but since it's defined as static, and only used by a function that is >> itself inside HAVE_DECL_DEVLINK, this placement is correct. >> >> >>> + >>> + >>> +/** >>> + * virNetDevGetFamilyId: >>> + * This function supplies the devlink family id >>> + * >>> + * @family_name: the name of the family to query >>> + * >>> + * Returns family id or 0 on failure. >>> + */ >>> +static uint32_t >>> +virNetDevGetFamilyId(const char *family_name) { >>> + struct nl_msg *nl_msg = NULL; >>> + struct nlmsghdr *resp = NULL; >>> + struct genlmsghdr* gmsgh = NULL; >>> + struct nlattr *tb[CTRL_ATTR_MAX + 1] = {NULL, }; >>> + unsigned int recvbuflen; >>> + uint32_t family_id = 0; >>> + >>> + if (!(nl_msg = nlmsg_alloc_simple(GENL_ID_CTRL, >>> + NLM_F_REQUEST | NLM_F_ACK))) { >>> + virReportOOMError(); >>> + goto cleanup; >>> + } >>> + >>> + if (!(gmsgh = virNetDevPutExtraHeader(nlmsg_hdr(nl_msg), sizeof(struct >>> genlmsghdr)))) >>> + goto cleanup; >>> + >>> + gmsgh->cmd = CTRL_CMD_GETFAMILY; >>> + gmsgh->version = DEVLINK_GENL_VERSION; >>> + >>> + if (nla_put_string(nl_msg, CTRL_ATTR_FAMILY_NAME, family_name) < 0) { >>> + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", >>> + _("allocated netlink buffer is too small")); >>> + goto cleanup; >>> + } >>> + >>> + if (virNetlinkCommand(nl_msg, &resp, &recvbuflen, 0, 0, >>> NETLINK_GENERIC, 0) < 0) >>> + goto cleanup; >>> + >>> + if (nlmsg_parse(resp, sizeof(struct nlmsghdr), tb, CTRL_CMD_MAX, NULL) >>> < 0) { >>> + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", >>> + _("malformed netlink response message")); >>> + goto cleanup; >>> + } >>> + >>> + if (tb[CTRL_ATTR_FAMILY_ID] == NULL) >>> + goto cleanup; >>> + >>> + family_id = *(uint32_t *)RTA_DATA(tb[CTRL_ATTR_FAMILY_ID]); >>> + >>> + cleanup: >>> + nlmsg_free(nl_msg); >>> + VIR_FREE(resp); >>> + return family_id; >>> +} >>> + >>> + >>> +/** >>> + * virNetDevSwitchdevFeature >>> + * This function checks for the availability of Switchdev feature >>> + * and add it to bitmap >>> + * >>> + * @ifname: name of the interface >>> + * @out: add Switchdev feature if exist to bitmap >>> + * >>> + * Returns 0 on success, -1 on failure. >>> + */ >>> +static int >>> +virNetDevSwitchdevFeature(const char *ifname, >>> + virBitmapPtr *out) { >>> + struct nl_msg *nl_msg = NULL; >>> + struct nlmsghdr *resp = NULL; >>> + unsigned int recvbuflen; >>> + struct nlattr *tb[DEVLINK_ATTR_MAX + 1] = {NULL, }; >>> + virPCIDevicePtr pci_device_ptr = NULL; >>> + struct genlmsghdr* gmsgh = NULL; >>> + const char *pci_name; >>> + char *pfname = NULL; >>> + int is_vf = -1; >>> + int ret = -1; >>> + uint32_t family_id; >>> + >>> + if ((family_id = virNetDevGetFamilyId(DEVLINK_GENL_NAME)) <= 0) >>> + return ret; >>> + >>> + if ((is_vf = virNetDevIsVirtualFunction(ifname)) < 0) >>> + return ret; >>> + >>> + if (is_vf == 1 && virNetDevGetPhysicalFunction(ifname, &pfname) < 0) >>> + goto cleanup; >>> + >>> + if (!(nl_msg = nlmsg_alloc_simple(family_id, >>> + NLM_F_REQUEST | NLM_F_ACK))) { >>> + virReportOOMError(); >>> + goto cleanup; >>> + } >>> + >>> + if (!(gmsgh = virNetDevPutExtraHeader(nlmsg_hdr(nl_msg), sizeof(struct >>> genlmsghdr)))) >>> + goto cleanup; >>> + >>> + gmsgh->cmd = DEVLINK_CMD_ESWITCH_GET; >>> + gmsgh->version = DEVLINK_GENL_VERSION; >>> + >>> + pci_device_ptr = pfname ? virNetDevGetPCIDevice(pfname) : >>> + virNetDevGetPCIDevice(ifname); >>> + if (pci_device_ptr == NULL) >>> + goto cleanup; >>> + >>> + pci_name = virPCIDeviceGetName(pci_device_ptr); >>> + >>> + if (nla_put(nl_msg, DEVLINK_ATTR_BUS_NAME, strlen("pci")+1, "pci") < 0 >>> || >>> + nla_put(nl_msg, DEVLINK_ATTR_DEV_NAME, strlen(pci_name)+1, >>> pci_name) < 0) { >>> + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", >>> + _("allocated netlink buffer is too small")); >>> + goto cleanup; >>> + } >>> + >>> + if (virNetlinkCommand(nl_msg, &resp, &recvbuflen, 0, 0, >>> NETLINK_GENERIC, 0) < 0) >>> + goto cleanup; >>> + >>> + if (nlmsg_parse(resp, sizeof(struct genlmsghdr), tb, DEVLINK_ATTR_MAX, >>> NULL) < 0) { >>> + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", >>> + _("malformed netlink response message")); >>> + goto cleanup; >>> + } >>> + >>> + if (tb[DEVLINK_ATTR_ESWITCH_MODE] && >>> + *(int *)RTA_DATA(tb[DEVLINK_ATTR_ESWITCH_MODE]) == >>> DEVLINK_ESWITCH_MODE_SWITCHDEV) { >>> + ignore_value(virBitmapSetBit(*out, VIR_NET_DEV_FEAT_SWITCHDEV)); >>> + } >> >> >> I won't pretend that I know all about these particular netlink >> messages/attributes. I can say that the code all looks similar to our >> other netlink code, and when I run it on my host that doesn't have >> cards supporting switchdev, it does execute this code and reports no >> switchdev capability in the nodedev-dumpxml output of my NICs; since >> I'm sure Edan has done the same test on a system that *does* have >> switchdev-capable NICs, I'd say that's adequate testing. >> >> Based on that I give my ACK (with at least the two trivial [**] items >> changed, preferably also memset() removed), or if you prefer: >> >> Reviewed-by: Laine Stump <[email protected]> >> >> >> >>> + >>> + ret = 0; >>> + >>> + cleanup: >>> + nlmsg_free(nl_msg); >>> + virPCIDeviceFree(pci_device_ptr); >>> + VIR_FREE(resp); >>> + VIR_FREE(pfname); >>> + return ret; >>> +} >>> +#else >>> +static int >>> +virNetDevSwitchdevFeature(const char *ifname ATTRIBUTE_UNUSED, >>> + virBitmapPtr *out ATTRIBUTE_UNUSED) { >>> + return 0; >>> +} >>> +#endif >>> + >>> + >>> # if HAVE_DECL_ETHTOOL_GFEATURES >>> /** >>> * virNetDevGFeatureAvailable >>> @@ -3315,6 +3497,9 @@ virNetDevGetFeatures(const char *ifname, >>> if (virNetDevRDMAFeature(ifname, out) < 0) >>> goto cleanup; >>> >>> + if (virNetDevSwitchdevFeature(ifname, out) < 0) >>> + goto cleanup; >>> + >>> ret = 0; >>> cleanup: >>> VIR_FORCE_CLOSE(fd); >>> diff --git a/src/util/virnetdev.h b/src/util/virnetdev.h index >>> 9205c0e..71eaf45 100644 >>> --- a/src/util/virnetdev.h >>> +++ b/src/util/virnetdev.h >>> @@ -112,6 +112,7 @@ typedef enum { >>> VIR_NET_DEV_FEAT_RXHASH, >>> VIR_NET_DEV_FEAT_RDMA, >>> VIR_NET_DEV_FEAT_TXUDPTNL, >>> + VIR_NET_DEV_FEAT_SWITCHDEV, >>> VIR_NET_DEV_FEAT_LAST >>> } virNetDevFeature; >>> >>> diff --git a/tests/nodedevschemadata/net_00_13_02_b9_f9_d3.xml >>> b/tests/nodedevschemadata/net_00_13_02_b9_f9_d3.xml >>> index d4c96e8..88252e6 100644 >>> --- a/tests/nodedevschemadata/net_00_13_02_b9_f9_d3.xml >>> +++ b/tests/nodedevschemadata/net_00_13_02_b9_f9_d3.xml >>> @@ -15,6 +15,7 @@ >>> <feature name='rxhash'/> >>> <feature name='rdma'/> >>> <feature name='txudptnl'/> >>> + <feature name='switchdev'/> >>> <capability type='80211'/> >>> </capability> >>> </device> >>> diff --git a/tests/nodedevschemadata/net_00_15_58_2f_e9_55.xml >>> b/tests/nodedevschemadata/net_00_15_58_2f_e9_55.xml >>> index 71bf90e..f77dfcc 100644 >>> --- a/tests/nodedevschemadata/net_00_15_58_2f_e9_55.xml >>> +++ b/tests/nodedevschemadata/net_00_15_58_2f_e9_55.xml >>> @@ -15,6 +15,7 @@ >>> <feature name='rxhash'/> >>> <feature name='rdma'/> >>> <feature name='txudptnl'/> >>> + <feature name='switchdev'/> >>> <capability type='80203'/> >>> </capability> >>> </device> >> >> -- libvir-list mailing list [email protected] https://www.redhat.com/mailman/listinfo/libvir-list
