> >"Kavanagh, Mark B" <mark.b.kavan...@intel.com> writes: > >>> >>>"Kavanagh, Mark B" <mark.b.kavan...@intel.com> writes: >>>> Further clarifications below Aaron - please let me know if you have >>>> any additional comments. >>>> >>>> Thanks, >>>> Mark >>> >>>Thanks again for this work, Mark! >>> >>>>> >>>>>One addendum below, Aaron. >>>>> >>>>>> >>>>>>>Hi Mark, >>>>>>> >>>>>>>Thanks for the patch! I've not done a very thorough review of this, but >>>>>>>my first-blush comments are inline. >>>>>> >>>>>>Thanks for the review Aaron! >>>>>> >>>>>>Any additional comments are welcome - my initial responses are >> inline, below. >>>>>> >>>>>>Thanks, >>>>>>Mark >>>>>> >>>>>>> >>>>>>>Mark Kavanagh <mark.b.kavan...@intel.com> writes: >>>>>>>> Add support for Jumbo Frames to DPDK-enabled port types, >>>>>>>> using single-segment-mbufs. >>>>>>>> >>>>>>>> Using this approach, the amount of memory allocated for each mbuf >>>>>>>> to store frame data is increased to a value greater than 1518B >>>>>>>> (typical Ethernet maximum frame length). The increased space >>>>>>>> available in the mbuf means that an entire Jumbo Frame can be carried >>>>>>>> in a single mbuf, as opposed to partitioning it across multiple mbuf >>>>>>>> segments. >>>>>>>> >>>>>>>> The amount of space allocated to each mbuf to hold frame data is >>>>>>>> defined dynamically by the user when adding a DPDK port to a bridge. >>>>>>>> If an MTU value is not supplied, or the user-supplied value is invalid, >>>>>>>> the MTU for the port defaults to standard Ethernet MTU (i.e. 1500B). >>>>>>>> >>>>>>>> Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com> >>>>>>>> --- >>>>>>> >>>>>>>This is a new feature which has user-visible impact, so should be >>>>>>>announced in NEWS (plus, it's great news). >>>>>> >>>>>>No problem - I'll add an entry. >>>>>> >>>>>>> >>>>>>>> INSTALL.DPDK.md | 59 ++++++++- >>>>>>>> lib/netdev-dpdk.c | 356 >>>> +++++++++++++++++++++++++++++++++++++++++------------- >>>>>>>> 2 files changed, 328 insertions(+), 87 deletions(-) >>>>>>>> >>>>>>>> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md >>>>>>>> index 96b686c..2f23e27 100644 >>>>>>>> --- a/INSTALL.DPDK.md >>>>>>>> +++ b/INSTALL.DPDK.md >>>>>>>> @@ -859,10 +859,61 @@ by adding the following string: >>>>>>>> to <interface> sections of all network devices used by >> DPDK. Parameter 'N' >>>>>>>> determines how many queues can be used by the guest. >>>>>>>> >>>>>>>> + >>>>>>>> +Jumbo Frames >>>>>>>> +------------ >>>>>>>> + >>>>>>>> +Support for Jumbo Frames may be enabled at run-time for DPDK-type >>>>>>>> ports. >>>>>>>> + >>>>>>>> +To avail of Jumbo Frame support, add the '--dpdk-mtu' option to >>>> the ovs-vsctl >>>>>>>> +'add-port' command-line, along with the required MTU for the port. >>>>>>>> +e.g. >>>>>>> >>>>>>>This text and the example do not match. I think '--dpdk-mtu' is not >>>>>>>valid. >>>>>> >>>>>>Good catch, thanks. >>>>>> >>>>>>> >>>>>>>> + >>>>>>>> + ``` >>>>>>>> + ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk >>>> options:dpdk-mtu=9000 >>>>>>>> + ``` >>>>>>>> + >>>>>>>> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf >> segments are >>>>>>>> +increased, such that a full Jumbo Frame may be accommodated >>>> inside a single >>>>>>>> +mbuf segment. Once set, the MTU for a DPDK port is immutable. >>>>>>> >>>>>>>Why no support? DPDK supports changing the mtu. >>>>>> >>>>>>I guess my rationale here is that an MTU change can't be triggered >>>> via OVS command-line, nor >>>>>>can it be triggered programmatically via DPDK (apart from an >> explicit call to >>>>>>rte_eth_dev_set_mtu). >>>>>>So, while technically it's possibly, from a user's point of view, >>>> there's no way to >>>>>configure >>>>>>it, outside of modifying the code directly. If I've missed something >>>> here, please let me >>>>>>know. >>>>>> >>>>>>> >>>>>>>> +Jumbo frame support has been validated against 13312B frames, using >>>>>>>> the >>>>>>>> +DPDK `igb_uio` driver, but larger frames and other DPDK NIC drivers >>>>>>>> may >>>>>>>> +theoretically be supported. Supported port types excludes >>>> vHost-Cuse ports, as >>>>>>>> +this feature is pending deprecation. >>>>>>>> + >>>>>>>> + >>>>>>>> +vHost Ports and Jumbo Frames >>>>>>>> +---------------------------- >>>>>>>> +Jumbo frame support is available for DPDK vHost-User ports >>>> only. Some additional >>>>>>>> +configuration is needed to take advantage of this feature: >>>>>>>> + >>>>>>>> + 1. `mergeable buffers` must be enabled for vHost ports, as >>>> demonstrated in >>>>>>>> + the QEMU command line snippet below: >>>>>>>> + >>>>>>>> + ``` >>>>>>>> + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \' >>>>>>>> + '-device >>>> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on' >>>>>>>> + ``` >>>>>>>> + >>>>>>>> + 2. Where virtio devices are bound to the Linux kernel driver >> in a guest >>>>>>>> + environment (i.e. interfaces are not bound to an in-guest DPDK >>>> driver), the >>>>>>>> + MTU of those logical network interfaces must also be >> increased. This >>>>>>>> + avoids segmentation of Jumbo Frames in the guest. Note that >>>> 'MTU' refers >>>>>>>> + to the length of the IP packet only, and not that of the entire >>>> frame. >>>>>>>> + >>>>>>>> + e.g. To calculate the exact MTU of a standard IPv4 frame, >>>> subtract the L2 >>>>>>>> + header and CRC lengths (i.e. 18B) from the max supported >> frame size. >>>>>>>> + So, to set the MTU for a 13312B Jumbo Frame: >>>>>>>> + >>>>>>>> + ``` >>>>>>>> + ifconfig eth1 mtu 13294 >>>>>>>> + ``` >>>>>>>> + >>>>>>>> + >>>>>>>> Restrictions: >>>>>>>> ------------- >>>>>>>> >>>>>>>> - - Work with 1500 MTU, needs few changes in DPDK lib to fix this >>>>>>>> issue. >>>>>>>> - Currently DPDK port does not make use any offload functionality. >>>>>>>> - DPDK-vHost support works with 1G huge pages. >>>>>>>> >>>>>>>> @@ -903,6 +954,12 @@ Restrictions: >>>>>>>> the next release of DPDK (which includes the above patch) is >>>> available and >>>>>>>> integrated into OVS. >>>>>>>> >>>>>>>> + Jumbo Frames: >>>>>>>> + - `virtio-pmd`: DPDK apps in the guest do not exit >>>> gracefully. The source of >>>>>>>> + this issue is currently being investigated. >>>>>>>> + - vHost-Cuse: Jumbo Frame support is not available for vHost >> Cuse ports. >>>>>>>> + >>>>>>>> + >>>>>>>> Bug Reporting: >>>>>>>> -------------- >>>>>>>> >>>>>>>> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >>>>>>>> index de7e488..5f23e28 100644 >>>>>>>> --- a/lib/netdev-dpdk.c >>>>>>>> +++ b/lib/netdev-dpdk.c >>>>>>>> @@ -62,20 +62,25 @@ static struct vlog_rate_limit rl = >>>> VLOG_RATE_LIMIT_INIT(5, 20); >>>>>>>> #define OVS_CACHE_LINE_SIZE CACHE_LINE_SIZE >>>>>>>> #define OVS_VPORT_DPDK "ovs_dpdk" >>>>>>>> >>>>>>>> +#define NETDEV_DPDK_JUMBO_FRAME_ENABLED 1 >>>>>>>> +#define NETDEV_DPDK_DEFAULT_RX_BUFSIZE 1024 >>>>>>>> + >>>>>>>> /* >>>>>>>> * need to reserve tons of extra space in the mbufs so we can align >>>>>>>> the >>>>>>>> * DMA addresses to 4KB. >>>>>>>> * The minimum mbuf size is limited to avoid scatter behaviour >> and drop in >>>>>>>> * performance for standard Ethernet MTU. >>>>>>>> */ >>>>>>>> -#define MTU_TO_MAX_LEN(mtu) ((mtu) + ETHER_HDR_LEN + ETHER_CRC_LEN) >>>>>>>> -#define MBUF_SIZE_MTU(mtu) (MTU_TO_MAX_LEN(mtu) \ >>>>>>>> - + sizeof(struct dp_packet) \ >>>>>>>> - + RTE_PKTMBUF_HEADROOM) >>>>>>>> -#define MBUF_SIZE_DRIVER (2048 \ >>>>>>>> - + sizeof (struct rte_mbuf) \ >>>>>>>> - + RTE_PKTMBUF_HEADROOM) >>>>>>>> -#define MBUF_SIZE(mtu) MAX(MBUF_SIZE_MTU(mtu), MBUF_SIZE_DRIVER) >>>>>>>> +#define MTU_TO_FRAME_LEN(mtu) ((mtu) + ETHER_HDR_LEN + >>>> ETHER_CRC_LEN) >>>>>>>> +#define FRAME_LEN_TO_MTU(frame_len) ((frame_len)- ETHER_HDR_LEN >>>> - ETHER_CRC_LEN) >>>>>>>> +#define MBUF_SEGMENT_SIZE(mtu) ( MTU_TO_FRAME_LEN(mtu) \ >>>>>>>> + + sizeof(struct dp_packet) \ >>>>>>>> + + RTE_PKTMBUF_HEADROOM) >>>>>>>> + >>>>>>>> +/* This value should be specified as a multiple of the DPDK NIC >>>>>>>> driver's >>>>>>>> + * 'min_rx_bufsize' attribute (currently 1024B for 'igb_uio'). >>>>>>>> + */ >>>>>>>> +#define NETDEV_DPDK_MAX_FRAME_LEN 13312 >>>>>>>> >>>>>>>> /* Max and min number of packets in the mempool. OVS tries to >> allocate a >>>>>>>> * mempool with MAX_NB_MBUF: if this fails (because the system >>>> doesn't have >>>>>>>> @@ -86,6 +91,8 @@ static struct vlog_rate_limit rl = >>>> VLOG_RATE_LIMIT_INIT(5, 20); >>>>>>>> #define MIN_NB_MBUF (4096 * 4) >>>>>>>> #define MP_CACHE_SZ RTE_MEMPOOL_CACHE_MAX_SIZE >>>>>>>> >>>>>>>> +#define DPDK_VLAN_TAG_LEN 4 >>>>>>>> + >>>>>>>> /* MAX_NB_MBUF can be divided by 2 many times, until MIN_NB_MBUF */ >>>>>>>> BUILD_ASSERT_DECL(MAX_NB_MBUF % >>>> ROUND_DOWN_POW2(MAX_NB_MBUF/MIN_NB_MBUF) == 0); >>>>>>>> >>>>>>>> @@ -114,7 +121,6 @@ static const struct rte_eth_conf port_conf = { >>>>>>>> .header_split = 0, /* Header Split disabled */ >>>>>>>> .hw_ip_checksum = 0, /* IP checksum offload disabled */ >>>>>>>> .hw_vlan_filter = 0, /* VLAN filtering disabled */ >>>>>>>> - .jumbo_frame = 0, /* Jumbo Frame Support disabled */ >>>>>>>> .hw_strip_crc = 0, >>>>>>>> }, >>>>>>>> .rx_adv_conf = { >>>>>>>> @@ -254,6 +260,39 @@ is_dpdk_class(const struct netdev_class *class) >>>>>>>> return class->construct == netdev_dpdk_construct; >>>>>>>> } >>>>>>>> >>>>>>>> +/* DPDK NIC drivers allocate RX buffers at a particular granularity >>>>>>>> + * (specified by rte_eth_dev_info.min_rx_bufsize - currently 1K >>>> for igb_uio). >>>>>>>> + * If 'frame_len' is not a multiple of this value, insufficient >>>> buffers are >>>>>>>> + * allocated to accomodate the packet in its >>>> entirety. Furthermore, the igb_uio >>>>>>>> + * driver needs to ensure that there is also sufficient space in >>>> the Rx buffer >>>>>>>> + * to accommodate two VLAN tags (for QinQ frames). If the RX >> buffer is too >>>>>>>> + * small, then the driver enables scatter RX behaviour, which reduces >>>>>>>> + * performance. To prevent this, use a buffer size that is closest to >>>>>>>> + * 'frame_len', but which satisfies the aforementioned criteria. >>>>>>>> + */ >>>>>>>> +static uint32_t >>>>>>>> +dpdk_buf_size(struct netdev_dpdk *netdev, int frame_len) >>>>>>>> +{ >>>>>>>> + struct rte_eth_dev_info info; >>>>>>>> + uint32_t buf_size; >>>>>>>> + uint32_t len = frame_len + (2 * DPDK_VLAN_TAG_LEN); >>>>>>>> + >>>>>>>> + if(netdev->type == DPDK_DEV_ETH) { >>>>>>>> + rte_eth_dev_info_get(netdev->port_id, &info); >>>>>>>> + buf_size = (info.min_rx_bufsize == 0) ? >>>>>>>> + NETDEV_DPDK_DEFAULT_RX_BUFSIZE : >>>>>>>> + info.min_rx_bufsize; >>>>>>> >>>>>>>Why not use the rte_eth_dev_get_mtu call to get the port configured mtu? >>>>>> >>>>>>I'm not sure if I follow - the MTU isn't the issue here, but rather >>>> the buffer size used >>>>>>when creating the rte_mempool, which must be a multiple of the >>>> driver's min_rx_bufsize. >>>>>> >>>>>>> >>>>>>>> + } else { >>>>>>>> + buf_size = NETDEV_DPDK_DEFAULT_RX_BUFSIZE; >>>>>>>> + } >>>>>>>> + >>>>>>>> + if(len % buf_size) { >>>>>>>> + len = buf_size * ((len/buf_size) + 1); >>>>>>>> + } >>>>>>>> + >>>>>>>> + return len; >>>>>>>> +} >>>>>>>> + >>>>>>>> /* XXX: use dpdk malloc for entire OVS. in fact huge page should be >>>>>>>> used >>>>>>>> * for all other segments data, bss and text. */ >>>>>>>> >>>>>>>> @@ -280,31 +319,70 @@ free_dpdk_buf(struct dp_packet *p) >>>>>>>> } >>>>>>>> >>>>>>>> static void >>>>>>>> -__rte_pktmbuf_init(struct rte_mempool *mp, >>>>>>>> - void *opaque_arg OVS_UNUSED, >>>>>>>> - void *_m, >>>>>>>> - unsigned i OVS_UNUSED) >>>>>>>> +ovs_rte_pktmbuf_pool_init(struct rte_mempool *mp, void *opaque_arg) >>>>>>>> { >>>>>>>> - struct rte_mbuf *m = _m; >>>>>>>> - uint32_t buf_len = mp->elt_size - sizeof(struct dp_packet); >>>>>>>> + struct rte_pktmbuf_pool_private *user_mbp_priv, *mbp_priv; >>>>>>>> + struct rte_pktmbuf_pool_private default_mbp_priv; >>>>>>>> + uint16_t roomsz; >>>>>>>> >>>>>>>> RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet)); >>>>>>>> >>>>>>>> - memset(m, 0, mp->elt_size); >>>>>>>> + /* if no structure is provided, assume no mbuf private area */ >>>>>>>> + >>>>>>>> + user_mbp_priv = opaque_arg; >>>>>>>> + if (user_mbp_priv == NULL) { >>>>>>>> + default_mbp_priv.mbuf_priv_size = 0; >>>>>>>> + if (mp->elt_size > sizeof(struct dp_packet)) { >>>>>>>> + roomsz = mp->elt_size - sizeof(struct dp_packet); >>>>>>>> + } else { >>>>>>>> + roomsz = 0; >>>>>>>> + } >>>>>>>> + default_mbp_priv.mbuf_data_room_size = roomsz; >>>>>>>> + user_mbp_priv = &default_mbp_priv; >>>>>>>> + } >>>>>>>> + >>>>>>>> + RTE_MBUF_ASSERT(mp->elt_size >= sizeof(struct dp_packet) + >>>>>>>> + user_mbp_priv->mbuf_data_room_size + >>>>>>>> + user_mbp_priv->mbuf_priv_size); >>>>>>>> + >>>>>>>> + mbp_priv = rte_mempool_get_priv(mp); >>>>>>>> + memcpy(mbp_priv, user_mbp_priv, sizeof(*mbp_priv)); >>>>>>>> +} >>>>>>>> + >>>>>>>> +/* Initialise some fields in the mbuf structure that are not >>>> modified by the >>>>>>>> + * user once created (origin pool, buffer start address, etc.*/ >>>>>>>> +static void >>>>>>>> +__ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>>>>>>> + void *opaque_arg OVS_UNUSED, >>>>>>>> + void *_m, >>>>>>>> + unsigned i OVS_UNUSED) >>>>>>>> +{ >>>>>>>> + struct rte_mbuf *m = _m; >>>>>>>> + uint32_t buf_size, buf_len, priv_size; >>>>>>>> + >>>>>>>> + priv_size = rte_pktmbuf_priv_size(mp); >>>>>>>> + buf_size = sizeof(struct dp_packet) + priv_size; >>>>>>>> + buf_len = rte_pktmbuf_data_room_size(mp); >>>>>>>> >>>>>>>> - /* start of buffer is just after mbuf structure */ >>>>>>>> - m->buf_addr = (char *)m + sizeof(struct dp_packet); >>>>>>>> - m->buf_physaddr = rte_mempool_virt2phy(mp, m) + >>>>>>>> - sizeof(struct dp_packet); >>>>>>>> - m->buf_len = (uint16_t)buf_len; >>>>>>>> + RTE_MBUF_ASSERT(RTE_ALIGN(priv_size, RTE_MBUF_PRIV_ALIGN) == >> priv_size); >>>>>>>> + RTE_MBUF_ASSERT(mp->elt_size >= buf_size); >>>>>>>> + RTE_MBUF_ASSERT(buf_len <= UINT16_MAX); >>>>>>>> >>>>>>>> - /* keep some headroom between start of buffer and data */ >>>>>>>> - m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, m->buf_len); >>>>>>>> + memset(m, 0, mp->elt_size); >>>>>>>> >>>>>>>> - /* init some constant fields */ >>>>>>>> - m->pool = mp; >>>>>>>> - m->nb_segs = 1; >>>>>>>> - m->port = 0xff; >>>>>>>> + /* start of buffer is after dp_packet structure and priv data */ >>>>>>>> + m->priv_size = priv_size; >>>>>>>> + m->buf_addr = (char *)m + buf_size; >>>>>>>> + m->buf_physaddr = rte_mempool_virt2phy(mp, m) + buf_size; >>>>>>>> + m->buf_len = (uint16_t)buf_len; >>>>>>>> + >>>>>>>> + /* keep some headroom between start of buffer and data */ >>>>>>>> + m->data_off = RTE_MIN(RTE_PKTMBUF_HEADROOM, >> (uint16_t)m->buf_len); >>>>>>>> + >>>>>>>> + /* init some constant fields */ >>>>>>>> + m->pool = mp; >>>>>>>> + m->nb_segs = 1; >>>>>>>> + m->port = 0xff; >>>>>>> >>>>>>>Please don't mix tabs and spaces in the file. >>>>>> >>>>>>Apologies - this was a copy/paste from a DPDK file, which uses tabs. >>>>>> >>>>>>> >>>>>>>> } >>>>>>>> >>>>>>>> static void >>>>>>>> @@ -315,7 +393,7 @@ ovs_rte_pktmbuf_init(struct rte_mempool *mp, >>>>>>>> { >>>>>>>> struct rte_mbuf *m = _m; >>>>>>>> >>>>>>>> - __rte_pktmbuf_init(mp, opaque_arg, _m, i); >>>>>>>> + __ovs_rte_pktmbuf_init(mp, opaque_arg, m, i); >>>>>>>> >>>>>>>> dp_packet_init_dpdk((struct dp_packet *) m, m->buf_len); >>>>>>>> } >>>>>>>> @@ -326,6 +404,7 @@ dpdk_mp_get(int socket_id, int mtu) >>>> OVS_REQUIRES(dpdk_mutex) >>>>>>>> struct dpdk_mp *dmp = NULL; >>>>>>>> char mp_name[RTE_MEMPOOL_NAMESIZE]; >>>>>>>> unsigned mp_size; >>>>>>>> + struct rte_pktmbuf_pool_private mbp_priv; >>>>>>>> >>>>>>>> LIST_FOR_EACH (dmp, list_node, &dpdk_mp_list) { >>>>>>>> if (dmp->socket_id == socket_id && dmp->mtu == mtu) { >>>>>>>> @@ -338,6 +417,8 @@ dpdk_mp_get(int socket_id, int mtu) >>>> OVS_REQUIRES(dpdk_mutex) >>>>>>>> dmp->socket_id = socket_id; >>>>>>>> dmp->mtu = mtu; >>>>>>>> dmp->refcount = 1; >>>>>>>> + mbp_priv.mbuf_data_room_size = MTU_TO_FRAME_LEN(mtu) + >>>> RTE_PKTMBUF_HEADROOM; >>>>>>>> + mbp_priv.mbuf_priv_size = 0; >>>>>>>> >>>>>>>> mp_size = MAX_NB_MBUF; >>>>>>>> do { >>>>>>>> @@ -346,10 +427,10 @@ dpdk_mp_get(int socket_id, int mtu) >>>> OVS_REQUIRES(dpdk_mutex) >>>>>>>> return NULL; >>>>>>>> } >>>>>>>> >>>>>>>> - dmp->mp = rte_mempool_create(mp_name, mp_size, MBUF_SIZE(mtu), >>>>>>>> + dmp->mp = rte_mempool_create(mp_name, mp_size, >>>> MBUF_SEGMENT_SIZE(mtu), >>>>>>>> MP_CACHE_SZ, >>>>>>>> sizeof(struct rte_pktmbuf_pool_private), >>>>>>>> - rte_pktmbuf_pool_init, NULL, >>>>>>>> + ovs_rte_pktmbuf_pool_init, &mbp_priv, >>>>>>>> ovs_rte_pktmbuf_init, NULL, >>>>>>>> socket_id, 0); >>>>>>>> } while (!dmp->mp && rte_errno == ENOMEM && (mp_size /= 2) >= >>>> MIN_NB_MBUF); >>>>>>>> @@ -433,6 +514,7 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk >>>> *dev, int n_rxq, int >>>>>>n_txq) >>>>>>>> { >>>>>>>> int diag = 0; >>>>>>>> int i; >>>>>>>> + struct rte_eth_conf conf = port_conf; >>>>>>>> >>>>>>>> /* A device may report more queues than it makes available (this >>>>>>>> has >>>>>>>> * been observed for Intel xl710, which reserves some of them for >>>>>>>> @@ -444,7 +526,12 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk >>>> *dev, int n_rxq, int >>>>>>>n_txq) >>>>>>>> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq); >>>>>>>> } >>>>>>>> >>>>>>>> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, >>>> &port_conf); >>>>>>>> + if(OVS_UNLIKELY(dev->mtu > ETHER_MTU)) { >>>>>>>> + conf.rxmode.jumbo_frame = NETDEV_DPDK_JUMBO_FRAME_ENABLED; >>>>>>>> + conf.rxmode.max_rx_pkt_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>>>> + } >>>>>>>> + >>>>>>>> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, >>>>>>>> &conf); >>>>>>>> if (diag) { >>>>>>>> break; >>>>>>>> } >>>>>>>> @@ -586,6 +673,7 @@ netdev_dpdk_init(struct netdev *netdev_, >>>> unsigned int port_no, >>>>>>>> struct netdev_dpdk *netdev = netdev_dpdk_cast(netdev_); >>>>>>>> int sid; >>>>>>>> int err = 0; >>>>>>>> + uint32_t buf_size; >>>>>>>> >>>>>>>> ovs_mutex_init(&netdev->mutex); >>>>>>>> ovs_mutex_lock(&netdev->mutex); >>>>>>>> @@ -605,10 +693,16 @@ netdev_dpdk_init(struct netdev *netdev_, >>>> unsigned int port_no, >>>>>>>> netdev->port_id = port_no; >>>>>>>> netdev->type = type; >>>>>>>> netdev->flags = 0; >>>>>>>> + >>>>>>>> + /* Initialize port's MTU and frame len to the default >> Ethernet values. >>>>>>>> + * Larger, user-specified (jumbo) frame buffers are accommodated >>>>>>>> in >>>>>>>> + * netdev_dpdk_set_config. >>>>>>>> + */ >>>>>>>> + netdev->max_packet_len = ETHER_MAX_LEN; >>>>>>>> netdev->mtu = ETHER_MTU; >>>>>>>> - netdev->max_packet_len = MTU_TO_MAX_LEN(netdev->mtu); >>>>>>>> >>>>>>>> - netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, netdev->mtu); >>>>>>>> + buf_size = dpdk_buf_size(netdev, ETHER_MAX_LEN); >>>>>>>> + netdev->dpdk_mp = dpdk_mp_get(netdev->socket_id, >>>> FRAME_LEN_TO_MTU(buf_size)); >>>>>>>> if (!netdev->dpdk_mp) { >>>>>>>> err = ENOMEM; >>>>>>>> goto unlock; >>>>>>>> @@ -651,6 +745,24 @@ dpdk_dev_parse_name(const char dev_name[], >>>> const char prefix[], >>>>>>>> return 0; >>>>>>>> } >>>>>>>> >>>>>>>> +static void >>>>>>>> +dpdk_dev_parse_mtu(const struct smap *args, int *mtu) >>>>>>>> +{ >>>>>>>> + const char *mtu_str = smap_get(args, "dpdk-mtu"); >>>>>>>> + int local_mtu; >>>>>>>> + >>>>>>>> + if(mtu_str) { >>>>>>>> + local_mtu = atoi(mtu_str); >>>>>>> >>>>>>>Please use strtol or strtoul here, and detect errors in the string by >>>>>>>checking endptr. That way, we can be sure that random garbage on the >>>>>>>stack won't accidentally become the mtu. >>>>>> >>>>>>Will do, thanks. >>>>>> >>>>>>> >>>>>>>> + } >>>>>>>> + if(!mtu_str || local_mtu < ETHER_MTU || >>>>>>>> + local_mtu > FRAME_LEN_TO_MTU(NETDEV_DPDK_MAX_FRAME_LEN)) { >>>>>>>> + local_mtu = ETHER_MTU; >>>>>>>> + VLOG_WARN("Invalid or missing dpdk-mtu parameter - defaulting >>>> to %d.\n", >>>>>>>> + local_mtu); >>>>>>>> + } >>>>>>>> + *mtu = local_mtu; >>>>>>>> +} >>>>>>>> + >>>>>>>> static int >>>>>>>> vhost_construct_helper(struct netdev *netdev_) >>>>>>>> OVS_REQUIRES(dpdk_mutex) >>>>>>>> { >>>>>>>> @@ -777,11 +889,77 @@ netdev_dpdk_get_config(const struct netdev >>>> *netdev_, struct smap >>>>>>>*args) >>>>>>>> smap_add_format(args, "configured_rx_queues", "%d", >>>>>>>> netdev_->n_rxq); >>>>>>>> smap_add_format(args, "requested_tx_queues", "%d", >>>>>>>> netdev_->n_txq); >>>>>>>> smap_add_format(args, "configured_tx_queues", "%d", >> dev->real_n_txq); >>>>>>>> + smap_add_format(args, "mtu", "%d", dev->mtu); >>>>>>>> ovs_mutex_unlock(&dev->mutex); >>>>>>>> >>>>>>>> return 0; >>>>>>>> } >>>>>>>> >>>>>>>> +/* Set the mtu of DPDK_DEV_ETH ports */ >>>>>>>> +static int >>>>>>>> +netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>>>>>>> +{ >>>>>>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>>>> + int old_mtu, err; >>>>>>>> + uint32_t buf_size; >>>>>>>> + int dpdk_mtu; >>>>>>>> + struct dpdk_mp *old_mp; >>>>>>>> + struct dpdk_mp *mp; >>>>>>>> + >>>>>>>> + ovs_mutex_lock(&dpdk_mutex); >>>>>>>> + ovs_mutex_lock(&dev->mutex); >>>>>>>> + if (dev->mtu == mtu) { >>>>>>>> + err = 0; >>>>>>>> + goto out; >>>>>>>> + } >>>>>>>> + >>>>>>>> + buf_size = dpdk_buf_size(dev, MTU_TO_FRAME_LEN(mtu)); >>>>>>>> + dpdk_mtu = FRAME_LEN_TO_MTU(buf_size); >>>>>>>> + >>>>>>>> + mp = dpdk_mp_get(dev->socket_id, dpdk_mtu); >>>>>>>> + if (!mp) { >>>>>>>> + err = ENOMEM; >>>>>>>> + goto out; >>>>>>>> + } >>>>>>>> + >>>>>>>> + rte_eth_dev_stop(dev->port_id); >>>>>>>> + >>>>>>>> + old_mtu = dev->mtu; >>>>>>>> + old_mp = dev->dpdk_mp; >>>>>>>> + dev->dpdk_mp = mp; >>>>>>>> + dev->mtu = mtu; >>>>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>>>> + >>>>>>>> + err = dpdk_eth_dev_init(dev); >>>>>>> >>>>>>>Why call the dpdk_eth_dev_init here? Since you are directly calling >>>>>>>rte_eth_dev_X functions here - wouldn't it make sense to use the >>>>>>>rte_eth_dev_start() instead? >>>> >>>> The rx queues for the device need to be assigned the mempool that was >>>> allocated for the jumbo frames; since dpdk_eth_dev_init handles the >>>> queue setup, rte_eth_dev configuration, etc. it makes sense to call it >>>> here. >>>> >>>>>> >>>>>>This is legacy code, but I can refactor if needs be. >>>>>> >>>>>>> >>>>>>>> + if (err) { >>>>>>>> + VLOG_WARN("Unable to set MTU '%d' for '%s'; reverting to last >>>> known " >>>>>>>> + "good value '%d'\n", mtu, dev->up.name, old_mtu); >>>>>>>> + dpdk_mp_put(mp); >>>>>>>> + dev->mtu = old_mtu; >>>>>>>> + dev->dpdk_mp = old_mp; >>>>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>>>> + err = dpdk_eth_dev_init(dev); >>>>>>> >>>>>>>What happens if this returns a failure? By using dpdk_eth_dev_init, you >>>>>>>will cause a queue reconfiguration action, which is not needed, and can >>>>>>>error. >>>> >>>> This call is needed, in case the eth_dev's queue configuration was >>>> changed by the previous call. >>>> The error handling for this call is arguably outside the scope of this >>>> patch, and could be submitted as a separate patch - what do you think? >>> >>>That's probably a good idea. >>> >>>>>>> >>>>>> >>>>>>As above. >>>>>> >>>>>>>If you are intending to set MTU, I would have expected calls to >>>>>>>rte_eth_dev_set_mtu. However, I don't see any. Is there a reason? I >>>>>>>recognize that much of this is legacy, but since you are touching it :) >>>>>> >>>>>>I actually had a number of issues with rte_eth_dev_set_mtu early on >>>> in the patch; I can take >>>>>>a second look now that the code is more stable though. >>>>> >>>>>I remembered why I didn't use this. >>>>> >>>>>If you look at the IXGBE driver's mtu_set function, it refuses frame >>>> lengths which are larger >>>>>than the eth_dev's min_rx_buf_size (- RTE_PKTMBUF_HEADROOM). >>>>>This might be avoided by configuring the device for scatter RX >>>> behavior, but this incurs a >>>>>performance impact. >>> >>>It seems strange to me that we need to ignore the library API for this >>>one case, right? What if the user is only doing vhostuser PMDs mixed >>>with a bnx2x card? >>> >> >> I can't say that I've familiar enough with this setup to form an >> answer on that one Aaron. >> >>>It would be better to get it fixed in the library, I think. After all, >>>they gave a generic setting to change MTU - why don't they make it >>>behave well? >>> >> >> Let me follow up on that with some local DPDK folks here. >>
Just a quick follow up on this Aaron: I did implement a version of this patch that used rte_eth_dev_set_mtu; however, not all DPDK drivers yet implement that function (e.g. i40e), so I reverted to the previous version. DPDK team did mention that the _set_mtu function for i40e should be implemented 'soon', but no dates were specified. See this thread on the DPDK ML for more details: http://dpdk.org/ml/archives/dev/2016-February/032910.html >> Would you consider this a showstopper for upstreaming of this feature? >> If we need to wait for a change to DPDK, Jumbo Frames will miss the >> 2.5 window. > >I don't think it would be a show stopper as long as other devices wouldn't >be negatively impacted, but it's a place where a /*XXX: This is a >workaround for dpdk*/ comment should be included so that we can fix it >when/if it is updated in the library. What do you think? > >>>>><snippet> >>>>>ixgbe_dev_mtu_set(...) { >>>>> >>>>>... >>>>> >>>>> /* refuse mtu that requires the support of scattered packets when this >>>>> * feature has not been enabled before. */ >>>>> if (!dev->data->scattered_rx && >>>>> (frame_size + 2 * IXGBE_VLAN_TAG_SIZE > >>>>> dev->data->min_rx_buf_size - RTE_PKTMBUF_HEADROOM)) >>>>> return -EINVAL; >>>>> >>>>>... >>>>>} >>>>></snippet> >>>>> >>>>> >>>>>> >>>>>>> >>>>>>>> + goto out; >>>>>>>> + } >>>>>>>> + >>>>>>>> + dpdk_mp_put(old_mp); >>>>>>>> + netdev_change_seq_changed(netdev); >>>>>>>> +out: >>>>>>>> + ovs_mutex_unlock(&dev->mutex); >>>>>>>> + ovs_mutex_unlock(&dpdk_mutex); >>>>>>>> + return err; >>>>>>>> +} >>>>>>>> + >>>>>>>> +static int >>>>>>>> +netdev_dpdk_set_config(struct netdev *netdev_, const struct smap >>>>>>>> *args) >>>>>>>> +{ >>>>>>>> + int mtu; >>>>>>>> + >>>>>>>> + dpdk_dev_parse_mtu(args, &mtu); >>>>>>>> + >>>>>>>> + return netdev_dpdk_set_mtu(netdev_, mtu); >>>>>>>> +} >>>>>>>> + >>>>>>>> static int >>>>>>>> netdev_dpdk_get_numa_id(const struct netdev *netdev_) >>>>>>>> { >>>>>>>> @@ -1358,54 +1536,6 @@ netdev_dpdk_get_mtu(const struct netdev >>>> *netdev, int *mtup) >>>>>>>> >>>>>>>> return 0; >>>>>>>> } >>>>>>>> - >>>>>>>> -static int >>>>>>>> -netdev_dpdk_set_mtu(const struct netdev *netdev, int mtu) >>>>>>>> -{ >>>>>>>> - struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>>>> - int old_mtu, err; >>>>>>>> - struct dpdk_mp *old_mp; >>>>>>>> - struct dpdk_mp *mp; >>>>>>>> - >>>>>>>> - ovs_mutex_lock(&dpdk_mutex); >>>>>>>> - ovs_mutex_lock(&dev->mutex); >>>>>>>> - if (dev->mtu == mtu) { >>>>>>>> - err = 0; >>>>>>>> - goto out; >>>>>>>> - } >>>>>>>> - >>>>>>>> - mp = dpdk_mp_get(dev->socket_id, dev->mtu); >>>>>>>> - if (!mp) { >>>>>>>> - err = ENOMEM; >>>>>>>> - goto out; >>>>>>>> - } >>>>>>>> - >>>>>>>> - rte_eth_dev_stop(dev->port_id); >>>>>>>> - >>>>>>>> - old_mtu = dev->mtu; >>>>>>>> - old_mp = dev->dpdk_mp; >>>>>>>> - dev->dpdk_mp = mp; >>>>>>>> - dev->mtu = mtu; >>>>>>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>>>>>>> - >>>>>>>> - err = dpdk_eth_dev_init(dev); >>>>>>>> - if (err) { >>>>>>>> - dpdk_mp_put(mp); >>>>>>>> - dev->mtu = old_mtu; >>>>>>>> - dev->dpdk_mp = old_mp; >>>>>>>> - dev->max_packet_len = MTU_TO_MAX_LEN(dev->mtu); >>>>>>>> - dpdk_eth_dev_init(dev); >>>>>>>> - goto out; >>>>>>>> - } >>>>>>>> - >>>>>>>> - dpdk_mp_put(old_mp); >>>>>>>> - netdev_change_seq_changed(netdev); >>>>>>>> -out: >>>>>>>> - ovs_mutex_unlock(&dev->mutex); >>>>>>>> - ovs_mutex_unlock(&dpdk_mutex); >>>>>>>> - return err; >>>>>>>> -} >>>>>>>> - >>>>>>>> static int >>>>>>>> netdev_dpdk_get_carrier(const struct netdev *netdev_, bool *carrier); >>>>>>>> >>>>>>>> @@ -1682,7 +1812,7 @@ netdev_dpdk_get_status(const struct netdev >>>> *netdev_, struct smap >>>>>>>*args) >>>>>>>> smap_add_format(args, "numa_id", "%d", >>>> rte_eth_dev_socket_id(dev->port_id)); >>>>>>>> smap_add_format(args, "driver_name", "%s", dev_info.driver_name); >>>>>>>> smap_add_format(args, "min_rx_bufsize", "%u", >>>> dev_info.min_rx_bufsize); >>>>>>>> - smap_add_format(args, "max_rx_pktlen", "%u", >> dev_info.max_rx_pktlen); >>>>>>>> + smap_add_format(args, "max_rx_pktlen", "%u", dev->max_packet_len); >>>>>>>> smap_add_format(args, "max_rx_queues", "%u", >> dev_info.max_rx_queues); >>>>>>>> smap_add_format(args, "max_tx_queues", "%u", >> dev_info.max_tx_queues); >>>>>>>> smap_add_format(args, "max_mac_addrs", "%u", >> dev_info.max_mac_addrs); >>>>>>>> @@ -1904,6 +2034,51 @@ dpdk_vhost_user_class_init(void) >>>>>>>> return 0; >>>>>>>> } >>>>>>>> >>>>>>>> +/* Set the mtu of DPDK_DEV_VHOST ports */ >>>>>>>> +static int >>>>>>>> +netdev_dpdk_vhost_set_mtu(const struct netdev *netdev, int mtu) >>>>>>>> +{ >>>>>>>> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >>>>>>>> + int err = 0; >>>>>>>> + struct dpdk_mp *old_mp; >>>>>>>> + struct dpdk_mp *mp; >>>>>>>> + >>>>>>>> + ovs_mutex_lock(&dpdk_mutex); >>>>>>>> + ovs_mutex_lock(&dev->mutex); >>>>>>>> + if (dev->mtu == mtu) { >>>>>>>> + err = 0; >>>>>>>> + goto out; >>>>>>>> + } >>>>>>>> + >>>>>>>> + mp = dpdk_mp_get(dev->socket_id, mtu); >>>>>>>> + if (!mp) { >>>>>>>> + err = ENOMEM; >>>>>>>> + goto out; >>>>>>>> + } >>>>>>>> + >>>>>>>> + old_mp = dev->dpdk_mp; >>>>>>>> + dev->dpdk_mp = mp; >>>>>>>> + dev->mtu = mtu; >>>>>>>> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >>>>>>>> + >>>>>>>> + dpdk_mp_put(old_mp); >>>>>>>> + netdev_change_seq_changed(netdev); >>>>>>>> +out: >>>>>>>> + ovs_mutex_unlock(&dev->mutex); >>>>>>>> + ovs_mutex_unlock(&dpdk_mutex); >>>>>>>> + return err; >>>>>>>> +} >>>>>>>> + >>>>>>>> +static int >>>>>>>> +netdev_dpdk_vhost_set_config(struct netdev *netdev_, const >>>> struct smap *args) >>>>>>>> +{ >>>>>>>> + int mtu; >>>>>>>> + >>>>>>>> + dpdk_dev_parse_mtu(args, &mtu); >>>>>>>> + >>>>>>>> + return netdev_dpdk_vhost_set_mtu(netdev_, mtu); >>>>>>>> +} >>>>>>>> + >>>>>>>> static void >>>>>>>> dpdk_common_init(void) >>>>>>>> { >>>>>>>> @@ -2040,8 +2215,9 @@ unlock_dpdk: >>>>>>>> return err; >>>>>>>> } >>>>>>>> >>>>>>>> -#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, >> MULTIQ, SEND, \ >>>>>>>> - GET_CARRIER, GET_STATS, GET_FEATURES, GET_STATUS, RXQ_RECV) \ >>>>>>>> +#define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, >>>>>>>> SET_CONFIG, \ >>>>>>>> + MULTIQ, SEND, SET_MTU, GET_CARRIER, GET_STATS, GET_FEATURES, >>>>>>>> \ >>>>>>>> + GET_STATUS, RXQ_RECV) >>>>>>>> \ >>>>>>>> { \ >>>>>>>> NAME, \ >>>>>>>> INIT, /* init */ \ >>>>>>>> @@ -2053,7 +2229,7 @@ unlock_dpdk: >>>>>>>> DESTRUCT, \ >>>>>>>> netdev_dpdk_dealloc, \ >>>>>>>> netdev_dpdk_get_config, \ >>>>>>>> - NULL, /* netdev_dpdk_set_config */ \ >>>>>>>> + SET_CONFIG, \ >>>>>>>> NULL, /* get_tunnel_config */ \ >>>>>>>> NULL, /* build header */ \ >>>>>>>> NULL, /* push header */ \ >>>>>>>> @@ -2067,7 +2243,7 @@ unlock_dpdk: >>>>>>>> netdev_dpdk_set_etheraddr, \ >>>>>>>> netdev_dpdk_get_etheraddr, \ >>>>>>>> netdev_dpdk_get_mtu, \ >>>>>>>> - netdev_dpdk_set_mtu, \ >>>>>>>> + SET_MTU, \ >>>>>>>> netdev_dpdk_get_ifindex, \ >>>>>>>> GET_CARRIER, \ >>>>>>>> netdev_dpdk_get_carrier_resets, \ >>>>>>>> @@ -2213,8 +2389,10 @@ static const struct netdev_class dpdk_class = >>>>>>>> NULL, >>>>>>>> netdev_dpdk_construct, >>>>>>>> netdev_dpdk_destruct, >>>>>>>> + netdev_dpdk_set_config, >>>>>>>> netdev_dpdk_set_multiq, >>>>>>>> netdev_dpdk_eth_send, >>>>>>>> + netdev_dpdk_set_mtu, >>>>>>>> netdev_dpdk_get_carrier, >>>>>>>> netdev_dpdk_get_stats, >>>>>>>> netdev_dpdk_get_features, >>>>>>>> @@ -2227,8 +2405,10 @@ static const struct netdev_class >>>>>>>> dpdk_ring_class = >>>>>>>> NULL, >>>>>>>> netdev_dpdk_ring_construct, >>>>>>>> netdev_dpdk_destruct, >>>>>>>> + netdev_dpdk_set_config, >>>>>>>> netdev_dpdk_set_multiq, >>>>>>>> netdev_dpdk_ring_send, >>>>>>>> + netdev_dpdk_set_mtu, >>>>>>>> netdev_dpdk_get_carrier, >>>>>>>> netdev_dpdk_get_stats, >>>>>>>> netdev_dpdk_get_features, >>>>>>>> @@ -2241,8 +2421,10 @@ static const struct netdev_class >>>> OVS_UNUSED dpdk_vhost_cuse_class >>>>>= >>>>>>>> dpdk_vhost_cuse_class_init, >>>>>>>> netdev_dpdk_vhost_cuse_construct, >>>>>>>> netdev_dpdk_vhost_destruct, >>>>>>>> + NULL, >>>>>>>> netdev_dpdk_vhost_set_multiq, >>>>>>>> netdev_dpdk_vhost_send, >>>>>>>> + NULL, >>>>>>>> netdev_dpdk_vhost_get_carrier, >>>>>>>> netdev_dpdk_vhost_get_stats, >>>>>>>> NULL, >>>>>>>> @@ -2255,8 +2437,10 @@ static const struct netdev_class >>>> OVS_UNUSED dpdk_vhost_user_class >>>>>= >>>>>>>> dpdk_vhost_user_class_init, >>>>>>>> netdev_dpdk_vhost_user_construct, >>>>>>>> netdev_dpdk_vhost_destruct, >>>>>>>> + netdev_dpdk_vhost_set_config, >>>>>>>> netdev_dpdk_vhost_set_multiq, >>>>>>>> netdev_dpdk_vhost_send, >>>>>>>> + netdev_dpdk_vhost_set_mtu, >>>>>>>> netdev_dpdk_vhost_get_carrier, >>>>>>>> netdev_dpdk_vhost_get_stats, >>>>>>>> NULL, >>>>>>_______________________________________________ >>>>>>dev mailing list >>>>>>dev@openvswitch.org >>>>>>http://openvswitch.org/mailman/listinfo/dev >>>>>_______________________________________________ >>>>>dev mailing list >>>>>dev@openvswitch.org >>>>>http://openvswitch.org/mailman/listinfo/dev >>><#secure method=pgpmime mode=encrypt> _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev