> >Hi Daniele. Thanks for posting this. Hi Ilya,
I actually implemented this patch as part of Daniele's MTU patchset, based on my earlier patch - Daniele mainly rebased it to head of master :) Thanks for your feedback - I've responded inline. Cheers, Mark >I have almost same patch in my local branch. > >I didn't test this with physical DPDK NICs yet, but I have few >high level comments: > >1. Do you thought about renaming of 'mtu_request' inside netdev-dpdk > to 'requested_mtu'? I think, this would be more clear and > consistent with other configurable parameters (n_rxq, n_txq, ...). 'mtu_request' was the name suggested by Daniele, following a discussion with colleagues. I don't have strong feelings either way, so I'll leave Daniele to comment. > >2. I'd prefer not to fail reconfiguration if there is no enough memory > for new mempool. I think, it'll be common situation when we are > requesting more memory than we have. Failure leads to destruction > of the port and inability to reconnect to vhost-user port after > re-creation if vhost is in server mode. We can just keep old > mempool and inform user via VLOG_ERR. > Agreed - I'll modify V2 accordingly. >3. Minor issues inline. Comments on these inline also. > >What do you think? > >Best regards, Ilya Maximets. > >On 30.07.2016 04:22, Daniele Di Proietto wrote: >> From: Mark Kavanagh <mark.b.kavan...@intel.com> >> >> Add support for Jumbo Frames to DPDK-enabled port types, >> using single-segment-mbufs. >> >> Using this approach, the amount of memory allocated to each mbuf >> to store frame data is increased to a value greater than 1518B >> (typical Ethernet maximum frame length). The increased space >> available in the mbuf means that an entire Jumbo Frame of a specific >> size can be carried in a single mbuf, as opposed to partitioning >> it across multiple mbuf segments. >> >> The amount of space allocated to each mbuf to hold frame data is >> defined dynamically by the user with ovs-vsctl, via the 'mtu_request' >> parameter. >> >> Signed-off-by: Mark Kavanagh <mark.b.kavan...@intel.com> >> [diproiet...@vmware.com rebased] >> Signed-off-by: Daniele Di Proietto <diproiet...@vmware.com> >> --- >> INSTALL.DPDK-ADVANCED.md | 59 +++++++++++++++++- >> INSTALL.DPDK.md | 1 - >> NEWS | 1 + >> lib/netdev-dpdk.c | 151 >> +++++++++++++++++++++++++++++++++++++++-------- >> 4 files changed, 185 insertions(+), 27 deletions(-) >> >> diff --git a/INSTALL.DPDK-ADVANCED.md b/INSTALL.DPDK-ADVANCED.md >> index 191e69e..5cd64bf 100755 >> --- a/INSTALL.DPDK-ADVANCED.md >> +++ b/INSTALL.DPDK-ADVANCED.md >> @@ -1,5 +1,5 @@ >> OVS DPDK ADVANCED INSTALL GUIDE >> -================================= >> +=============================== >> >> ## Contents >> >> @@ -12,7 +12,8 @@ OVS DPDK ADVANCED INSTALL GUIDE >> 7. [QOS](#qos) >> 8. [Rate Limiting](#rl) >> 9. [Flow Control](#fc) >> -10. [Vsperf](#vsperf) >> +10. [Jumbo Frames](#jumbo) >> +11. [Vsperf](#vsperf) >> >> ## <a name="overview"></a> 1. Overview >> >> @@ -862,7 +863,59 @@ respective parameter. To disable the flow control at tx >> side, >> >> `ovs-vsctl set Interface dpdk0 options:tx-flow-ctrl=false` >> >> -## <a name="vsperf"></a> 10. Vsperf >> +## <a name="jumbo"></a> 10. Jumbo Frames >> + >> +By default, DPDK ports are configured with standard Ethernet MTU (1500B). To >> +enable Jumbo Frames support for a DPDK port, change the Interface's >> `mtu_request` >> +attribute to a sufficiently large value. >> + >> +e.g. Add a DPDK Phy port with MTU of 9000: >> + >> +`ovs-vsctl add-port br0 dpdk0 -- set Interface dpdk0 type=dpdk -- set >> Interface dpdk0 >mtu_request=9000` >> + >> +e.g. Change the MTU of an existing port to 6200: >> + >> +`ovs-vsctl set Interface dpdk0 mtu_request=6200` >> + >> +When Jumbo Frames are enabled, the size of a DPDK port's mbuf segments are >> +increased, such that a full Jumbo Frame of a specific size may be >> accommodated >> +within a single mbuf segment. >> + >> +Jumbo frame support has been validated against 9728B frames (largest frame >> size >> +supported by Fortville NIC), using the DPDK `i40e` driver, but larger frames >> +(particularly in use cases involving East-West traffic only), and other >> DPDK NIC >> +drivers may be supported. >> + >> +### 9.1 vHost Ports and Jumbo Frames >> + >> +Some additional configuration is needed to take advantage of jumbo frames >> with >> +vhost ports: >> + >> + 1. `mergeable buffers` must be enabled for vHost ports, as demonstrated >> in >> + the QEMU command line snippet below: >> + >> + ``` >> + '-netdev type=vhost-user,id=mynet1,chardev=char0,vhostforce \' >> + '-device >> virtio-net-pci,mac=00:00:00:00:00:01,netdev=mynet1,mrg_rxbuf=on' >> + ``` >> + >> + 2. Where virtio devices are bound to the Linux kernel driver in a guest >> + environment (i.e. interfaces are not bound to an in-guest DPDK >> driver), >> + the MTU of those logical network interfaces must also be increased >> to a >> + sufficiently large value. This avoids segmentation of Jumbo Frames >> + received in the guest. Note that 'MTU' refers to the length of the IP >> + packet only, and not that of the entire frame. >> + >> + To calculate the exact MTU of a standard IPv4 frame, subtract the L2 >> + header and CRC lengths (i.e. 18B) from the max supported frame size. >> + So, to set the MTU for a 9018B Jumbo Frame: >> + >> + ``` >> + ifconfig eth1 mtu 9000 >> + ``` >> +>>>>>>> 5ec921d... netdev-dpdk: add support for Jumbo Frames > >Looks like rebasing artefact. Yup - I'll remove in V2. > >> + >> +## <a name="vsperf"></a> 11. Vsperf >> >> Vsperf project goal is to develop vSwitch test framework that can be used to >> validate the suitability of different vSwitch implementations in a Telco >> deployment >> diff --git a/INSTALL.DPDK.md b/INSTALL.DPDK.md >> index 7609aa7..25c79de 100644 >> --- a/INSTALL.DPDK.md >> +++ b/INSTALL.DPDK.md >> @@ -590,7 +590,6 @@ can be found in [Vhost Walkthrough]. >> >> ## <a name="ovslimits"></a> 6. Limitations >> >> - - Supports MTU size 1500, MTU setting for DPDK netdevs will be in future >> OVS release. >> - Currently DPDK ports does not use HW offload functionality. >> - Network Interface Firmware requirements: >> Each release of DPDK is validated against a specific firmware version >> for >> diff --git a/NEWS b/NEWS >> index 0ff5616..c004e5f 100644 >> --- a/NEWS >> +++ b/NEWS >> @@ -68,6 +68,7 @@ Post-v2.5.0 >> is enabled in DPDK. >> * Basic connection tracking for the userspace datapath (no ALG, >> fragmentation or NAT support yet) >> + * Jumbo frame support >> - Increase number of registers to 16. >> - ovs-benchmark: This utility has been removed due to lack of use and >> bitrot. >> diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c >> index 0b6e410..68639ae 100644 >> --- a/lib/netdev-dpdk.c >> +++ b/lib/netdev-dpdk.c >> @@ -82,6 +82,7 @@ static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(5, >> 20); >> + sizeof(struct dp_packet) \ >> + RTE_PKTMBUF_HEADROOM) >> #define NETDEV_DPDK_MBUF_ALIGN 1024 >> +#define NETDEV_DPDK_MAX_PKT_LEN 9728 >> >> /* Max and min number of packets in the mempool. OVS tries to allocate a >> * mempool with MAX_NB_MBUF: if this fails (because the system doesn't have >> @@ -336,6 +337,7 @@ struct netdev_dpdk { >> struct ovs_mutex mutex OVS_ACQ_AFTER(dpdk_mutex); >> >> struct dpdk_mp *dpdk_mp; >> + int mtu_request; >> int mtu; >> int socket_id; >> int buf_size; >> @@ -474,10 +476,19 @@ dpdk_mp_get(int socket_id, int mtu) >> OVS_REQUIRES(dpdk_mutex) >> dmp->mtu = mtu; >> dmp->refcount = 1; >> mbp_priv.mbuf_data_room_size = MBUF_SIZE(mtu) - sizeof(struct >> dp_packet); >> - mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) - >> - sizeof (struct rte_mbuf); >> + mbp_priv.mbuf_priv_size = sizeof (struct dp_packet) >> + - sizeof (struct rte_mbuf); >> + /* XXX: this is a really rough method of provisioning memory. >> + * It's impossible to determine what the exact memory requirements are >> when >> + * the number of ports and rxqs that utilize a particular mempool can >> change >> + * dynamically at runtime. For the moment, use this rough heurisitic. >> + */ >> + if (mtu >= ETHER_MTU) { >> + mp_size = MAX_NB_MBUF; >> + } else { >> + mp_size = MIN_NB_MBUF; >> + } >> >> - mp_size = MAX_NB_MBUF; >> do { >> if (snprintf(mp_name, RTE_MEMPOOL_NAMESIZE, "ovs_mp_%d_%d_%u", >> dmp->mtu, dmp->socket_id, mp_size) < 0) { >> @@ -522,6 +533,32 @@ dpdk_mp_put(struct dpdk_mp *dmp) >> #endif >> } >> >> +static int >> +dpdk_mp_configure(struct netdev_dpdk *dev) >> + OVS_REQUIRES(dpdk_mutex) >> + OVS_REQUIRES(dev->mutex) >> +{ >> + uint32_t buf_size = dpdk_buf_size(dev->mtu); >> + struct dpdk_mp *mp_old, *mp; >> + >> + mp_old = dev->dpdk_mp; > >Do we need this variable? We can just put old mempool before >assigning the new one. Agreed - will remove in V2. > >> + >> + mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size)); >> + if (!mp) { >> + VLOG_ERR("Insufficient memory to create memory pool for netdev >> %s\n", >> + dev->up.name); > >+1 space character. Sure - will fix. > >> + return ENOMEM; >> + } >> + >> + dev->dpdk_mp = mp; >> + dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >> + >> + dpdk_mp_put(mp_old); >> + >> + return 0; >> +} >> + >> + >> static void >> check_link_status(struct netdev_dpdk *dev) >> { >> @@ -573,7 +610,15 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int >> n_rxq, int >n_txq) >> { >> int diag = 0; >> int i; >> + struct rte_eth_conf conf = port_conf; >> >> + if (dev->mtu > ETHER_MTU) { >> + conf.rxmode.jumbo_frame = 1; >> + conf.rxmode.max_rx_pkt_len = dev->max_packet_len; >> + } else { >> + conf.rxmode.jumbo_frame = 0; >> + conf.rxmode.max_rx_pkt_len = 0; > >I know, it was implemented this way in the original patch, but I'm >not sure that all DPDK drivers will handle zero value of >'max_rx_pkt_len' in a right way. Out of interest, can you point to any examples of DPDK drivers handling max_rx_pkt_len = 0 with unintended behaviour? This field should only be of relevance if rxmode.jumbo_frame = 1, and 0 is its default value; if it is an issue though, I can just default instead to ETHER_MAX_LEN. > >> + } >> /* A device may report more queues than it makes available (this has >> * been observed for Intel xl710, which reserves some of them for >> * SRIOV): rte_eth_*_queue_setup will fail if a queue is not >> @@ -584,8 +629,10 @@ dpdk_eth_dev_queue_setup(struct netdev_dpdk *dev, int >> n_rxq, int >n_txq) >> VLOG_INFO("Retrying setup with (rxq:%d txq:%d)", n_rxq, n_txq); >> } >> >> - diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, >> &port_conf); >> + diag = rte_eth_dev_configure(dev->port_id, n_rxq, n_txq, &conf); >> if (diag) { >> + VLOG_WARN("Interface %s eth_dev setup error %s\n", >> + dev->up.name, rte_strerror(-diag)); >> break; >> } >> >> @@ -738,7 +785,6 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int >> port_no, >> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >> int sid; >> int err = 0; >> - uint32_t buf_size; >> >> ovs_mutex_init(&dev->mutex); >> ovs_mutex_lock(&dev->mutex); >> @@ -759,13 +805,11 @@ netdev_dpdk_init(struct netdev *netdev, unsigned int >> port_no, >> dev->port_id = port_no; >> dev->type = type; >> dev->flags = 0; >> - dev->mtu = ETHER_MTU; >> + dev->mtu_request = dev->mtu = ETHER_MTU; >> dev->max_packet_len = MTU_TO_FRAME_LEN(dev->mtu); >> >> - buf_size = dpdk_buf_size(dev->mtu); >> - dev->dpdk_mp = dpdk_mp_get(dev->socket_id, FRAME_LEN_TO_MTU(buf_size)); >> - if (!dev->dpdk_mp) { >> - err = ENOMEM; >> + err = dpdk_mp_configure(dev); >> + if (err) { >> goto unlock; >> } >> >> @@ -986,6 +1030,7 @@ netdev_dpdk_get_config(const struct netdev *netdev, >> struct smap *args) >> smap_add_format(args, "configured_rx_queues", "%d", netdev->n_rxq); >> smap_add_format(args, "requested_tx_queues", "%d", >> dev->requested_n_txq); >> smap_add_format(args, "configured_tx_queues", "%d", netdev->n_txq); >> + smap_add_format(args, "mtu", "%d", dev->mtu); >> ovs_mutex_unlock(&dev->mutex); >> >> return 0; >> @@ -1362,6 +1407,7 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int >> qid, >> struct rte_mbuf **cur_pkts = (struct rte_mbuf **) pkts; >> unsigned int total_pkts = cnt; >> unsigned int qos_pkts = cnt; >> + unsigned int mtu_dropped = 0; >> int retries = 0; >> >> qid = dev->tx_q[qid % netdev->n_txq].map; >> @@ -1383,25 +1429,41 @@ __netdev_dpdk_vhost_send(struct netdev *netdev, int >> qid, >> do { >> int vhost_qid = qid * VIRTIO_QNUM + VIRTIO_RXQ; >> unsigned int tx_pkts; >> + unsigned int try_tx_pkts = cnt; >> >> + for (unsigned int i = 0; i < cnt; i++) { >> + if (cur_pkts[i]->pkt_len > dev->max_packet_len) { >> + try_tx_pkts = i; >> + break; >> + } >> + } >> + if (!try_tx_pkts) { >> + cur_pkts++; >> + mtu_dropped++; >> + cnt--; >> + continue; >> + } >> tx_pkts = rte_vhost_enqueue_burst(virtio_dev, vhost_qid, >> - cur_pkts, cnt); >> + cur_pkts, try_tx_pkts); >> if (OVS_LIKELY(tx_pkts)) { >> /* Packets have been sent.*/ >> cnt -= tx_pkts; >> /* Prepare for possible retry.*/ >> cur_pkts = &cur_pkts[tx_pkts]; >> + if (tx_pkts != try_tx_pkts) { >> + retries++; >> + } >> } else { >> /* No packets sent - do not retry.*/ >> break; >> } >> - } while (cnt && (retries++ < VHOST_ENQ_RETRY_NUM)); >> + } while (cnt && (retries <= VHOST_ENQ_RETRY_NUM)); >> >> rte_spinlock_unlock(&dev->tx_q[qid].tx_lock); >> >> rte_spinlock_lock(&dev->stats_lock); >> - cnt += qos_pkts; >> - netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, >> cnt); >> + netdev_dpdk_vhost_update_tx_counters(&dev->stats, pkts, total_pkts, >> + cnt + mtu_dropped + qos_pkts); >> rte_spinlock_unlock(&dev->stats_lock); >> >> out: >> @@ -1635,6 +1697,26 @@ netdev_dpdk_get_mtu(const struct netdev *netdev, int >> *mtup) >> } >> >> static int >> +netdev_dpdk_set_mtu(struct netdev *netdev, int mtu) >> +{ >> + struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >> + >> + if (MTU_TO_FRAME_LEN(mtu) > NETDEV_DPDK_MAX_PKT_LEN) { >> + VLOG_WARN("Unsupported MTU (%d)\n", mtu); >> + return EINVAL; >> + } >> + >> + ovs_mutex_lock(&dev->mutex); >> + if (dev->mtu_request != mtu) { >> + dev->mtu_request = mtu; >> + netdev_request_reconfigure(netdev); >> + } >> + ovs_mutex_unlock(&dev->mutex); >> + >> + return 0; >> +} >> + >> +static int >> netdev_dpdk_get_carrier(const struct netdev *netdev, bool *carrier); >> >> static int >> @@ -2787,7 +2869,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev) >> ovs_mutex_lock(&dev->mutex); >> >> if (netdev->n_txq == dev->requested_n_txq >> - && netdev->n_rxq == dev->requested_n_rxq) { >> + && netdev->n_rxq == dev->requested_n_rxq >> + && dev->mtu == dev->mtu_request) { >> /* Reconfiguration is unnecessary */ >> >> goto out; >> @@ -2795,6 +2878,14 @@ netdev_dpdk_reconfigure(struct netdev *netdev) >> >> rte_eth_dev_stop(dev->port_id); >> >> + if (dev->mtu != dev->mtu_request) { >> + dev->mtu = dev->mtu_request; >> + err = dpdk_mp_configure(dev); >> + if (err) { >> + goto out; >> + } >> + } >> + >> netdev->n_txq = dev->requested_n_txq; >> netdev->n_rxq = dev->requested_n_rxq; >> >> @@ -2802,6 +2893,8 @@ netdev_dpdk_reconfigure(struct netdev *netdev) >> err = dpdk_eth_dev_init(dev); >> netdev_dpdk_alloc_txq(dev, netdev->n_txq); >> >> + netdev_change_seq_changed(netdev); >> + >> out: >> >> ovs_mutex_unlock(&dev->mutex); >> @@ -2830,20 +2923,23 @@ netdev_dpdk_vhost_user_reconfigure(struct netdev >> *netdev) >> >> netdev_dpdk_remap_txqs(dev); >> >> - if (dev->requested_socket_id != dev->socket_id) { >> + if (dev->requested_socket_id != dev->socket_id >> + || dev->mtu_request != dev->mtu) { >> dev->socket_id = dev->requested_socket_id; >> - /* Change mempool to new NUMA Node */ >> - dpdk_mp_put(dev->dpdk_mp); >> - dev->dpdk_mp = dpdk_mp_get(dev->socket_id, dev->mtu); >> - if (!dev->dpdk_mp) { >> - err = ENOMEM; >> + dev->mtu = dev->mtu_request; >> + /* Change mempool to new NUMA Node and to new MTU. */ >> + err = dpdk_mp_configure(dev); >> + if (err) { >> + goto out; >> } >> + netdev_change_seq_changed(netdev); >> } >> >> if (virtio_dev) { >> virtio_dev->flags |= VIRTIO_DEV_RUNNING; >> } >> >> +out: >> ovs_mutex_unlock(&dev->mutex); >> ovs_mutex_unlock(&dpdk_mutex); >> >> @@ -2854,6 +2950,7 @@ static int >> netdev_dpdk_vhost_cuse_reconfigure(struct netdev *netdev) >> { >> struct netdev_dpdk *dev = netdev_dpdk_cast(netdev); >> + int err = 0; >> >> ovs_mutex_lock(&dpdk_mutex); >> ovs_mutex_lock(&dev->mutex); >> @@ -2861,10 +2958,18 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev >> *netdev) >> netdev->n_txq = dev->requested_n_txq; >> netdev->n_rxq = 1; >> >> + if (dev->mtu_request != dev->mtu) { >> + /* Change mempool to new MTU. */ >> + err = dpdk_mp_configure(dev); >> + if (!err) { >> + netdev_change_seq_changed(netdev); >> + } >> + } >> + >> ovs_mutex_unlock(&dev->mutex); >> ovs_mutex_unlock(&dpdk_mutex); >> >> - return 0; >> + return err; >> } >> >> #define NETDEV_DPDK_CLASS(NAME, INIT, CONSTRUCT, DESTRUCT, \ >> @@ -2898,7 +3003,7 @@ netdev_dpdk_vhost_cuse_reconfigure(struct netdev >> *netdev) >> netdev_dpdk_set_etheraddr, \ >> netdev_dpdk_get_etheraddr, \ >> netdev_dpdk_get_mtu, \ >> - NULL, /* set_mtu */ \ >> + netdev_dpdk_set_mtu, \ >> netdev_dpdk_get_ifindex, \ >> GET_CARRIER, \ >> netdev_dpdk_get_carrier_resets, \ >> _______________________________________________ dev mailing list dev@openvswitch.org http://openvswitch.org/mailman/listinfo/dev