Overview ======== This patchset introduces support for multi-segment mbufs to OvS-DPDK. Multi-segment mbufs are typically used when the size of an mbuf is insufficient to contain the entirety of a packet's data. Instead, the data is split across numerous mbufs, each carrying a portion, or 'segment', of the packet data. mbufs are chained via their 'next' attribute (an mbuf pointer).
Use Cases ========= i. Handling oversized (guest-originated) frames, which are marked for hardware accelration/offload (TSO, for example). Packets which originate from a non-DPDK source may be marked for offload; as such, they may be larger than the permitted ingress interface's MTU, and may be stored in an oversized dp-packet. In order to transmit such packets over a DPDK port, their contents must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in its current implementation, that function only copies data into a single mbuf; if the space available in the mbuf is exhausted, but not all packet data has been copied, then it is lost. Similarly, when cloning a DPDK mbuf, it must be considered whether that mbuf contains multiple segments. Both issues are resolved within this patchset. ii. Handling jumbo frames. While OvS already supports jumbo frames, it does so by increasing mbuf size, such that the entirety of a jumbo frame may be handled in a single mbuf. This is certainly the preferred, and most performant approach (and remains the default). However, it places high demands on system memory; multi-segment mbufs may be prefereable for systems which are memory-constrained. Enabling multi-segment mbufs ============================ Multi-segment and single-segment mbufs are mutually exclusive, and the user must decide on which approach to adopt on init. The introduction of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this. This is a global boolean value, which determines how jumbo frames are represented across all DPDK ports. In the absence of a user-supplied value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment mbufs must be explicitly enabled / single-segment mbufs remain the default. Setting the field is identical to setting existing DPDK-specific OVSDB fields: ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10 ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0 ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true Performance notes ================= In order to test for regressions in performance, tests were run on top of master 88125d6 and v8 of this patchset, both with the multi-segment mbufs option enabled and disabled. VSperf was used to run the phy2phy_cont and pvp_cont tests with varying packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface. Test | Size | Master | Multi-seg disabled | Multi-seg enabled ------------------------------------------------------------- p2p | 64 | ~22.7 | ~22.65 | ~18.3 p2p | 1500 | ~1.6 | ~1.6 | ~1.6 p2p | 7000 | ~0.36 | ~0.36 | ~0.36 pvp | 64 | ~6.7 | ~6.7 | ~6.3 pvp | 1500 | ~1.6 | ~1.6 | ~1.6 pvp | 7000 | ~0.36 | ~0.36 | ~0.36 Packet size is in bytes, while all packet rates are reported in mpps (aggregated). No noticeable regression has been observed (certainly everything is within the ± 5% margin of existing performance), aside from the 64B packet size case when multi-segment mbuf is enabled. This is expected, however, because of how Tx vectoriszed functions are incompatible with multi-segment mbufs on some PMDs. The PMD under use during these tests was the i40e (on a Intel X710 NIC), which indeed doesn't support vectorized Tx functions with multi-segment mbufs. --- v8 (non-RFC): - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from temporary directory (also for RHEL)"); - Address Ciara's and Ilya's comments: - Drop the dp_packet_mbuf_tail() function and use only the already existing dp_packet_tail(); - Fix bug in dpdk_do_tx_copy() where the error return from dpdk_prep_tx_buf() was being wrongly checked; - Use dpdk_buf_alloc() and free_dpdk_buf() instead of rte_pktmbuf_alloc() and rte_pktmbuf_free(); - Fix some other code style and duplication issues pointed out. - Refactored dp_packet_shift(), dp_packet_resize__() and dp_packet_put*() functions to work within the bounds of existing mbufs only; - Fix dp_packet_clear() which wasn't correctly clearing / freeing other mbufs in the chain for chains with more than a single mbuf; - dp_packet layer functions (such as dp_packet_l3()) now check if the header is within the first mbuf, when using mbufs; - Move patch 08/13 to before patch 04/13, since dp_packet_set_size() was refactored to use free_dpdk_buf(); - Fix wrong rte_memcpy() when performing dp_packet_clone() which was leading to memory corruption; - Modified the added tests to account for some of the above changes; - Run performance tests, compiling results and adding them to the cover letter; - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc, together with a "Performance notes" sub-section reflecting the findings mentioned above in the cover letter. v7: - Rebase on master 5e720da ("erspan: fix invalid erspan version."); - Address Ilya comments; - Fix non-DPDK build; - Serialise the access of non pmds to allocation and free of mbufs by using a newly introduced mutex. - Add a new set of tests that integrates with the recently added DPDK testsuite. These focus on allocating dp_packets, with a single or multiple mbufs, from an instantiated mempool and performing several operations on those, verifying if the data at the end matches what's expected; - Fix bugs found by the new tests: - dp_packet_shift() wasn't taking into account shift lefts; - dp_packet_resize__() was misusing and miscalculating the tailrooms and headrooms, ending up calculating the wrong number of segments that needed allocation; - An mbuf's end was being miscalculated both in dp_packet_tail, dp_packet_mbuf_tail() and dp_packet_end(); - dp_packet_set_size() was not updating the number of chained segments 'nb_segs'; - Add support for multi-seg mbufs in dp_packet_clear(). v6: - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session reuse."); - Further improve dp_packet_put_uninit() and dp_packet_shift() to support multi-seg mbufs; - Add support for multi-seg mbufs in dp_packet_l4_size() and improve other helper funcs, such as dp_packet_tail() and dp_ packet_tailroom(). - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_ put_zeros(), as well as dp_packet_resize__() - allocating new mbufs and linking them together; Restructured patchset: - Squash patch 5 into patch 6, since they were both related to copying data while handling multi-seg mbufs; - Split patch 4 into two separate patches - one that introduces the changes in helper functions to deal with multi-seg mbufs and two others for the shift() and put_uninit() functionality; - Move patch 4 to before patch 3, so that ihelper functions come before functionality improvement that rely on those helpers. v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters from getting redundantly incremented"); - Sugesh's comments have been addressed: - Changed dp_packet_set_data() and dp_packet_set_size() logic to make them independent of each other; - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_ size() are independent; - dp_packet_clone_with_headroom() now has split functions for handling DPDK sourced packets and non-DPDK packets; - Modified various functions in dp-packet.h to account for multi-seg mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail() and dp_packet_at(); - Added support for shifting packet data in multi-seg mbufs, using dp_packet_shift(); - Fixed some minor inconsistencies. Note that some of the changes in v5 have been contributed by Mark Kavanagh as well. v4: - restructure patchset - account for 128B ARM cacheline when sizing mbufs Mark Kavanagh (4): netdev-dpdk: fix mbuf sizing dp-packet: Init specific mbuf fields. netdev-dpdk: copy large packet to multi-seg. mbufs netdev-dpdk: support multi-segment jumbo frames Michael Qiu (1): dp-packet: copy data from multi-seg. DPDK mbuf Tiago Lam (8): dp-packet: Fix allocated size on DPDK init. netdev-dpdk: Serialise non-pmds mbufs' alloc/free. dp-packet: Fix data_len handling multi-seg mbufs. dp-packet: Handle multi-seg mbufs in helper funcs. dp-packet: Handle multi-seg mbufs in put_uninit(). dp-packet: Handle multi-seg mubfs in shift() func. dp-packet: Handle multi-seg mbufs in resize__(). dpdk-tests: Add test coverage for multi-seg mbufs. Documentation/topics/dpdk/jumbo-frames.rst | 41 +++ NEWS | 1 + lib/dp-packet.c | 243 +++++++++++++- lib/dp-packet.h | 319 ++++++++++++++---- lib/dpdk.c | 8 + lib/netdev-dpdk.c | 213 ++++++++++-- lib/netdev-dpdk.h | 2 + tests/automake.mk | 10 +- tests/dpdk-packet-mbufs.at | 7 + tests/system-dpdk-testsuite.at | 1 + tests/test-dpdk-mbufs.c | 518 +++++++++++++++++++++++++++++ vswitchd/vswitch.xml | 20 ++ 12 files changed, 1273 insertions(+), 110 deletions(-) create mode 100644 tests/dpdk-packet-mbufs.at create mode 100644 tests/test-dpdk-mbufs.c -- 2.7.4 _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev