Hi Ilya,
Given the timeline, I'm replying in-line to the critical parts of the
email (the ones that question the feature itself). I'll come back and
reply to the rest later.
On 24/07/2018 17:55, Ilya Maximets wrote:
> Hi.
> Just wanted to add some comments for the use-cases and testing methodology.
> See inline.
>
> And I'm actually not sure if there any profit from this patch set?
> It looks like an internal mbuf handling rework that only degrades
> the performance and complicates the code.
>
> Please, don't consider me as merge blocker. I just want to understand
> why you think we need this 1200 LOCs?
>
Of which 1/2 are related to tests. Granted it is still LOC, but don't
directly impact the dp-packet/netdev-dpdk modules. I believe that
distinction is important.
> ---
> About 'resize()' related discussion:
> Maybe it's worth to allow dp_packet APIs to return different dp_packet.
> In this case we'll be able to just clone the packet to malloced memory
> and resize in cases of not enough headroom available.
> Like:
> packet = eth_push_vlan(packet, vlan->vlan_tpid, vlan->vlan_tci);
> or
> eth_push_vlan(&packet, vlan->vlan_tpid, vlan->vlan_tci);
>
> This will have a little performance penalty in compare with data shifting
> inside the mbuf, but will be much more elegant, and will allow to eliminate
> all the OVS_NOT_REACHED cases.
>
>
> Best regards, Ilya Maximets.
>
> On 24.07.2018 17:25, Tiago Lam wrote:
>> Overview
>> ========
>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>> Multi-segment mbufs are typically used when the size of an mbuf is
>> insufficient to contain the entirety of a packet's data. Instead, the
>> data is split across numerous mbufs, each carrying a portion, or
>> 'segment', of the packet data. mbufs are chained via their 'next'
>> attribute (an mbuf pointer).
>>
>> Use Cases
>> =========
>> i. Handling oversized (guest-originated) frames, which are marked
>> for hardware accelration/offload (TSO, for example).
>
> This is not a real use case as vhost doesn't support TSO and other
> offloading that requires handling segmented mbufs, so, guests are
> not allowed to send packets larger than MTU. netdev-linux interfaces
> also not allowed to receive packets larger than MTU.
>
> TSO and other offloading support will require much more work, and,
> in fact, the profit from it is controversial.
>
There might be some context missing here. As you mentioned, TSO is not
possible at the moment with OvS-DPDK. However, we are working on
bringing TSO / GRO to OvS-DPDK and this serves as preparatory work to
enable those features. These are features some customers are requesting
for, and closes a feature gap between OvS and OvS-DPDK.
>>
>> Packets which originate from a non-DPDK source may be marked for
>> offload; as such, they may be larger than the permitted ingress
>> interface's MTU, and may be stored in an oversized dp-packet. In
>> order to transmit such packets over a DPDK port, their contents
>> must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>> its current implementation, that function only copies data into
>> a single mbuf; if the space available in the mbuf is exhausted,
>> but not all packet data has been copied, then it is lost.
>> Similarly, when cloning a DPDK mbuf, it must be considered
>> whether that mbuf contains multiple segments. Both issues are
>> resolved within this patchset.>
>> ii. Handling jumbo frames.
>
> Different internal representation of big packets. Why we need this?
>
I believe the above answer should cover this as well.
>>
>> While OvS already supports jumbo frames, it does so by increasing
>> mbuf size, such that the entirety of a jumbo frame may be handled
>> in a single mbuf. This is certainly the preferred, and most
>> performant approach (and remains the default).
>>
>> Enabling multi-segment mbufs
>> ============================
>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>> user must decide on which approach to adopt on init. The introduction
>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>
>> This is a global boolean value, which determines how jumbo frames are
>> represented across all DPDK ports. In the absence of a user-supplied
>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>> mbufs must be explicitly enabled / single-segment mbufs remain the
>> default.
>>
>> Setting the field is identical to setting existing DPDK-specific OVSDB
>> fields:
>>
>> ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>> ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>> ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>
>> Performance notes (based on v8)
>> =================
>> In order to test for regressions in performance, tests were run on top
>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>> mbufs option enabled and disabled.
>>
>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>
>
> I'm sorry, but these performance tests and below analysis are mostly useless.
> It's makes no sense to test 1.5K or 7K bytes packets with 10G interface.
> OVS is capable to handle this traffic without any drops in almost any
> configuration. My PVP test with bonded phy easily takes almost 2 times
> higher traffic rates using 2x10G interfaces in balanced bonding with disabled
> caches. Below table just reports that you're reaching the limit of the
> 10G physical NICs.
>
> Just wanted to point on that. Next time, please, use better test scenarios.
>
> I expect that enabling multi-segment mbufs will reduce performance in all
> scenarios for a few percents.
>
>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>> -------------------------------------------------------------
>> p2p | 64 | ~22.7 | ~22.65 | ~18.3
>> p2p | 1500 | ~1.6 | ~1.6 | ~1.6
>> p2p | 7000 | ~0.36 | ~0.36 | ~0.36
>> pvp | 64 | ~6.7 | ~6.7 | ~6.3
>> pvp | 1500 | ~1.6 | ~1.6 | ~1.6
>> pvp | 7000 | ~0.36 | ~0.36 | ~0.36
>>
>> Packet size is in bytes, while all packet rates are reported in mpps
>> (aggregated).
>>
>> No noticeable regression has been observed (certainly everything is
>> within the ± 5% margin of existing performance), aside from the 64B
>> packet size case when multi-segment mbuf is enabled. This is
>> expected, however, because of how Tx vectoriszed functions are
>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>> use during these tests was the i40e (on a Intel X710 NIC), which
>> indeed doesn't support vectorized Tx functions with multi-segment
>> mbufs.
>>
>> ---
>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command
>> parser
>> error paths.");
>> - Address Darrell's comments:
>> - The changes in dp_packet_resize__() were trying to alleviate the call
>> to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the
>> available
>> tailroom space when no more headroom space is available, and
>> vice-versa.
>> However, this was breaking the API for the dp_packet_resize__()
>> function (only called in dp_packet_prealloc_tailroom() and
>> dp_packet_prealloc_headroom()), which doesn't seem to suit the
>> purposes
>> for DPDK packets.
>> Instead, and because this is isolate funtionality, revert to the
>> previous state where dp_packet_resize__() is not supported for DPDK
>> packets. Hence, then patch 08/14 has been dropped.
>> - Additionally, fix the tests that were relying on the removed
>> functionality.
>>
>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>> after free.");
>> - Address Eelco's comments:
>> - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>> leftover from rebase. Only call should be in dpdk_mp_get();
>> - Remove NEWS line added by mistake during rebase (about adding
>> experimental vhost zero copy support).
>> - Address Ian's comments:
>> - Drop patch 01 from previous series entirely;
>> - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>> dpdk_mp_create() to get the correct "mbuf_size" to be used;
>> - Patch (now) 11/14 modifies dpdk_mp_create() to check if multi-segment
>> mbufs is enabled, in which case it calculates the new "mbuf_size" to
>> be
>> used;
>> - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>> conditionally.
>> - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the
>> current
>> DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes
>> using
>> the shared mempool model (runs out of memory);
>> - Add new examples for when multi-segment mbufs are enabled in
>> topics/dpdk/memory.rst, and a reference to topics/dpdk/jumbo-frames.rst
>> (now patch 11/14).
>>
>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in qos-add
>> command."):
>> - A new patch (now 01/15) has been introduced to differentiate between
>> MTU and mbuf size when creating the mempool. This is because as part
>> of
>> the new support for both per port and shared mempools, mempools were
>> being reused based on the mbuf size, and for multi-segment mbufs the
>> mbuf size can end up being the same for all mbufs;
>> - A couple of other patches (now 02/15 and 12/15) have been modified as
>> part of the rebase, but only to adapt to the code changes to "Support
>> both shared and per port mempools.", no functionality should have
>> been
>> changed.
>>
>> v3:
>> - Address Eelco's comment:
>> - Fix the ovs_assert() introduced in v2 in __packet_set_data(), which
>> wasn't correctly asserting that the passed 'v' was smaller than the
>> first mbuf's buf_len.
>>
>> v2:
>> - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>> error.");
>> - Address Eelco's comments:
>> - In free_dpdk_buf(), use mbuf's struct address in dp_packet instead
>> of
>> casting;
>> - Remove unneeded variable in dp_packet_set_size(), pointing to the
>> first mbuf in the chain;
>> - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>> always true for DPBUF_DPDK packets;
>> - Assert in __packet_set_data() to enforce that data_off never goes
>> beyond the first mbuf in the chain.
>>
>> v1:
>> - v8 should have been sent as v1 really, as that's usually the approach
>> followed in OvS. That clearly didn't happen, so restarting the series
>> now. This also helps making it clear it is no longer an RFC series;
>> - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>> output.");
>> - Address Eelco's comments:
>> - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>> `pkt_len` and `data_len` can't be set to values bigger than the
>> available space. Also fix assigment to `data_len` which was
>> incorrectly being set to just`pkt_len`;
>> - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>> why the mutex is needed;
>> - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>> `DPBUF_DPDK` packets;
>> - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>> - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read()
>> for
>> copying `DPBUF_DPDK` packets' data. Also, change it to return
>> appropriate and meaningful errors, instead of just "0" or "1";
>> - Change dpdk_prep_tx_buf() name's to dpdk_clone_dp_packet_to_mbuf(),
>> and reuse dp_packet_mbuf_write() instead of manual copy;
>> - Add note vswitchd/vswitch.xml to make it clear the enabling of
>> dpdk-multi-seg-mbufs requires a restart;
>> - Change dpdk_mp_create() to increase # mbufs used under the
>> multi-segment
>> mbufs approach;
>> - Increase test coverage by adding "end-to-end" tests that verify that
>> "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>> successfully sent out;
>> - Some helper funcs such as dp_packet_tail() and dp_packet_end() were
>> moved
>> back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets, to
>> minimise changes;
>> - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>> after further testing;
>> - Structure changes:
>> - Drop patch 07/13 which is now unneeded;
>> - Two more patches added for extra test coverage. This is what accounts
>> for the increase in size (+1 patch) in the series.
>>
>> v8 (non-RFC):
>> - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>> temporary directory (also for RHEL)");
>> - Address Ciara's and Ilya's comments:
>> - Drop the dp_packet_mbuf_tail() function and use only the
>> already existing dp_packet_tail();
>> - Fix bug in dpdk_do_tx_copy() where the error return from
>> dpdk_prep_tx_buf() was being wrongly checked;
>> - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>> rte_pktmbuf_alloc() and rte_pktmbuf_free();
>> - Fix some other code style and duplication issues pointed out.
>> - Refactored dp_packet_shift(), dp_packet_resize__() and
>> dp_packet_put*() functions to work within the bounds of existing
>> mbufs only;
>> - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>> other mbufs in the chain for chains with more than a single mbuf;
>> - dp_packet layer functions (such as dp_packet_l3()) now check if
>> the header is within the first mbuf, when using mbufs;
>> - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>> was refactored to use free_dpdk_buf();
>> - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>> leading to memory corruption;
>> - Modified the added tests to account for some of the above changes;
>> - Run performance tests, compiling results and adding them to the
>> cover letter;
>> - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>> together with a "Performance notes" sub-section reflecting the
>> findings mentioned above in the cover letter.
>>
>> v7: - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>> - Address Ilya comments;
>> - Fix non-DPDK build;
>> - Serialise the access of non pmds to allocation and free of mbufs by
>> using a newly introduced mutex.
>> - Add a new set of tests that integrates with the recently added DPDK
>> testsuite. These focus on allocating dp_packets, with a single or
>> multiple mbufs, from an instantiated mempool and performing several
>> operations on those, verifying if the data at the end matches what's
>> expected;
>> - Fix bugs found by the new tests:
>> - dp_packet_shift() wasn't taking into account shift lefts;
>> - dp_packet_resize__() was misusing and miscalculating the tailrooms
>> and headrooms, ending up calculating the wrong number of segments
>> that needed allocation;
>> - An mbuf's end was being miscalculated both in dp_packet_tail,
>> dp_packet_mbuf_tail() and dp_packet_end();
>> - dp_packet_set_size() was not updating the number of chained
>> segments
>> 'nb_segs';
>> - Add support for multi-seg mbufs in dp_packet_clear().
>>
>> v6: - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>> reuse.");
>> - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>> support multi-seg mbufs;
>> - Add support for multi-seg mbufs in dp_packet_l4_size() and
>> improve other helper funcs, such as dp_packet_tail() and dp_
>> packet_tailroom().
>> - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>> put_zeros(), as well as dp_packet_resize__() - allocating new
>> mbufs and linking them together;
>> Restructured patchset:
>> - Squash patch 5 into patch 6, since they were both related to
>> copying data while handling multi-seg mbufs;
>> - Split patch 4 into two separate patches - one that introduces the
>> changes in helper functions to deal with multi-seg mbufs and
>> two others for the shift() and put_uninit() functionality;
>> - Move patch 4 to before patch 3, so that ihelper functions come
>> before functionality improvement that rely on those helpers.
>>
>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>> from getting redundantly incremented");
>> - Sugesh's comments have been addressed:
>> - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>> make them independent of each other;
>> - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>> size() are independent;
>> - dp_packet_clone_with_headroom() now has split functions for
>> handling DPDK sourced packets and non-DPDK packets;
>> - Modified various functions in dp-packet.h to account for multi-seg
>> mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>> and dp_packet_at();
>> - Added support for shifting packet data in multi-seg mbufs, using
>> dp_packet_shift();
>> - Fixed some minor inconsistencies.
>>
>> Note that some of the changes in v5 have been contributed by Mark
>> Kavanagh as well.
>>
>> v4: - restructure patchset
>> - account for 128B ARM cacheline when sizing mbufs
>>
>> Mark Kavanagh (4):
>> netdev-dpdk: fix mbuf sizing
>> dp-packet: Init specific mbuf fields.
>> netdev-dpdk: copy large packet to multi-seg. mbufs
>> netdev-dpdk: support multi-segment jumbo frames
>>
>> Michael Qiu (1):
>> dp-packet: copy data from multi-seg. DPDK mbuf
>>
>> Tiago Lam (8):
>> dp-packet: Fix allocated size on DPDK init.
>> netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>> dp-packet: Fix data_len handling multi-seg mbufs.
>> dp-packet: Handle multi-seg mbufs in helper funcs.
>> dp-packet: Handle multi-seg mubfs in shift() func.
>> dpdk-tests: Add uni-tests for multi-seg mbufs.
>> dpdk-tests: Accept other configs in OVS_DPDK_START
>> dpdk-tests: End-to-end tests for multi-seg mbufs.
>>
>> Documentation/topics/dpdk/jumbo-frames.rst | 52 +++
>> Documentation/topics/dpdk/memory.rst | 36 ++
>> NEWS | 1 +
>> lib/dp-packet.c | 173 +++++++++-
>> lib/dp-packet.h | 214 ++++++++++--
>> lib/dpdk.c | 8 +
>> lib/netdev-dpdk.c | 244 +++++++++++---
>> lib/netdev-dpdk.h | 2 +
>> tests/automake.mk | 10 +-
>> tests/dpdk-packet-mbufs.at | 7 +
>> tests/system-dpdk-macros.at | 6 +-
>> tests/system-dpdk-testsuite.at | 1 +
>> tests/system-dpdk.at | 65 ++++
>> tests/test-dpdk-mbufs.c | 513
>> +++++++++++++++++++++++++++++
>> vswitchd/vswitch.xml | 22 ++
>> 15 files changed, 1277 insertions(+), 77 deletions(-)
>> create mode 100644 tests/dpdk-packet-mbufs.at
>> create mode 100644 tests/test-dpdk-mbufs.c
>>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev