On 14/08/2018 11:17, Ilya Maximets wrote:
> On 14.08.2018 12:45, Lam, Tiago wrote:
>> On 09/08/2018 09:27, Ilya Maximets wrote:
>>> Hmm. I found that this series modifies only dpdk related components
>>> and doesn't pay any attention to others.
>>> dp_packet API was modified to respect the segmented packets, but
>>> there are many places, where code uses packet data directly using
>>> only dp_packet_data() pointer and dp_packet_size().
>>> For example, at least following functions will read the wrong memory
>>> and probably generate segfault:
>>>
>>> * All the netdev-{linux, dummy, bsd} send functions, because
>>> they are using code like this:
>>>
>>> write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
>>>
>>> * Tunnel extracting functions like 'udp_extract_tnl_md', because
>>> they're trying to calculate packet checksums using raw calls to
>>> 'csum_continue'.
>>>
>>> * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
>>>
>>> * Possibly, much more functions.
>>>
>>
>> Hi Ilya (and All),
>>
>> I went through the uses that call dp_packet_data(), and some of the
>> cases were covered already, like the "udp_extract_tnl_md" case you
>> mention - because the dp_packet API is changed to not let callers pull
>> headers after the first segment.
>
> Even if headers are in the first segment, L4 checksums includes checksum
> for the *whole packet data*, not only headers/pseudo-headers.
> Look at the calls to "csum_continue".
>
I was referring to udp_extract_tnl_md()'s use of
netdev_tnl_ip_extract_tnl_md() to fetch the UDP header, the use of
csum_continue() I kind of include it in the below category where the
whole data is needed contiguously. But you're correct that
csum_continue() is used in udp_extract_tnl_md().
>>
>> The other cases I consider all to be of the same category: they all need
>> the data contiguous in memory, such as the write() syscall you mention
>> above. Do you see any other corner cases that I may be missing?
>
> I made just a quick git grep. There are, possibly, other cases. Need to
> make deeper research.
>
Thanks, Ilya. I've been doing the same thing now to find this usage of
dp_packet_data(). I'll confirm my findings again and send v8 once I've
the below approach in place.
>>
>> So, to cover this category, where data is expected to be contiguous in
>> memory, we'd need to copy it to a someplace contigous (most likely
>> system's memory). The approach I have in mind is to extend the dp_packet
>> API in order to add a new function that "linearizes" the multi-segmented
>> data. This new function would be called from places such as the write()
>> call in netdev-linux you mention above, instead of the call to
>> dp_packet_data() in place now, and would convert between a DPBUF_DPDK
>> packet into a DPBUF_MALLOC packet. The packet would then continue to be
>> used on as a regular DPBUF_MALLOC packet (instead of a DPBUF_DPDK
>> packet) which would be properly free'ed and returned back to the DPDK
>> mempool's upon dp_packet_delete() / dp_packet_uninit().
>>
>> An upside of this approach is that it would also enable us to follow the
>> ideia proposed on [1] (by yourself), where we could "easily" tranform a
>> DPDK packet into a system packet if no more space exists when calling
>> dp_packet_reuse__().
>
> That is OK. One thing is it'll be good to not copy the whole packet,
> if it's not segmented to avoid perfromance drop for the usual cases.
> This could be hard to cover with generic API.
>
Agreed here. This operation would only be done if the packet is
non-contiguous, otherwise it will remain untouched.
>>
>> This would enable us to use multi-segment mbufs, and later on TSO,
>> between DPDK ports, while still enabling the use cases between DPDK
>> ports and non-DPDK ports. For this latter case, there'd be a one-off
>> (expensive) operation where system's memory needs to be allocated and
>> copied to, but documentation would be further improved to explain this,
>> and I believe warning the user with a WARN log message would also be
>> helpful here as well.
>
> About TSO, there are a lot more questions. At leas you will have to
> implement not only re-assembling, but also splitting the packet in
> parts in case of low MTU on ports that doesn't support TSO.
> Checksum recalculations also should be covered. I'll add some
> comments to you TSO series.
>
Thanks, Ilya. I'll look forward to it.
Tiago.
>>
>> What do you think?
>>
>> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-July/350009.html
>>
>>> If It's true, when I'm afraid that this patch set is far from being
>>> ready for merge.
>>>
>>> On 25.07.2018 17:19, Tiago Lam wrote:
>>>> Overview
>>>> ========
>>>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>>>> Multi-segment mbufs are typically used when the size of an mbuf is
>>>> insufficient to contain the entirety of a packet's data. Instead, the
>>>> data is split across numerous mbufs, each carrying a portion, or
>>>> 'segment', of the packet data. Mbufs are chained via their 'next'
>>>> attribute (an mbuf pointer).
>>>>
>>>> The main motivation behind the support for multi-segment mbufs is to
>>>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>>>> planned to be introduced after this series.
>>>>
>>>> Use Cases
>>>> =========
>>>> i. Handling oversized (guest-originated) frames, which are marked
>>>> for hardware accelration/offload (TSO, for example).
>>>>
>>>> Packets which originate from a non-DPDK source may be marked for
>>>> offload; as such, they may be larger than the permitted ingress
>>>> interface's MTU, and may be stored in an oversized dp-packet. In
>>>> order to transmit such packets over a DPDK port, their contents
>>>> must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>>> its current implementation, that function only copies data into
>>>> a single mbuf; if the space available in the mbuf is exhausted,
>>>> but not all packet data has been copied, then it is lost.
>>>> Similarly, when cloning a DPDK mbuf, it must be considered
>>>> whether that mbuf contains multiple segments. Both issues are
>>>> resolved within this patchset.
>>>>
>>>> ii. Handling jumbo frames.
>>>>
>>>> While OvS already supports jumbo frames, it does so by increasing
>>>> mbuf size, such that the entirety of a jumbo frame may be handled
>>>> in a single mbuf. This is certainly the preferred, and most
>>>> performant approach (and remains the default).
>>>>
>>>> Enabling multi-segment mbufs
>>>> ============================
>>>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>>>> user must decide on which approach to adopt on init. The introduction
>>>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>>>
>>>> This is a global boolean value, which determines how jumbo frames are
>>>> represented across all DPDK ports. In the absence of a user-supplied
>>>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>>>> mbufs must be explicitly enabled / single-segment mbufs remain the
>>>> default.
>>>>
>>>> Setting the field is identical to setting existing DPDK-specific OVSDB
>>>> fields:
>>>>
>>>> ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>>> ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>>> ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>>>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>>
>>>> Performance notes (based on v8)
>>>> =================
>>>> In order to test for regressions in performance, tests were run on top
>>>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>>>> mbufs option enabled and disabled.
>>>>
>>>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>>>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>>>
>>>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>>>> -------------------------------------------------------------
>>>> p2p | 64 | ~22.7 | ~22.65 | ~18.3
>>>> p2p | 1500 | ~1.6 | ~1.6 | ~1.6
>>>> p2p | 7000 | ~0.36 | ~0.36 | ~0.36
>>>> pvp | 64 | ~6.7 | ~6.7 | ~6.3
>>>> pvp | 1500 | ~1.6 | ~1.6 | ~1.6
>>>> pvp | 7000 | ~0.36 | ~0.36 | ~0.36
>>>>
>>>> Packet size is in bytes, while all packet rates are reported in mpps
>>>> (aggregated).
>>>>
>>>> No noticeable regression has been observed (certainly everything is
>>>> within the ± 5% margin of existing performance), aside from the 64B
>>>> packet size case when multi-segment mbuf is enabled. This is
>>>> expected, however, because of how Tx vectoriszed functions are
>>>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>>>> use during these tests was the i40e (on a Intel X710 NIC), which
>>>> indeed doesn't support vectorized Tx functions with multi-segment
>>>> mbufs.
>>>>
>>>> ---
>>>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>>> - Add Ben's proposed fix for automake's warning;
>>>> - Add a note to cover letter to explain this is preperatory work for
>>>> TSO /
>>>> GRO.
>>>>
>>>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command
>>>> parser
>>>> error paths.");
>>>> - Address Darrell's comments:
>>>> - The changes in dp_packet_resize__() were trying to alleviate the
>>>> call
>>>> to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the
>>>> available
>>>> tailroom space when no more headroom space is available, and
>>>> vice-versa.
>>>> However, this was breaking the API for the dp_packet_resize__()
>>>> function (only called in dp_packet_prealloc_tailroom() and
>>>> dp_packet_prealloc_headroom()), which doesn't seem to suit the
>>>> purposes
>>>> for DPDK packets.
>>>> Instead, and because this is isolate funtionality, revert to the
>>>> previous state where dp_packet_resize__() is not supported for DPDK
>>>> packets. Hence, then patch 08/14 has been dropped.
>>>> - Additionally, fix the tests that were relying on the removed
>>>> functionality.
>>>>
>>>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg
>>>> use
>>>> after free.");
>>>> - Address Eelco's comments:
>>>> - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>>> leftover from rebase. Only call should be in dpdk_mp_get();
>>>> - Remove NEWS line added by mistake during rebase (about adding
>>>> experimental vhost zero copy support).
>>>> - Address Ian's comments:
>>>> - Drop patch 01 from previous series entirely;
>>>> - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>>> dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>>> - Patch (now) 11/14 modifies dpdk_mp_create() to check if
>>>> multi-segment
>>>> mbufs is enabled, in which case it calculates the new "mbuf_size"
>>>> to be
>>>> used;
>>>> - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>>> conditionally.
>>>> - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the
>>>> current
>>>> DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes
>>>> using
>>>> the shared mempool model (runs out of memory);
>>>> - Add new examples for when multi-segment mbufs are enabled in
>>>> topics/dpdk/memory.rst, and a reference to
>>>> topics/dpdk/jumbo-frames.rst
>>>> (now patch 11/14).
>>>>
>>>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in
>>>> qos-add
>>>> command."):
>>>> - A new patch (now 01/15) has been introduced to differentiate
>>>> between
>>>> MTU and mbuf size when creating the mempool. This is because as
>>>> part of
>>>> the new support for both per port and shared mempools, mempools
>>>> were
>>>> being reused based on the mbuf size, and for multi-segment mbufs
>>>> the
>>>> mbuf size can end up being the same for all mbufs;
>>>> - A couple of other patches (now 02/15 and 12/15) have been modified
>>>> as
>>>> part of the rebase, but only to adapt to the code changes to
>>>> "Support
>>>> both shared and per port mempools.", no functionality should have
>>>> been
>>>> changed.
>>>>
>>>> v3:
>>>> - Address Eelco's comment:
>>>> - Fix the ovs_assert() introduced in v2 in __packet_set_data(),
>>>> which
>>>> wasn't correctly asserting that the passed 'v' was smaller than
>>>> the
>>>> first mbuf's buf_len.
>>>>
>>>> v2:
>>>> - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>>> error.");
>>>> - Address Eelco's comments:
>>>> - In free_dpdk_buf(), use mbuf's struct address in dp_packet
>>>> instead of
>>>> casting;
>>>> - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>>> first mbuf in the chain;
>>>> - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>>> always true for DPBUF_DPDK packets;
>>>> - Assert in __packet_set_data() to enforce that data_off never goes
>>>> beyond the first mbuf in the chain.
>>>>
>>>> v1:
>>>> - v8 should have been sent as v1 really, as that's usually the approach
>>>> followed in OvS. That clearly didn't happen, so restarting the series
>>>> now. This also helps making it clear it is no longer an RFC series;
>>>> - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>>> output.");
>>>> - Address Eelco's comments:
>>>> - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>>> `pkt_len` and `data_len` can't be set to values bigger than the
>>>> available space. Also fix assigment to `data_len` which was
>>>> incorrectly being set to just`pkt_len`;
>>>> - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>>> why the mutex is needed;
>>>> - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>>> `DPBUF_DPDK` packets;
>>>> - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>>> - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read()
>>>> for
>>>> copying `DPBUF_DPDK` packets' data. Also, change it to return
>>>> appropriate and meaningful errors, instead of just "0" or "1";
>>>> - Change dpdk_prep_tx_buf() name's to
>>>> dpdk_clone_dp_packet_to_mbuf(),
>>>> and reuse dp_packet_mbuf_write() instead of manual copy;
>>>> - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>>> dpdk-multi-seg-mbufs requires a restart;
>>>> - Change dpdk_mp_create() to increase # mbufs used under the
>>>> multi-segment
>>>> mbufs approach;
>>>> - Increase test coverage by adding "end-to-end" tests that verify that
>>>> "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>>> successfully sent out;
>>>> - Some helper funcs such as dp_packet_tail() and dp_packet_end() were
>>>> moved
>>>> back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets,
>>>> to
>>>> minimise changes;
>>>> - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>>> after further testing;
>>>> - Structure changes:
>>>> - Drop patch 07/13 which is now unneeded;
>>>> - Two more patches added for extra test coverage. This is what
>>>> accounts
>>>> for the increase in size (+1 patch) in the series.
>>>>
>>>> v8 (non-RFC):
>>>> - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>>> temporary directory (also for RHEL)");
>>>> - Address Ciara's and Ilya's comments:
>>>> - Drop the dp_packet_mbuf_tail() function and use only the
>>>> already existing dp_packet_tail();
>>>> - Fix bug in dpdk_do_tx_copy() where the error return from
>>>> dpdk_prep_tx_buf() was being wrongly checked;
>>>> - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>>> rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>>> - Fix some other code style and duplication issues pointed out.
>>>> - Refactored dp_packet_shift(), dp_packet_resize__() and
>>>> dp_packet_put*() functions to work within the bounds of existing
>>>> mbufs only;
>>>> - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>>> other mbufs in the chain for chains with more than a single mbuf;
>>>> - dp_packet layer functions (such as dp_packet_l3()) now check if
>>>> the header is within the first mbuf, when using mbufs;
>>>> - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>>> was refactored to use free_dpdk_buf();
>>>> - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>>> leading to memory corruption;
>>>> - Modified the added tests to account for some of the above changes;
>>>> - Run performance tests, compiling results and adding them to the
>>>> cover letter;
>>>> - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>>> together with a "Performance notes" sub-section reflecting the
>>>> findings mentioned above in the cover letter.
>>>>
>>>> v7: - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>>> - Address Ilya comments;
>>>> - Fix non-DPDK build;
>>>> - Serialise the access of non pmds to allocation and free of mbufs
>>>> by
>>>> using a newly introduced mutex.
>>>> - Add a new set of tests that integrates with the recently added DPDK
>>>> testsuite. These focus on allocating dp_packets, with a single or
>>>> multiple mbufs, from an instantiated mempool and performing several
>>>> operations on those, verifying if the data at the end matches what's
>>>> expected;
>>>> - Fix bugs found by the new tests:
>>>> - dp_packet_shift() wasn't taking into account shift lefts;
>>>> - dp_packet_resize__() was misusing and miscalculating the
>>>> tailrooms
>>>> and headrooms, ending up calculating the wrong number of segments
>>>> that needed allocation;
>>>> - An mbuf's end was being miscalculated both in dp_packet_tail,
>>>> dp_packet_mbuf_tail() and dp_packet_end();
>>>> - dp_packet_set_size() was not updating the number of chained
>>>> segments
>>>> 'nb_segs';
>>>> - Add support for multi-seg mbufs in dp_packet_clear().
>>>>
>>>> v6: - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>>> reuse.");
>>>> - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>>> support multi-seg mbufs;
>>>> - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>>> improve other helper funcs, such as dp_packet_tail() and dp_
>>>> packet_tailroom().
>>>> - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>>> put_zeros(), as well as dp_packet_resize__() - allocating new
>>>> mbufs and linking them together;
>>>> Restructured patchset:
>>>> - Squash patch 5 into patch 6, since they were both related to
>>>> copying data while handling multi-seg mbufs;
>>>> - Split patch 4 into two separate patches - one that introduces the
>>>> changes in helper functions to deal with multi-seg mbufs and
>>>> two others for the shift() and put_uninit() functionality;
>>>> - Move patch 4 to before patch 3, so that ihelper functions come
>>>> before functionality improvement that rely on those helpers.
>>>>
>>>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>>> from getting redundantly incremented");
>>>> - Sugesh's comments have been addressed:
>>>> - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>>> make them independent of each other;
>>>> - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>>> size() are independent;
>>>> - dp_packet_clone_with_headroom() now has split functions for
>>>> handling DPDK sourced packets and non-DPDK packets;
>>>> - Modified various functions in dp-packet.h to account for multi-seg
>>>> mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>>> and dp_packet_at();
>>>> - Added support for shifting packet data in multi-seg mbufs, using
>>>> dp_packet_shift();
>>>> - Fixed some minor inconsistencies.
>>>>
>>>> Note that some of the changes in v5 have been contributed by Mark
>>>> Kavanagh as well.
>>>>
>>>> v4: - restructure patchset
>>>> - account for 128B ARM cacheline when sizing mbufs
>>>>
>>>> Mark Kavanagh (4):
>>>> netdev-dpdk: fix mbuf sizing
>>>> dp-packet: Init specific mbuf fields.
>>>> netdev-dpdk: copy large packet to multi-seg. mbufs
>>>> netdev-dpdk: support multi-segment jumbo frames
>>>>
>>>> Michael Qiu (1):
>>>> dp-packet: copy data from multi-seg. DPDK mbuf
>>>>
>>>> Tiago Lam (8):
>>>> dp-packet: Fix allocated size on DPDK init.
>>>> netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>>> dp-packet: Fix data_len handling multi-seg mbufs.
>>>> dp-packet: Handle multi-seg mbufs in helper funcs.
>>>> dp-packet: Handle multi-seg mubfs in shift() func.
>>>> dpdk-tests: Add uni-tests for multi-seg mbufs.
>>>> dpdk-tests: Accept other configs in OVS_DPDK_START
>>>> dpdk-tests: End-to-end tests for multi-seg mbufs.
>>>>
>>>> Documentation/topics/dpdk/jumbo-frames.rst | 52 +++
>>>> Documentation/topics/dpdk/memory.rst | 36 ++
>>>> NEWS | 1 +
>>>> lib/dp-packet.c | 173 +++++++++-
>>>> lib/dp-packet.h | 214 ++++++++++--
>>>> lib/dpdk.c | 8 +
>>>> lib/netdev-dpdk.c | 244 +++++++++++---
>>>> lib/netdev-dpdk.h | 2 +
>>>> tests/automake.mk | 10 +-
>>>> tests/dpdk-packet-mbufs.at | 7 +
>>>> tests/system-dpdk-macros.at | 6 +-
>>>> tests/system-dpdk-testsuite.at | 1 +
>>>> tests/system-dpdk.at | 65 ++++
>>>> tests/test-dpdk-mbufs.c | 513
>>>> +++++++++++++++++++++++++++++
>>>> vswitchd/vswitch.xml | 22 ++
>>>> 15 files changed, 1277 insertions(+), 77 deletions(-)
>>>> create mode 100644 tests/dpdk-packet-mbufs.at
>>>> create mode 100644 tests/test-dpdk-mbufs.c
>>>>
>>
>>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev