On 09.08.2018 11:38, Lam, Tiago wrote:
> Hi Ilya,
>
> On 09/08/2018 09:27, Ilya Maximets wrote:
>> Hmm. I found that this series modifies only dpdk related components
>> and doesn't pay any attention to others.
>> dp_packet API was modified to respect the segmented packets, but
>> there are many places, where code uses packet data directly using
>> only dp_packet_data() pointer and dp_packet_size().
>> For example, at least following functions will read the wrong memory
>> and probably generate segfault:
>>
>
> No, that's not correct. Actually, this has been explained back in June
> [1], where Eelco raised the same concerns. There, I replied with:
>
> "This series takes the approach of not allocating new mbufs in
> `DPBUF_DPDK` dp_packets. This is to avoid the case where a
> dp_packet_put() call, for example, allocates two mbufs to create enough
> space, and returns a pointer to the data which isn't contiguous in
> memory (because it is spread across two mbufs). Most of the code in OvS
> that uses the dp_packet API relies on the assumption that memory
> returned is contiguous in memory."
>
> But please do refer to the link for a full explanation. In fact, if that
> was the case most of the tests in system-traffic would fail, and that
> would show a flaw in the design.
>
> Now, obviously this doesn't mean that in the odd case something may have
> slip through the cracks, but all the tests that have been run so far
> have given enough confidence.
>
> Tiago.
>
> [1] https://mail.openvswitch.org/pipermail/ovs-dev/2018-June/348502.html
I don't understand what you're talking about.
Just look at the following case:
1. Applying following patch, just to be sure:
---
diff --git a/lib/netdev-linux.c b/lib/netdev-linux.c
index 0c42268..bc995bc 100644
--- a/lib/netdev-linux.c
+++ b/lib/netdev-linux.c
@@ -1433,6 +1433,11 @@ netdev_linux_tap_batch_send(struct netdev *netdev_,
ssize_t retval;
int error;
+ if (packet->mbuf.nb_segs > 1) {
+ VLOG_ERR_RL(&rl,
+ "Sending multiseg mbuf as a plain data. nb_segs: %d",
+ packet->mbuf.nb_segs);
+ }
do {
retval = write(netdev->tap_fd, dp_packet_data(packet), size);
error = retval < 0 ? errno : 0;
---
2. Configure OVS:
./bin/ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
3. Create one bridge with datapath_type "netdev", add one vhostuserclient
port "vhost1" like this:
# ovs-vsctl show
Bridge "ovsdpdk_br0"
Port "vhost1"
Interface "vhost1"
type: dpdkvhostuserclient
options: {vhost-server-path="/vhost1.sock"}
Port "ovsdpdk_br0"
Interface "ovsdpdk_br0"
type: internal
4. Start VM.
5. Set up on host:
# ovs-vsctl set interface vhost1 mtu_request=9000
# ovs-vsctl set interface ovsdpdk_br0 mtu_request=9000
# ifconfig ovsdpdk_br0 192.168.114.20 netmask 255.255.255.0
6. Set up inside VM:
# ifconfig eth0 192.168.114.10 netmask 255.255.255.0 mtu 9000
7. Try to scp large file from VM to Host:
# scp file [email protected]:./
file 13% 2112KB 434.8KB/s - *stalled*
At the same time in tcpdump on the host side:
14:17:51.162168 IP (tos 0x8, ttl 64, id 30567, offset 0, flags [DF], proto TCP
(6), length 9000)
192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc831
(incorrect -> 0x1be9), seq 480:9428, ack 761, win 256, options [nop,nop,TS val
216598 ecr 1685821578], length 8948
14:17:51.582171 IP (tos 0x8, ttl 64, id 30568, offset 0, flags [DF], proto TCP
(6), length 9000)
192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc807
(incorrect -> 0x980f), seq 480:9428, ack 761, win 256, options [nop,nop,TS val
216640 ecr 1685821578], length 8948
14:17:52.422175 IP (tos 0x8, ttl 64, id 30569, offset 0, flags [DF], proto TCP
(6), length 9000)
192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc7b3
(incorrect -> 0x1b6b), seq 480:9428, ack 761, win 256, options [nop,nop,TS val
216724 ecr 1685821578], length 8948
14:17:54.115128 IP (tos 0x8, ttl 64, id 30570, offset 0, flags [DF], proto TCP
(6), length 9000)
192.168.114.10.43648 > 192.168.114.20.ssh: Flags [.], cksum 0xc70b
(incorrect -> 0x9713), seq 480:9428, ack 761, win 256, options [nop,nop,TS val
216892 ecr 1685821578], length 8948
And in OVS log:
2018-08-09T11:17:50Z|00002|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a
plain data. nb_segs: 4
2018-08-09T11:17:50Z|00003|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a
plain data. nb_segs: 5
2018-08-09T11:17:50Z|00004|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a
plain data. nb_segs: 4
2018-08-09T11:17:50Z|00005|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a
plain data. nb_segs: 5
2018-08-09T11:17:50Z|00006|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a
plain data. nb_segs: 5
2018-08-09T11:17:51Z|00007|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a
plain data. nb_segs: 5
2018-08-09T11:17:51Z|00008|netdev_linux(pmd8)|ERR|Sending multiseg mbuf as a
plain data. nb_segs: 5
SCP never finished because of corrupted data.
Same test works fine if other_config:dpdk-multi-seg-mbufs=false.
Best regards, Ilya Maximets.
>
>> * All the netdev-{linux, dummy, bsd} send functions, because
>> they are using code like this:
>>
>> write(netdev->tap_fd, dp_packet_data(packet), dp_packet_size(packet));
>>
>> * Tunnel extracting functions like 'udp_extract_tnl_md', because
>> they're trying to calculate packet checksums using raw calls to
>> 'csum_continue'.
>>
>> * CONTROLLER_UPCALL will aslo be broken, because it uses direct xmemdup.
>>
>> * Possibly, much more functions.
>>
>> If It's true, when I'm afraid that this patch set is far from being
>> ready for merge.
>>
>> On 25.07.2018 17:19, Tiago Lam wrote:
>>> Overview
>>> ========
>>> This patchset introduces support for multi-segment mbufs to OvS-DPDK.
>>> Multi-segment mbufs are typically used when the size of an mbuf is
>>> insufficient to contain the entirety of a packet's data. Instead, the
>>> data is split across numerous mbufs, each carrying a portion, or
>>> 'segment', of the packet data. Mbufs are chained via their 'next'
>>> attribute (an mbuf pointer).
>>>
>>> The main motivation behind the support for multi-segment mbufs is to
>>> later introduce TSO (use case i. below) / GRO in OvS-DPDK, which is
>>> planned to be introduced after this series.
>>>
>>> Use Cases
>>> =========
>>> i. Handling oversized (guest-originated) frames, which are marked
>>> for hardware accelration/offload (TSO, for example).
>>>
>>> Packets which originate from a non-DPDK source may be marked for
>>> offload; as such, they may be larger than the permitted ingress
>>> interface's MTU, and may be stored in an oversized dp-packet. In
>>> order to transmit such packets over a DPDK port, their contents
>>> must be copied to a DPDK mbuf (via dpdk_do_tx_copy). However, in
>>> its current implementation, that function only copies data into
>>> a single mbuf; if the space available in the mbuf is exhausted,
>>> but not all packet data has been copied, then it is lost.
>>> Similarly, when cloning a DPDK mbuf, it must be considered
>>> whether that mbuf contains multiple segments. Both issues are
>>> resolved within this patchset.
>>>
>>> ii. Handling jumbo frames.
>>>
>>> While OvS already supports jumbo frames, it does so by increasing
>>> mbuf size, such that the entirety of a jumbo frame may be handled
>>> in a single mbuf. This is certainly the preferred, and most
>>> performant approach (and remains the default).
>>>
>>> Enabling multi-segment mbufs
>>> ============================
>>> Multi-segment and single-segment mbufs are mutually exclusive, and the
>>> user must decide on which approach to adopt on init. The introduction
>>> of a new OVSDB field, 'dpdk-multi-seg-mbufs', facilitates this.
>>>
>>> This is a global boolean value, which determines how jumbo frames are
>>> represented across all DPDK ports. In the absence of a user-supplied
>>> value, 'dpdk-multi-seg-mbufs' defaults to false, i.e. multi-segment
>>> mbufs must be explicitly enabled / single-segment mbufs remain the
>>> default.
>>>
>>> Setting the field is identical to setting existing DPDK-specific OVSDB
>>> fields:
>>>
>>> ovs-vsctl set Open_vSwitch . other_config:dpdk-init=true
>>> ovs-vsctl set Open_vSwitch . other_config:dpdk-lcore-mask=0x10
>>> ovs-vsctl set Open_vSwitch . other_config:dpdk-socket-mem=4096,0
>>> ==> ovs-vsctl set Open_vSwitch . other_config:dpdk-multi-seg-mbufs=true
>>>
>>> Performance notes (based on v8)
>>> =================
>>> In order to test for regressions in performance, tests were run on top
>>> of master 88125d6 and v8 of this patchset, both with the multi-segment
>>> mbufs option enabled and disabled.
>>>
>>> VSperf was used to run the phy2phy_cont and pvp_cont tests with varying
>>> packet sizes of 64B, 1500B and 7000B, on a 10Gbps interface.
>>>
>>> Test | Size | Master | Multi-seg disabled | Multi-seg enabled
>>> -------------------------------------------------------------
>>> p2p | 64 | ~22.7 | ~22.65 | ~18.3
>>> p2p | 1500 | ~1.6 | ~1.6 | ~1.6
>>> p2p | 7000 | ~0.36 | ~0.36 | ~0.36
>>> pvp | 64 | ~6.7 | ~6.7 | ~6.3
>>> pvp | 1500 | ~1.6 | ~1.6 | ~1.6
>>> pvp | 7000 | ~0.36 | ~0.36 | ~0.36
>>>
>>> Packet size is in bytes, while all packet rates are reported in mpps
>>> (aggregated).
>>>
>>> No noticeable regression has been observed (certainly everything is
>>> within the ± 5% margin of existing performance), aside from the 64B
>>> packet size case when multi-segment mbuf is enabled. This is
>>> expected, however, because of how Tx vectoriszed functions are
>>> incompatible with multi-segment mbufs on some PMDs. The PMD under
>>> use during these tests was the i40e (on a Intel X710 NIC), which
>>> indeed doesn't support vectorized Tx functions with multi-segment
>>> mbufs.
>>>
>>> ---
>>> v7: - Rebase on master 024810c ("Prepare for post-2.10.0 (2.10.90).");
>>> - Add Ben's proposed fix for automake's warning;
>>> - Add a note to cover letter to explain this is preperatory work for
>>> TSO /
>>> GRO.
>>>
>>> v6: - Rebase on master d1b235d ("tests: Add test for ovn-nbctl's command
>>> parser
>>> error paths.");
>>> - Address Darrell's comments:
>>> - The changes in dp_packet_resize__() were trying to alleviate the
>>> call
>>> to OVS_NOT_REACHED() for DPDK packets, by trying to reuse the
>>> available
>>> tailroom space when no more headroom space is available, and
>>> vice-versa.
>>> However, this was breaking the API for the dp_packet_resize__()
>>> function (only called in dp_packet_prealloc_tailroom() and
>>> dp_packet_prealloc_headroom()), which doesn't seem to suit the
>>> purposes
>>> for DPDK packets.
>>> Instead, and because this is isolate funtionality, revert to the
>>> previous state where dp_packet_resize__() is not supported for DPDK
>>> packets. Hence, then patch 08/14 has been dropped.
>>> - Additionally, fix the tests that were relying on the removed
>>> functionality.
>>>
>>> v5: - Rebase on master 030958a0cc ("conntrack: Fix conn_update_state_alg use
>>> after free.");
>>> - Address Eelco's comments:
>>> - Remove dpdk_mp_sweep() call in netdev_dpdk_mempool_configure(), a
>>> leftover from rebase. Only call should be in dpdk_mp_get();
>>> - Remove NEWS line added by mistake during rebase (about adding
>>> experimental vhost zero copy support).
>>> - Address Ian's comments:
>>> - Drop patch 01 from previous series entirely;
>>> - Patch (now) 01/14 adds a new call to dpdk_buf_size() inside
>>> dpdk_mp_create() to get the correct "mbuf_size" to be used;
>>> - Patch (now) 11/14 modifies dpdk_mp_create() to check if
>>> multi-segment
>>> mbufs is enabled, in which case it calculates the new "mbuf_size"
>>> to be
>>> used;
>>> - In free_dpdk_buf() and dpdk_buf_alloc(), don't lock and unlock
>>> conditionally.
>>> - Add "per-port-memory=true" to test "Multi-segment mbufs Tx" as the
>>> current
>>> DPDK set up in system-dpdk-testsuite can't handle higher MTU sizes
>>> using
>>> the shared mempool model (runs out of memory);
>>> - Add new examples for when multi-segment mbufs are enabled in
>>> topics/dpdk/memory.rst, and a reference to
>>> topics/dpdk/jumbo-frames.rst
>>> (now patch 11/14).
>>>
>>> v4: - Rebase on master b22fb00 ("ovn-nbctl: Clarify error messages in
>>> qos-add
>>> command."):
>>> - A new patch (now 01/15) has been introduced to differentiate between
>>> MTU and mbuf size when creating the mempool. This is because as
>>> part of
>>> the new support for both per port and shared mempools, mempools were
>>> being reused based on the mbuf size, and for multi-segment mbufs the
>>> mbuf size can end up being the same for all mbufs;
>>> - A couple of other patches (now 02/15 and 12/15) have been modified
>>> as
>>> part of the rebase, but only to adapt to the code changes to
>>> "Support
>>> both shared and per port mempools.", no functionality should have
>>> been
>>> changed.
>>>
>>> v3:
>>> - Address Eelco's comment:
>>> - Fix the ovs_assert() introduced in v2 in __packet_set_data(),
>>> which
>>> wasn't correctly asserting that the passed 'v' was smaller than
>>> the
>>> first mbuf's buf_len.
>>>
>>> v2:
>>> - Rebase on master e7cd8cf ("db-ctl-base: Don't die in cmd_destroy() on
>>> error.");
>>> - Address Eelco's comments:
>>> - In free_dpdk_buf(), use mbuf's struct address in dp_packet
>>> instead of
>>> casting;
>>> - Remove unneeded variable in dp_packet_set_size(), pointing to the
>>> first mbuf in the chain;
>>> - Assert in dp_packet_set_size() to enforce that "pkt_len == v" is
>>> always true for DPBUF_DPDK packets;
>>> - Assert in __packet_set_data() to enforce that data_off never goes
>>> beyond the first mbuf in the chain.
>>>
>>> v1:
>>> - v8 should have been sent as v1 really, as that's usually the approach
>>> followed in OvS. That clearly didn't happen, so restarting the series
>>> now. This also helps making it clear it is no longer an RFC series;
>>> - Rebase on master e461481 ("ofp-meter: Fix ofp_print_meter_flags()
>>> output.");
>>> - Address Eelco's comments:
>>> - Change dp_packet_size() so that for `DPBUF_DPDK` packets their
>>> `pkt_len` and `data_len` can't be set to values bigger than the
>>> available space. Also fix assigment to `data_len` which was
>>> incorrectly being set to just`pkt_len`;
>>> - Improve `nonpmd_mp_mutex` comment with a better explanation as to
>>> why the mutex is needed;
>>> - Fix dp_packet_clear() to not call rte_pktmbuf_reset() for non
>>> `DPBUF_DPDK` packets;
>>> - Dropped `if` clause in dp_packet_l4_size(), keep just the `else`;
>>> - Change dp_packet_clone_with_headroom() to use rte_pktmbuf_read()
>>> for
>>> copying `DPBUF_DPDK` packets' data. Also, change it to return
>>> appropriate and meaningful errors, instead of just "0" or "1";
>>> - Change dpdk_prep_tx_buf() name's to
>>> dpdk_clone_dp_packet_to_mbuf(),
>>> and reuse dp_packet_mbuf_write() instead of manual copy;
>>> - Add note vswitchd/vswitch.xml to make it clear the enabling of
>>> dpdk-multi-seg-mbufs requires a restart;
>>> - Change dpdk_mp_create() to increase # mbufs used under the
>>> multi-segment
>>> mbufs approach;
>>> - Increase test coverage by adding "end-to-end" tests that verify that
>>> "dpdk-multi-seg-mbufs" is disabled by default and that a packet is
>>> successfully sent out;
>>> - Some helper funcs such as dp_packet_tail() and dp_packet_end() were
>>> moved
>>> back to be common between `DPBUF_DPDK` and non `DPBUF_DPDK` packets,
>>> to
>>> minimise changes;
>>> - Add some extra notes to "Performance notes" in jumbo-frames.rst doc,
>>> after further testing;
>>> - Structure changes:
>>> - Drop patch 07/13 which is now unneeded;
>>> - Two more patches added for extra test coverage. This is what
>>> accounts
>>> for the increase in size (+1 patch) in the series.
>>>
>>> v8 (non-RFC):
>>> - Rebase on master 88125d6 ("rhel: remove ovs-sim man page from
>>> temporary directory (also for RHEL)");
>>> - Address Ciara's and Ilya's comments:
>>> - Drop the dp_packet_mbuf_tail() function and use only the
>>> already existing dp_packet_tail();
>>> - Fix bug in dpdk_do_tx_copy() where the error return from
>>> dpdk_prep_tx_buf() was being wrongly checked;
>>> - Use dpdk_buf_alloc() and free_dpdk_buf() instead of
>>> rte_pktmbuf_alloc() and rte_pktmbuf_free();
>>> - Fix some other code style and duplication issues pointed out.
>>> - Refactored dp_packet_shift(), dp_packet_resize__() and
>>> dp_packet_put*() functions to work within the bounds of existing
>>> mbufs only;
>>> - Fix dp_packet_clear() which wasn't correctly clearing / freeing
>>> other mbufs in the chain for chains with more than a single mbuf;
>>> - dp_packet layer functions (such as dp_packet_l3()) now check if
>>> the header is within the first mbuf, when using mbufs;
>>> - Move patch 08/13 to before patch 04/13, since dp_packet_set_size()
>>> was refactored to use free_dpdk_buf();
>>> - Fix wrong rte_memcpy() when performing dp_packet_clone() which was
>>> leading to memory corruption;
>>> - Modified the added tests to account for some of the above changes;
>>> - Run performance tests, compiling results and adding them to the
>>> cover letter;
>>> - Add a multi-seg mbufs explanation to the jumbo-frames.rst doc,
>>> together with a "Performance notes" sub-section reflecting the
>>> findings mentioned above in the cover letter.
>>>
>>> v7: - Rebase on master 5e720da ("erspan: fix invalid erspan version.");
>>> - Address Ilya comments;
>>> - Fix non-DPDK build;
>>> - Serialise the access of non pmds to allocation and free of mbufs
>>> by
>>> using a newly introduced mutex.
>>> - Add a new set of tests that integrates with the recently added DPDK
>>> testsuite. These focus on allocating dp_packets, with a single or
>>> multiple mbufs, from an instantiated mempool and performing several
>>> operations on those, verifying if the data at the end matches what's
>>> expected;
>>> - Fix bugs found by the new tests:
>>> - dp_packet_shift() wasn't taking into account shift lefts;
>>> - dp_packet_resize__() was misusing and miscalculating the tailrooms
>>> and headrooms, ending up calculating the wrong number of segments
>>> that needed allocation;
>>> - An mbuf's end was being miscalculated both in dp_packet_tail,
>>> dp_packet_mbuf_tail() and dp_packet_end();
>>> - dp_packet_set_size() was not updating the number of chained
>>> segments
>>> 'nb_segs';
>>> - Add support for multi-seg mbufs in dp_packet_clear().
>>>
>>> v6: - Rebase on master 7c0cb29 ("conntrack-tcp: Handle tcp session
>>> reuse.");
>>> - Further improve dp_packet_put_uninit() and dp_packet_shift() to
>>> support multi-seg mbufs;
>>> - Add support for multi-seg mbufs in dp_packet_l4_size() and
>>> improve other helper funcs, such as dp_packet_tail() and dp_
>>> packet_tailroom().
>>> - Add support for multi-seg mbufs in dp_packet_put(), dp_packet_
>>> put_zeros(), as well as dp_packet_resize__() - allocating new
>>> mbufs and linking them together;
>>> Restructured patchset:
>>> - Squash patch 5 into patch 6, since they were both related to
>>> copying data while handling multi-seg mbufs;
>>> - Split patch 4 into two separate patches - one that introduces the
>>> changes in helper functions to deal with multi-seg mbufs and
>>> two others for the shift() and put_uninit() functionality;
>>> - Move patch 4 to before patch 3, so that ihelper functions come
>>> before functionality improvement that rely on those helpers.
>>>
>>> v5: - Rebased on master e5e22dc ("datapath-windows: Prevent ct-counters
>>> from getting redundantly incremented");
>>> - Sugesh's comments have been addressed:
>>> - Changed dp_packet_set_data() and dp_packet_set_size() logic to
>>> make them independent of each other;
>>> - Dropped patch 3 now that dp_packet_set_data() and dp_packet_set_
>>> size() are independent;
>>> - dp_packet_clone_with_headroom() now has split functions for
>>> handling DPDK sourced packets and non-DPDK packets;
>>> - Modified various functions in dp-packet.h to account for multi-seg
>>> mbufs - dp_packet_put_uninit(), dp_packet_tail(), dp_packet_tail()
>>> and dp_packet_at();
>>> - Added support for shifting packet data in multi-seg mbufs, using
>>> dp_packet_shift();
>>> - Fixed some minor inconsistencies.
>>>
>>> Note that some of the changes in v5 have been contributed by Mark
>>> Kavanagh as well.
>>>
>>> v4: - restructure patchset
>>> - account for 128B ARM cacheline when sizing mbufs
>>>
>>> Mark Kavanagh (4):
>>> netdev-dpdk: fix mbuf sizing
>>> dp-packet: Init specific mbuf fields.
>>> netdev-dpdk: copy large packet to multi-seg. mbufs
>>> netdev-dpdk: support multi-segment jumbo frames
>>>
>>> Michael Qiu (1):
>>> dp-packet: copy data from multi-seg. DPDK mbuf
>>>
>>> Tiago Lam (8):
>>> dp-packet: Fix allocated size on DPDK init.
>>> netdev-dpdk: Serialise non-pmds mbufs' alloc/free.
>>> dp-packet: Fix data_len handling multi-seg mbufs.
>>> dp-packet: Handle multi-seg mbufs in helper funcs.
>>> dp-packet: Handle multi-seg mubfs in shift() func.
>>> dpdk-tests: Add uni-tests for multi-seg mbufs.
>>> dpdk-tests: Accept other configs in OVS_DPDK_START
>>> dpdk-tests: End-to-end tests for multi-seg mbufs.
>>>
>>> Documentation/topics/dpdk/jumbo-frames.rst | 52 +++
>>> Documentation/topics/dpdk/memory.rst | 36 ++
>>> NEWS | 1 +
>>> lib/dp-packet.c | 173 +++++++++-
>>> lib/dp-packet.h | 214 ++++++++++--
>>> lib/dpdk.c | 8 +
>>> lib/netdev-dpdk.c | 244 +++++++++++---
>>> lib/netdev-dpdk.h | 2 +
>>> tests/automake.mk | 10 +-
>>> tests/dpdk-packet-mbufs.at | 7 +
>>> tests/system-dpdk-macros.at | 6 +-
>>> tests/system-dpdk-testsuite.at | 1 +
>>> tests/system-dpdk.at | 65 ++++
>>> tests/test-dpdk-mbufs.c | 513
>>> +++++++++++++++++++++++++++++
>>> vswitchd/vswitch.xml | 22 ++
>>> 15 files changed, 1277 insertions(+), 77 deletions(-)
>>> create mode 100644 tests/dpdk-packet-mbufs.at
>>> create mode 100644 tests/test-dpdk-mbufs.c
>>>
>
>
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev