On Thu, 5 Mar 2026 at 00:04, Ilya Maximets <[email protected]> wrote:
>
> On 2/25/26 12:03 PM, David Marchand via dev wrote:
> > By default, DPDK based dp-packets points to data buffers that can't be
> > expanded dynamically.
> > Their layout is as follows:
> > - a minimum 128 bytes headroom chosen at DPDK build time
> > (RTE_PKTMBUF_HEADROOM),
> > - a maximum size chosen at mempool creation,
> >
> > In some usecases though (like encapsulating with multiple tunnels),
> > a 128 bytes headroom is too short.
> >
> > Keep on using mono segment packets but dynamically allocate buffers
> > in DPDK memory and make use of DPDK external buffers API
> > (previously used for userspace TSO).
> >
> > Signed-off-by: David Marchand <[email protected]>
> > ---
> > Changes since v3:
> > - split buffer length calculation in a helper,
> > - handled running test without qdisc (net/tap does not require
> > those qdiscs, but spews ERR level logs if absent),
> > - added check on firewall,
> >
> > Changes since v2:
> > - moved check on uint16_t overflow in netdev_dpdk_extbuf_allocate(),
> >
> > Changes since v1:
> > - fixed new segment length (reset by extbuf attach helper),
> > - added a system-dpdk unit test,
> >
> > ---
> > lib/dp-packet.c | 17 +++++++++++-
> > lib/netdev-dpdk.c | 47 ++++++++++++++++++++++++++++----
> > lib/netdev-dpdk.h | 4 +++
> > tests/system-dpdk.at | 65 ++++++++++++++++++++++++++++++++++++++++++++
> > 4 files changed, 127 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/dp-packet.c b/lib/dp-packet.c
> > index c04d608be6..4c45636039 100644
> > --- a/lib/dp-packet.c
> > +++ b/lib/dp-packet.c
> > @@ -255,8 +255,23 @@ dp_packet_resize(struct dp_packet *b, size_t
> > new_headroom, size_t new_tailroom)
> > new_allocated = new_headroom + dp_packet_size(b) + new_tailroom;
> >
> > switch (b->source) {
> > - case DPBUF_DPDK:
> > + case DPBUF_DPDK: {
> > +#ifdef DPDK_NETDEV
> > + uint32_t buf_len;
> > +
> > + buf_len = netdev_dpdk_extbuf_size(new_allocated);
>
> Shouldn't we assign into new_allocated here? The rte_pktmbuf_attach_extbuf()
> will update the mbuf->buf_len to the result of this call. However, there
> is the dp_packet_set_allocated(b, new_allocated); call that will overwrite
> that value with a potentially smaller 'new_allocated'. I'm not sure if that
> can cause any issues, since the value is smaller, but it doesn't feel right.
>
> Or am I missing something here?
rte_pktmbuf_ext_shinfo_init_helper() stores the struct
rte_mbuf_ext_shared_info object after the actual data.
The computed size here under the name buf_len accounts for the length
that OVS wants + some space for the rte_mbuf_ext_shared_info object
and some alignment constraint.
We can't adjust new_allocated to this value.
I can probably rename the variable.
Now, you are right that OVS may set a smaller value in buf_len later,
because of dp_packet_set_allocated().
I don't think it would be an issue, just that we would be wasting some
space that was allocated because of alignment.
Continuation below...
>
> > + ovs_assert(buf_len <= UINT16_MAX);
> > + new_base = netdev_dpdk_extbuf_allocate(buf_len);
> > + if (!new_base) {
> > + out_of_memory();
> > + }
> > + dp_packet_copy__(b, new_base, new_headroom, new_tailroom);
> > + netdev_dpdk_extbuf_replace(b, new_base, buf_len);
... new_allocated can be adjusted here to pkt->mbuf.buf_len after call
to netdev_dpdk_extbuf_replace() since rte_pktmbuf_attach_extbuf stores
the right available length in mbuf->buf_len.
It is too late to adjust new_tailroom that was used to copy existing
data, but is it an issue?
(btw, why copy head/tailroom data? those locations should be treated
as uninitialized data from my pov)
> > + break;
> > +#else
> > OVS_NOT_REACHED();
> > +#endif
> > + }
> >
> > case DPBUF_MALLOC:
> > if (new_headroom == dp_packet_headroom(b)) {
> > diff --git a/lib/netdev-dpdk.c b/lib/netdev-dpdk.c
> > index 923191da84..cfd641b493 100644
> > --- a/lib/netdev-dpdk.c
> > +++ b/lib/netdev-dpdk.c
> > @@ -3072,12 +3072,51 @@ netdev_dpdk_filter_packet_len(struct netdev_dpdk
> > *dev, struct rte_mbuf **pkts,
> > return cnt;
> > }
> >
> > +uint32_t
> > +netdev_dpdk_extbuf_size(uint32_t data_len)
> > +{
> > + uint32_t buf_len = data_len;
> > +
> > + buf_len += sizeof(struct rte_mbuf_ext_shared_info) + sizeof(uintptr_t);
> > + buf_len = RTE_ALIGN_CEIL(buf_len, sizeof(uintptr_t));
> > +
> > + return buf_len;
> > +}
> > +
> > +void *
> > +netdev_dpdk_extbuf_allocate(uint32_t buf_len)
> > +{
> > + return rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> > +}
> > +
> > static void
> > netdev_dpdk_extbuf_free(void *addr OVS_UNUSED, void *opaque)
> > {
> > rte_free(opaque);
> > }
> >
> > +void
> > +netdev_dpdk_extbuf_replace(struct dp_packet *b, void *buf, uint32_t
> > data_len)
> > +{
> > + struct rte_mbuf *pkt = (struct rte_mbuf *) b;
> > + struct rte_mbuf_ext_shared_info *shinfo;
> > + uint16_t buf_len = data_len;
> > +
> > + shinfo = rte_pktmbuf_ext_shinfo_init_helper(buf, &buf_len,
> > + netdev_dpdk_extbuf_free,
> > + buf);
> > + ovs_assert(shinfo != NULL);
> > +
> > + if (RTE_MBUF_HAS_EXTBUF(pkt)) {
> > + rte_pktmbuf_detach_extbuf(pkt);
> > + }
> > + rte_pktmbuf_attach_extbuf(pkt, buf, rte_malloc_virt2iova(buf), buf_len,
> > + shinfo);
> > + /* OVS only supports mono segment.
> > + * Packet size did not change, restore the current segment length. */
> > + pkt->data_len = pkt->pkt_len;
> > +}
> > +
> > static struct rte_mbuf *
> > dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt, uint32_t data_len)
> > {
> > @@ -3086,16 +3125,14 @@ dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt,
> > uint32_t data_len)
> > uint16_t buf_len;
> > void *buf;
> >
> > - total_len += sizeof *shinfo + sizeof(uintptr_t);
> > - total_len = RTE_ALIGN_CEIL(total_len, sizeof(uintptr_t));
> > -
> > + total_len = netdev_dpdk_extbuf_size(total_len);
> > if (OVS_UNLIKELY(total_len > UINT16_MAX)) {
> > VLOG_ERR("Can't copy packet: too big %u", total_len);
> > return NULL;
> > }
> >
> > buf_len = total_len;
> > - buf = rte_malloc(NULL, buf_len, RTE_CACHE_LINE_SIZE);
> > + buf = netdev_dpdk_extbuf_allocate(buf_len);
> > if (OVS_UNLIKELY(buf == NULL)) {
> > VLOG_ERR("Failed to allocate memory using rte_malloc: %u",
> > buf_len);
> > return NULL;
> > @@ -3106,7 +3143,7 @@ dpdk_pktmbuf_attach_extbuf(struct rte_mbuf *pkt,
> > uint32_t data_len)
> > netdev_dpdk_extbuf_free,
> > buf);
> > if (OVS_UNLIKELY(shinfo == NULL)) {
> > - rte_free(buf);
> > + netdev_dpdk_extbuf_free(NULL, buf);
> > VLOG_ERR("Failed to initialize shared info for mbuf while "
> > "attempting to attach an external buffer.");
> > return NULL;
> > diff --git a/lib/netdev-dpdk.h b/lib/netdev-dpdk.h
> > index e6779d478a..0029372ee3 100644
> > --- a/lib/netdev-dpdk.h
> > +++ b/lib/netdev-dpdk.h
> > @@ -32,6 +32,10 @@ struct netdev;
> > void netdev_dpdk_register(const struct smap *);
> > void free_dpdk_buf(struct dp_packet *);
> >
> > +uint32_t netdev_dpdk_extbuf_size(uint32_t);
> > +void *netdev_dpdk_extbuf_allocate(uint32_t);
> > +void netdev_dpdk_extbuf_replace(struct dp_packet *, void *, uint32_t);
>
> nit: Please, name the arguments with the non-descriptive types.
Will do.
>
> > +
> > bool netdev_dpdk_flow_api_supported(struct netdev *, bool check_only);
> >
> > int
> > diff --git a/tests/system-dpdk.at b/tests/system-dpdk.at
> > index 17d3d25955..57d20acf3b 100644
> > --- a/tests/system-dpdk.at
> > +++ b/tests/system-dpdk.at
> > @@ -976,3 +976,68 @@ AT_CHECK([ovs-appctl --format json --pretty
> > dpif/offload/show], [0], [dnl
> > OVS_DPDK_STOP_VSWITCHD
> > AT_CLEANUP
> > dnl
> > --------------------------------------------------------------------------
> > +
> > +dnl
> > --------------------------------------------------------------------------
> > +dnl Test headroom expansion.
> > +AT_SETUP([OVS-DPDK - headroom expansion])
> > +AT_KEYWORDS([dpdk])
> > +OVS_CHECK_FIREWALL()
> > +OVS_DPDK_PRE_CHECK()
> > +OVS_DPDK_START([--no-pci])
>
> I think, this test deserves an explanation of the topology or a diagram.
>
> The naming scheme also can be cleaned up, as it's hard to follow what is
> what. AFAIU, we have tap0, vxlan3 and vxlan1 in the test namespace and
> we have corresponding br_underlay1, br_underlay0 and br0 in the root.
>
> br_underlay1 and tap0 are in fc00:0::/64 network and are directly connected.
> br_underlay0 and vxlan3 are terminating first vxlan tunnel within fc00:1::/64.
> br0 and vxlan1 are terminating the second vxlan tunnel within fc00:2::/64.
>
> This naming scheme doesn't make a lot of sense to me and is very confusing.
Other vxlan test has no diagram, and I copied the names from other vxlan test.
I'll update this new test with your suggestion.
>
> I'd suggest to rename:
>
> br_underlay1 --> br0 (fc00:0::100)
> br_underlay0 --> br1 (fc00:1::100, remote: fc00:0::1)
> br0 --> br2 (fc00:2::100, remote: fc00:1::1)
> tap1 --> p0 (fc00:0::1)
> vxlan3 --> vxlan1 (fc00:1::1, remote: fc00:0::100)
> vxlan1 --> vxlan2 (fc00:2::1, remote: fc00:1::100)
>
> Also, it may be better to name the OVS interface and the tap port the
> same name to avoid extra confusion.
>
> The test also creates tap2 interface in the root namespace in the same
> network as br0. I don't think that interface plays any role in the setup.
> There is no difference between pinging fc00:2::100 and fc00:2::200.
I read your other comment, I'll double check.
> And I wonder if it's maybe worth doing 3 levels of encaps instead of 2,
> just to be sure. WDYT?
Having 2 levels was hard enough to my brain and two (IPv6) vxlan
encapsulations are 140 bytes.
What are you scared of?
--
David Marchand
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev