Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection

Andrew Rybchenko Wed, 11 Apr 2018 07:03:20 -0700

On 04/11/2018 02:39 PM, Ananyev, Konstantin wrote:

Hi Yongseok,

On Mon, Apr 09, 2018 at 06:04:34PM +0200, Olivier Matz wrote:

Hi Yongseok,

On Tue, Apr 03, 2018 at 05:12:06PM -0700, Yongseok Koh wrote:

On Tue, Apr 03, 2018 at 10:26:15AM +0200, Olivier Matz wrote:

Hi,

On Mon, Apr 02, 2018 at 11:50:03AM -0700, Yongseok Koh wrote:

When attaching a mbuf, indirect mbuf has to point to start of buffer of
direct mbuf. By adding buf_off field to rte_mbuf, this becomes more
flexible. Indirect mbuf can point to any part of direct mbuf by calling
rte_pktmbuf_attach_at().

Possible use-cases could be:
- If a packet has multiple layers of encapsulation, multiple indirect
   buffers can reference different layers of the encapsulated packet.
- A large direct mbuf can even contain multiple packets in series and
   each packet can be referenced by multiple mbuf indirections.

Signed-off-by: Yongseok Koh <[email protected]>

I think the current API is already able to do what you want.

1/ Here is a mbuf m with its data

                off
                <-->
                       len
           +----+   <---------->
           |    |
         +-|----v----------------------+
         | |    -----------------------|
m       | buf  |    XXXXXXXXXXX      ||
         |      -----------------------|
         +-----------------------------+


2/ clone m:

   c = rte_pktmbuf_alloc(pool);
   rte_pktmbuf_attach(c, m);

   Note that c has its own offset and length fields.


                off
                <-->
                       len
           +----+   <---------->
           |    |
         +-|----v----------------------+
         | |    -----------------------|
m       | buf  |    XXXXXXXXXXX      ||
         |      -----------------------|
         +------^----------------------+
                |
           +----+
indirect  |
         +-|---------------------------+
         | |    -----------------------|
c       | buf  |                     ||
         |      -----------------------|
         +-----------------------------+

                 off    len
                 <--><---------->


3/ remove some data from c without changing m

    rte_pktmbuf_adj(c, 10)   // at head
    rte_pktmbuf_trim(c, 10)  // at tail


Please let me know if it fits your needs.

No, it doesn't.

Trimming head and tail with the current APIs removes data and make the space
available. Adjusting packet head means giving more headroom, not shifting the
buffer itself. If m has two indirect mbufs (c1 and c2) and those are pointing to
difference offsets in m,

rte_pktmbuf_adj(c1, 10);
rte_pktmbuf_adj(c2, 20);

then the owner of c2 regard the first (off+20)B as available headroom. If it
wants to attach outer header, it will overwrite the headroom even though the
owner of c1 is still accessing it. Instead, another mbuf (h1) for the outer
header should be linked by h1->next = c2.

Yes, after these operations c1, c2 and m should become read-only. So, to
prepend headers, another mbuf has to be inserted before as you suggest. It
is possible to wrap this in a function rte_pktmbuf_clone_area(m, offset,
length) that will:
   - alloc and attach indirect mbuf for each segment of m that is
     in the range [offset : length+offset].
   - prepend an empty and writable mbuf for the headers

If c1 and c2 are attached with shifting buffer address by adjusting buf_off,
which actually shrink the headroom, this case can be properly handled.

What do you mean by properly handled?

Yes, prepending data or adding data in the indirect mbuf won't override
the direct mbuf. But prepending data or adding data in the direct mbuf m
won't be protected.

 From an application point of view, indirect mbufs, or direct mbufs that
have refcnt != 1, should be both considered as read-only because they
may share their data. How an application can know if the data is shared
or not?

Maybe we need a flag to differentiate mbufs that are read-only
(something like SHARED_DATA, or simply READONLY). In your case, if my
understanding is correct, you want to have indirect mbufs with RW data.

Agree that indirect mbuf must be treated as read-only, Then the current code is
enough to handle that use-case.

And another use-case (this is my actual use-case) is to make a large mbuf have
multiple packets in series. AFAIK, this will also be helpful for some FPGA NICs
because it transfers multiple packets to a single large buffer to reduce PCIe
overhead for small packet traffic like the Multi-Packet Rx of mlx5 does.
Otherwise, packets should be memcpy'd to regular mbufs one by one instead of
indirect referencing.

But just to make HW to RX multiple packets into one mbuf,
data_off inside indirect mbuf should be enough, correct?

Right. Current max buffer len of mbuf is 64kB (16bits) but it is enough for mlx5
to reach to 100Gbps with 64B traffic (149Mpps). I made mlx5 HW put 16 packets in
a buffer. So, it needs ~32kB buffer. Having more bits in length fields would be
better but 16-bit is good enough to overcome the PCIe Gen3 bottleneck in order
to saturate the network link.

There were few complains that 64KB max is a limitation for some use-cases.
I am not against increasing it, but I don't think we have free space on first 
cache-line for that
without another big rework of mbuf layout.
Considering that we need to increase size for buf_len, data_off, data_len, and 
probably priv_size too.

As I understand, what you'd like to achieve with this new field -
ability to manipulate packet boundaries after RX, probably at upper layer.
As Olivier pointed above, that doesn't sound as safe approach - as you have 
multiple
indirect mbufs trying to modify same direct buffer.

I agree that there's an implication that indirect mbuf or mbuf having refcnt > 1
is read-only. What that means, all the entities which own such mbufs have to be
aware of that and keep the principle as DPDK can't enforce the rule and there
can't be such sanity check. In this sense, HW doesn't violate it because the
direct mbuf is injected to HW before indirection. When packets are written by
HW, PMD attaches indirect mbufs to the direct mbuf and deliver those to
application layer with freeing the original direct mbuf (decrement refcnt by 1).
So, HW doesn't touch the direct buffer once it reaches to upper layer.

Yes, I understand that. But as I can see you introduced functions to adjust 
head and tail,
which implies that it should be possible by some entity (upper layer?) to 
manipulate these
indirect mbufs.
And we don't know how exactly it will be done.

The direct buffer will be freed and get available for reuse when all the 
attached
indirect mbufs are freed.

Though if you really need to do that, why it can be achieved by updating 
buf_len and priv_size
Fields for indirect mbufs, straight after attach()?

Good point.
Actually that was my draft (Mellanox internal) version of this patch :-) But I
had to consider a case where priv_size is really given by user. Even though it
is less likely, but if original priv_size is quite big, it can't cover entire
buf_len. For this, I had to increase priv_size to 32-bit but adding another
16bit field (buf_off) looked more plausible.

As I remember, we can't have mbufs bigger then 64K,
so priv_size + buf_len should be always less than 64K, correct?

It sounds like it is suggested to use/customize priv_size to limitindirect mbuf

range in the direct one. It does not work from the box since priv_size is
used to find out direct mbuf by indirect (see rte_mbuf_from_indirect()).

Andrew.

Re: [dpdk-dev] [PATCH v2 1/6] mbuf: add buffer offset field for flexible indirection

Reply via email to