On 20 September 2016 at 16:01, Bill Fischofer <bill.fischo...@linaro.org>
wrote:

>
>
> On Tue, Sep 20, 2016 at 8:30 AM, Christophe Milard <
> christophe.mil...@linaro.org> wrote:
>
>> Hi,
>>
>> I am trying here to summarize what the driver interface needs regarding
>> ODP packet handling. This will serve as a basis for the discussions
>> at Connect. Please read and comment... possibly at Connect...
>>
>> /Christophe
>>
>> From the driver perspective, the situation is rather simple: what we need
>> is:
>>
>> /* definition of a packet segment descriptor:
>>  * A packet segment is just an area, contiguous in virtual address space
>>  * and contiguous in physical address space (at least when no IOMMU is
>>  * used, e.g. for virtio). We probably want physical contiguity in all
>>  * cases (to avoid handling different cases to start with), but that
>>  * would not take advantage of the remapping that IOMMUs can do, so it
>>  * may come with a small performance penalty in the IOMMU case.
>>
>
> I thought we had discussed and agreed that ODP would assume it is running
> on a platform with IOMMU capability? Are there any non-IOMMU platforms of
> interest that we need to support? If not, then I see no need to make this
> provision. In ODP we already have an odp_packet_seg_t type that represents
> a portion of an odp_packet_t that can be contiguously addressed.
>

Yes, we did. But then the focus changed to virtio, and there is no IOMMU
there...


>
>
>>  * Segments are shared among all odp threads (including linux processes),
>>
>
> Might be more precise to simply say "segments are accessible to all odp
> threads". Sharing implies simultaneous access, along with some notion of
> coherence, which is something that probably isn't needed.
>

Tell me if I am wrong, but the default in ODP is that a queue can be
shared between different ODP threads (there is a flag to guarantee
one-thread<->one-queue access, and hence gain a performance benefit), but
as it stands nothing prevents thread A from putting something in the TX
ring buffer and thread B from freeing the TX'd data when putting its own
stuff in the same TX queue. The same shareability applies on the RX side.
With these ODP assumptions, we have to be able to access the segments from
different ODP threads. I would be very pleased to be wrong here :-)
Maybe I should add that I don't think a context switch at each driver
"access" is an option, i.e. I don't see a driver as its own ODP
thread/Linux process accessed through some IPC: for me, any ODP thread
sending/receiving packets will act as a driver (same context).


>
>
>>  * and are guaranteed to be mapped at the same virtual address in
>>  * all ODP instances (single_va flag in ishm) */
>>
>
> Why is this important? How does Thread A know how a segment is accessible
> by Thread B, and does it care?
>

I am afraid this relates to my previous answer. If the addresses of
segments (and packets) differ from thread to thread, no reference via a
shared pointer is possible between the ODP threads acting as drivers,
hence a loss of efficiency.

>
>
>>  * Note that this definition just implies that a packet segment is
>>  * reachable by the driver. A segment could actually be part of a HW IO
>>  * chip on HW-accelerated hardware.
>>
>
> I think this is the key. All that (should) be needed is for a driver to be
> able to access any segment that it is working with. How it does so would
> seem to be secondary from an architectural perspective.
>

Sure, but we still have to implement something in linux-generic, and make
it possible for others to do something good.


>
>
>> /* for linux-gen:
>>  * Segments are memory areas.
>>  * In TX, pkt_sgmt_join() puts the pointer to the odp packet in the
>>  * 'odp_private' element of the last segment of each packet, so that
>>  * pkt_sgmt_free() can do nothing when odp_private is NULL and release
>>  * the complete odp packet when it is not NULL. Segments allocated with
>>  * pkt_sgmt_alloc() will have their odp_private set to NULL. The name and
>>  * the 'void *' are there to make this opaque to the driver interface,
>>  * which really should not care...
>>  * Other ODP implementations could handle this as they wish.
>>
>
> Need to elaborate on this. Currently we have an odp_packet_alloc() API
> that allocates a packet that consists of one or more segments. What seems
> to be new from the driver is the ability to allocate (and free) individual
> segments and then (a) assemble them into odp_packet_t objects or (b) remove
> them from odp_packet_t objects so that they become unaffiliated raw
> segments not associated with any odp_packet_t.
>

Yes, a) is definitely needed. We have to be able to allocate segments
without telling which ODP packet they relate to, simply because (at least
for some NICs) we cannot know at alloc time which packet a segment will
belong to: if we put 32 x 2K segments in an RX buffer, this can end up as
one single ODP packet using them all (for a 64K jumbo frame) or as 32
separate standard ODP packets. We just don't know at alloc time.
No, b) is not really as you wrote, I think: we have to be able to split
the packet to be transmitted into its physically contiguous components
(assuming no IOMMU, e.g. for virtio), but it is fully acceptable not to be
able to re-use a TX'd segment, i.e. the segment can die with the ODP
packet, probably when the last segment is freed, but not even necessarily:
an ODP packet with multiple references to it would keep all its segments
valid until the last reference disappears.
So calling pkt_sgmt_free() would do nothing except for the last segment of
a packet (it is reasonable to assume ordering, as segments will be TX'd
from first to last, I assume). On the last segment, it would decrease the
packet refcount and free the ODP packet (and hence all its segments) when
the refcount reaches 0.
This is a bit asymmetrical, you are right: segments need to be free of any
packet relationship when allocated, but this requirement does not exist on
the TX side, and we can gain performance from this asymmetry, I assume.
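To make that asymmetry concrete, the linux-generic free path could look
roughly like this (just a sketch, assuming, as described above, that
odp_private carries the odp_packet_t handle on the last segment of a
packet and is NULL everywhere else):

/* sketch only: odp_private is NULL on every segment except the last one
 * of a packet, where it holds the odp_packet_t handle */
int pkt_sgmt_free(pkt_sgmt_t *segment)
{
        if (segment->odp_private == NULL)
                return 0;  /* not the last segment: nothing to do */

        /* last segment: drop our reference; the packet (and hence all
         * its segments) goes away when the last reference disappears */
        odp_packet_free((odp_packet_t)segment->odp_private);
        segment->odp_private = NULL;
        return 0;
}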

>
> So it seems we need a corresponding set of odp_segment_xxx() APIs that
> operate on a new base type: odp_segment_t. An odp_segment_t becomes an
> odp_packet_seg_t when it (possibly together with other segments) is
> converted into an odp_packet_t as part of a packet assembly operation.
> Conversely, an odp_packet_seg_t becomes an odp_segment_t when it is
> disconnected from an odp_packet_t.
>

I don't think we need these two types (odp_segment_t and odp_packet_seg_t,
depending on packet relationship): I think it is simpler to expose only
one type to the driver: the segment. The driver just sees an alloc() and a
free() function for segments, plus a function to tell ODP which segments
form a packet at RX time. The asymmetry in the ODP-internal implementation
of segments does not even need to be visible to the driver, so I don't see
the need to expose these two types.
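For instance, the TX side of a driver would only ever manipulate
pkt_sgmt_t (again just a sketch, using the pkt_sgmt_get() and
pkt_sgmt_free_multi() functions proposed further down; DRV_MAX_SEGS and
nic_tx_ring_push() are made up for the example):

#define DRV_MAX_SEGS 16  /* made-up limit for the example */

/* sketch: the TX path as seen from a driver; only pkt_sgmt_t is visible */
static int drv_send(odp_packet_t *pkt)
{
        pkt_sgmt_t segs[DRV_MAX_SEGS];
        int i, n;

        n = pkt_sgmt_get(pkt, segs, DRV_MAX_SEGS);
        if (n == 0)
                return -1;

        for (i = 0; i < n; i++)
                nic_tx_ring_push(&segs[i]);  /* program one NIC descriptor */

        /* on TX completion: pkt_sgmt_free_multi(segs, n); only the last
         * segment actually releases the odp packet, as explained above */
        return 0;
}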


>
>
>>  */
>>
>> typedef uint64_t phy_address_t;
>>
>> typedef struct{
>>         void            *address;
>>         phy_address_t   phy_addr;
>>         uint32_t        len;
>>         void*           odp_private;
>> } pkt_sgmt_t;
>>
>> /* FOR RX: */
>> /* segment allocation function:
>>  * As it is not possible to guarantee physical memory contiguity from
>>  * user space, this segment alloc function is best effort:
>>  * The size passed as parameter is a hint of the most probable received
>>  * packet size: this alloc function will allocate a segment whose size
>>  * is greater than or equal to the requested size if the latter can fit
>>  * in a single page (or huge page), hence guaranteeing the segment's
>>  * physical contiguity.
>>  * If there is no physical page large enough for 'size' bytes, then the
>>  * largest page is returned, meaning that in that case the allocated
>>  * segment will be smaller than the requested size (the received packet
>>  * will be fragmented in this case).
>>  * This pkt_sgmt_alloc function is called by the driver RX side to
>>  * populate the NIC RX ring buffer(s).
>>  * Returns the number of allocated segments (1) on success or 0 on error.
>>  * Note: on Unix systems with 2K and 2M pages, this means that 2M will
>>  * get allocated for each large (64K?) packet... too much waste? Should
>>  * we handle page fragmentation (which would really not change this
>>  * interface)?
>>  */
>> int pkt_sgmt_alloc(uint32_t size, pkt_sgmt_t *returned_sgmt);
>>
>
> Presumably the driver is maintaining rings of odp_segment_t's that it has
> allocated. Currently allocations are from ODP pools that are of various
> pool types. So one question is: do we define a new pool type
> (ODP_POOL_SEGMENT) for the existing odp_pool_t, or do we introduce a
> separate odp_segment_pool_t type to represent an aggregate area of memory
> from which odp_segment_t's are drawn?
>

If I remember right, a pktio is given an RX pool at creation time (or
maybe this is now a per-queue thing, I have to check). But anyway, it
indeed makes sense that pkt_sgmt_alloc() would take its segments from the
memory associated with the RX queue.
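For reference, the north interface already binds the RX pool to the pktio
at open time, so the driver-side alloc could simply draw from that pool.
A rough sketch against the current north API (pool sizes and device name
are just examples):

#include <odp_api.h>

/* sketch: the application already gives the RX pool to the pktio at open
 * time; pkt_sgmt_alloc() for this pktio's RX ring(s) could then simply
 * draw its segments from that pool */
static odp_pktio_t open_drv_pktio(const char *dev)
{
        odp_pool_param_t pool_prm;
        odp_pktio_param_t pktio_prm;
        odp_pool_t pool;

        odp_pool_param_init(&pool_prm);
        pool_prm.type    = ODP_POOL_PACKET;
        pool_prm.pkt.num = 1024;        /* example values only */
        pool_prm.pkt.len = 2048;
        pool = odp_pool_create("drv_rx_pool", &pool_prm);

        odp_pktio_param_init(&pktio_prm);
        return odp_pktio_open(dev, pool, &pktio_prm);
}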


> Alternately, do we provide an extended set of semantics on the current
> ODP_POOL_PACKET pools? That might be cleaner given how odp_segment_t
> objects need to be convertible to and from odp_packet_seg_t objects. This
> means that either (a) applications must pass an odp_pool_t to the driver as
> part of odp_pktio_open() processing, or (b) the driver itself calls
> odp_pool_create() and communicates that back to the application.
>

I thought that was already the case? Where are packets received on a pktio
put today? Doesn't this relationship already exist?


>
>
>>
>> /*
>>  * Another variant of the above function could be:
>>  * Returns the number of allocated segments on success or 0 on error.
>>  */
>> int pkt_sgmt_alloc_multi(uint32_t size, pkt_sgmt_t *returned_sgmts,
>>                          int* nb_sgmts);
>>
>> /*
>>  * Creating ODP packets from the segments:
>>  * Once a series of segments belonging to a single received packet has
>>  * been fully received (note that this series can be of length 1 if the
>>  * received packet fit in a single segment), we need a function to create
>>  * the ODP packet from the list of segments.
>>  * We first define the "pkt_sgmt_hint" structure, which can be used by
>>  * a NIC to pass information about the received packet (the HW probably
>>  * knows a lot about the received packet, so the SW does not necessarily
>>  * need to reparse it: the hint struct contains info which is already
>>  * known by the HW). If hint is NULL when calling pkt_sgmt_join(), then
>>  * the SW has to reparse the received packet from scratch.
>>  * pkt_sgmt_join() returns 0 on success.
>>  */
>> typedef struct {
>>         /* ethtype, crc_ok, L2 and L3 offset, ip_crc_ok, ... */
>> } pkt_sgmt_hint;
>>
>
> I think the main point here is that the driver may have information
> relating to the received packet that should populate the packet metadata as
> part of creating the odp_packet_t from the individual odp_segment_t's that
> contain the packet data. At a gross level we'd expect this sort of
> processing:
>
> odp_packet_t driver_rx() {
>          ...receive a packet into a collection of odp_segment_t's
>          return odp_packet_assemble(seg_list, seg_count, ...);
> }
>
> Where odp_packet_assemble() would have the signature:
>
> odp_packet_t odp_packet_assemble(odp_segment_t seg_list[], int seg_count,
> ...);
>
> This would be a new API that, unlike odp_packet_alloc(), takes an existing
> list of odp_segment_t's and associated info to return an odp_packet_t.
>
> The driver_rx() interface would be called by the scheduler as it polls the
> pktio; the returned packet is then processed by issuing the call:
>
> odp_packet_classify(pkt, 0, ODP_PARSE_L2, ODP_NO_REBUFFER);
>
> where this API is as proposed in the pipeline design doc:
>
> int odp_packet_classify(odp_packet_t pkt, int offset,
>                         enum odp_packet_layer,
>                         int rebuffer);
>
> This call releases the packet from the scheduler by populating the packet
> metadata and enqueuing it on an appropriate (application) receive queue
> according to its CoS.
>

I think we agree here. This is just making it two function calls instead
of one. I'd like to keep the notion of odp_packet out of the driver's
knowledge though, as all of these are south-interface functions (actually,
I should have prefixed all my symbols with odpdrv_* to make that clear).
But even if the driver should not know about odp_packets (and all the
things an application using the north interface can do with them), it will
be given the possibility to tell which segments form a packet and to set
some of the packet attributes. That can be done by two separate function
calls, of course...
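Roughly, the RX path would then be two calls per received packet (just a
sketch: nic_rx_collect() is a made-up helper gathering the segments of one
received frame from the RX ring, and odp_packet_classify() is the API
proposed in the pipeline design doc, as quoted above):

/* sketch: first build the odp packet from its segments, then let the
 * classifier set the metadata and enqueue it; nic_rx_collect() is
 * hypothetical */
static void drv_rx_one(void)
{
        pkt_sgmt_t segs[16];
        pkt_sgmt_hint hint;
        odp_packet_t pkt;
        int n;

        n = nic_rx_collect(segs, 16, &hint);
        if (n == 0)
                return;

        if (pkt_sgmt_join(&hint, segs, n, &pkt) != 0)
                return;

        /* second call: classification fills in the metadata and picks the
         * destination queue (CoS), as proposed in the pipeline design doc */
        odp_packet_classify(pkt, 0, ODP_PARSE_L2, ODP_NO_REBUFFER);
}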


>
>
>>
>> int pkt_sgmt_join(pkt_sgmt_hint *hint,
>>                   pkt_sgmt_t *segments, int nb_segments,
>>                   odp_packet_t *returned_packet);
>>
>> /* another variant of the above, directly passing the packet to a given
>> queue */
>> int pkt_sgmt_join_and_send(pkt_sgmt_hint *hint,
>>                            pkt_sgmt_t *segments, int nb_segments,
>>                            odp_queue_t *dest_queue);
>>
>
> As noted above, we probably don't want the driver itself concerned with
> destination queues since that's really the job of the classifier.
>

This is actually a tricky point, as part (all?) of the classification can
be done by most NIC HW... To start with, I'll need a place to put the
packets; maybe the entrance of a SW classifier to begin with... We'll need
to discuss that. But I am afraid we are touching the tip of an
uncomfortable iceberg here, because I assume people will want to use their
HW (NIC) ability to perform classification, so eventually the
classification might move (at least partly) down to the driver...


>
>
>>
>>
>> /* FOR TX: */
>> /*
>>  * Function returning the list of segments making up an odp_packet:
>>  * Returns the number of segments or 0 on error.
>>  * The segments are returned in the segments[] array, whose length will
>>  * never exceed max_nb_segments.
>>  */
>> int pkt_sgmt_get(odp_packet_t *packet, pkt_sgmt_t *segments,
>>                  int max_nb_segments);
>>
>> /*
>>  * "free" a segment
>>  */
>> /*
>>  * For linux-generic, that would just do nothing, unless
>>  * segment->odp_private is not NULL, in which case the whole ODP packet
>>  * is freed.
>>  */
>> int pkt_sgmt_free(pkt_sgmt_t *segment);
>> int pkt_sgmt_free_multi(pkt_sgmt_t *segments, int nb_segments);
>>
>>
> We already have APIs for getting and walking the list of
> odp_packet_seg_t's associated with a packet. An API that would do a bulk
> "disassemble" of an odp_packet_t into individual odp_segment_t's would be
> reasonable here. We do need to be careful about freeing, however, as this
> is going to intersect with whatever we do with packet references.
>

I guess my comments above clarify my view on that: first, this new set of
functions belongs to the south interface and is unrelated to the
north-interface functions (though they could, of course, share as much as
they want implementation-wise).
Hopefully, the asymmetry I discussed above clarifies my views regarding
freeing the segments.


> Something to explore more fully.
>
Yes :-)

Thanks for your comments. Hope my answers make sense.

Christophe
