On Wed, May 22, 2024 at 03:26:05PM GMT, Aaron Conole wrote:
> Ilya Maximets <[email protected]> writes:
>
> > On 5/16/24 19:03, Adrian Moreno wrote:
> >>
> >>
> >> On 4/24/24 9:53 PM, Adrian Moreno wrote:
> >>> This is the userspace counterpart of the work being done in the kernel
> >>> [1]. Sending it as RFC to get some early feedback on the overall
> >>> solution.
> >>>
> >>> ** Problem description **
> >>> Currently, OVS supports several observability features, such as
> >>> per-bridge IPFIX, per-flow IPFIX and sFlow. However, given the
> >>> complexity of OVN-generated pipelines, a single sample is typically not
> >>> enough to troubleshoot what is OVN/OVS is doing with the packet. We need
> >>> highler level metadata alongside the packet sample.
> >>>
> >>> This can be achieved by the means of per-flow IPFIX sampling and
> >>> NXAST_SAMPLE action, through which OVN can add two arbitrary 32-bit
> >>> values (observation_domain_id and observation_point_id) that are sent
> >>> alongside the packet information in the IPFIX frames.
> >>>
> >>> However, there is a fundamental limitation of the existing sampling
> >>> infrastructure, specially in the kernel datapath: samples are handled by
> >>> ovs-vswitchd, forcing the need for an upcall (userspace action). The
> >>> fact that samples share the same netlink sockets and handler thread cpu
> >>> time with actual packet misse, can easily cause packet drops making this
> >>> solution unfit for use in highly loaded production systems.
> >>>
> >>> Users are left with little option than guessing what sampling rate will
> >>> be OK for their traffic pattern and dealing with the lost accuracy.
> >>>
> >>> ** Feature Description **
> >>> In order to solve this situation and enable this feature to be safely
> >>> turned on in production, this RFC uses the psample support proposed in
> >>> [1] to send NXAST_SAMPLE samples to psample adding the observability
> >>> domain and point information as metadata.
> >>>
> >>> ~~ API ~~
> >>> The API is simply a new field called "psample_group" in the
> >>> Flow_Sample_Collector_Set (FSCS) table. Configuring this value is
> >>> orthogonal to also associating the FSCS entry to an entry in the IPFIX
> >>> table.
> >>>
> >>> Apart from that configuration, the controller needs to add NXAST_SAMPLE
> >>> actions that refer the entry's id.
> >>>
> >>> ~~ HW Offload ~~
> >>> psample is already supported by tc using the act_sample action. The
> >>> metadata is taken from the actions' cookie.
> >>> This series also adds support for offloading the odp sample action,
> >>> only when it includes psample information but not nested actions.
> >>>
> >>> A bit of special care has to be taken in the tc layer to not store the
> >>> ufid in the sample's cookie as it's used to carry action-specific data.
> >>>
> >>> ~~ Datapath support ~~
> >>> This new behavior of the datapth sample action is only supported on the
> >>> kernel datapath. A more detailed analysis is needed (and planned) to
> >>> find a way to also optimize the userspace datapath. However, although
> >>> potentially inefficient, there is not that much risk of dropping packets
> >>> with the existing IPFIX infrastructure.
> >>>
> >>> ~~ Testing ~~
> >>> The series includes an utility program called "ovs-psample" that listens
> >>> to packets multicasted by the psample module and prints them (also
> >>> printing the obs_domain and obs_point ids). In addition the kernel
> >>> series includes a tracepoint for easy testing and troubleshooting.
> >>>
> >>> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=847473
> >>>
> >>>
> >>
> >> Hi all,
> >>
> >> I had an offline meeting with Eelco, Ilya and Aaron about this topic and 
> >> wanted
> >> to put out what was discussed and hopefully get more feedback.
> >>
> >> In general, 3 options were considered:
> >>
> >> Option A: Reusing the sample action
> >> ===================================
> >> This is essentially what this proposal (and it's kernel counterpart) 
> >> consists
> >> in. The datapath action would look like this:
> >>
> >>   sample(probability=50%, actions(...), group=10, cookie=0x123)
> >>
> >> Pros
> >> ~~~~
> >> - In userspace, it fits nicely with the existing per-flow sampling
> >> infrastructure as this RFC exemplifies.
> >>
> >> - The probability is present in the context of sending the packet to 
> >> psample so
> >> it's easily carried to psample's rate, making the consumer aware of the 
> >> accuracy
> >> of the sample.
> >>
> >> - Relatively easy to implement in tc as a act_sample exists with similar 
> >> semantics.
> >>
> >> Cons
> >> ~~~~
> >> - It breaks the original design of the "sample" action. The "sample" action
> >> means: The packet is sampled (a probability is calculated) and, if the 
> >> result is
> >> positive, a set of nested actions is executed. This follows the
> >> "building-blocks" approach of datapath action. This option breaks this
> >> assumption by adding more semantics and behavior to the "sample" action.
> >>
> >> - If we want to add "trunc" to this sampling, the result would probably 
> >> work but
> >> is not very nice because we need it outside of the sample() action, e.g:
> >>     trunc(100),sample(probability=50%, actions(...), group=10,
> >> cookie=0x123),trunc(0)
> >>
> >> - Makes the uAPI a bit clunky by adding more attributes to an existing, 
> >> simple
> >> action. Also, new attributes and existing nested actions are completely
> >> orthogonal so code needs to exist in both userspace and kernel to properly 
> >> split
> >> this behavior.
> >>
> >> - When "trunc" is used, psample will report the original skb length, 
> >> regardless
> >> of whether the "trunc" is associated to the sample action or not. Not sure 
> >> this
> >> is a huge problem nor if it's easy or worth doing it better.
> >>
> >> Option B: Creating a new action: emit_action
> >> ============================================
> >> Credits to Ilya. Creating a new action to implement this would fit into 
> >> the rest
> >> of the actions yielding a flow such:
> >>
> >>   sample(probability=50%, 
> >> actions(trunc(100),emit_sample(group=10,cookie=0x123)))
> >>
> >> Pros
> >> ~~~~
> >> - It better aligns with datapath action design by reusing existing 
> >> building blocks.
> >>
> >> - Code additions are probably cleaner and easier to review as it's adding 
> >> a new
> >> action instead of changing the behavior of an existing one.
> >>
> >> - Fit with existing per-flow IPFIX sampling is equally good since a 
> >> combined
> >> IPFIX/psample action would just add both "userspace" and "emit_sample" 
> >> actions
> >> will be nested inside "sample".
> >>
> >> - "trunc" can now be inside the "sample" nested actions. It should not 
> >> modify
> >> the packet so, in theory, this should not cause a packet clone.
> >>
> >> Cons
> >> ~~~~
> >> - The probability is now longer available directly from the context of
> >> "emit_sample". We would need to carry the probability as metadata (in 
> >> private
> >> skb section) from the outer "sample" to the inner "emit_sample". This 
> >> mechanism
> >> is not using explicit actions so it should be very clearly documented.
> >>
> >> - offloading to tc might be a bit more complicated as we need the 
> >> probability
> >> for act_sample. We would need to introspect the "sample" actions to see if 
> >> it
> >> _only_ contains "emit_sample" and only then replace it with act_sample.
> >
> > This is not a real con, because TC doesn't have anything like a generic
> > sample() action with an action list.  And offload of sample() action is
> > not supported today.  We will always have to introspect the content of
> > the sample() action's action list, because we'll be able to offload only
> > a very narrow subset of these inner actions.  Basically, emit_sample()
> > will be an only offloadable option, since sFlow-via-psample approach
> > doesn't seem to be possible.
> >
> >>
> >>
> >> Option C: Using a special vport
> >> ===============================
> >> Credits to Aaron: Adding a special vport type that sends packets to
> >> psample. The
> >> action would look like this:
> >>    sample(probability=50%, actions(trunc(100),output(pX))
> >>
> >> Pros
> >> ~~~~
> >> - It also aligns well with the datapath building blocks. It is very clear
> >> semantically: "send it to observation"
> >>
> >> - From ovs-vswitchd pov, it resembles per-flow mirroring which is
> >> well integrated.
> >
> > Note: there is no such thing as per-flow mirroring today.  But this would
> > be an implementation of it.
> >
> >>
> >> - Statistics are generated in the kernel as other vports.
> >>
> >> Cons
> >> ~~~~
> >> - The probability is not present in the context of output, nor the output 
> >> has
> >> this concept currently. We would need to carry it over in the skb
> >> private area,
> >> same as for Option B.
> >> However, documenting this out-of-bands carry over of the sampling 
> >> probability
> >> and its interaction with other actions can be cleaner if this mechanism 
> >> only
> >> affects a new action than an existing one (specially if it depends on the
> >> underlying vport type).
> >>
> >> - The group and cookie is not on the output action either so we need new
> >> actions, lets call it SET_OBSERVABILITY_METADATA and a new
> >> non-masked key field
> >> that keeps that metadata around for the action to use it.
>
> Well that, or we could just attach those details in the OVS_CB on
> sample() call.
>

That's right. BTW, do you know if these fields survive a pass through
conntrack? I guess they do but conntrack could rewrite that area and I
don't see we're saving & restoring it.

> >> - Currently, "trunc" is implemented (i.e: packet actually gets
> >> trimmed) before
> >> the per-vport-type code is reached, causing the packet to be cloned
> >> inside the
> >> "sample" action. The plan for direct psample execution was to add an
> >> optimization that would memory copies if there were are no listeners in the
> >> multicast group. This nulls-out this potential optimization.
>
> This is somewhat buried here, but also I don't know that it is so bad.
> It will need some measurement, but I don't think we do a deep clone, so
> it shouldn't actually have too much performance hit (but skb allocations
> are expensive compared to not allocating them).
>

That's right, it should be a shallow copy.

> >> - Although not super-expensive (if skb is not modified), it _always_
> >> requires an
> >> skb_clone (even without "trunc").
> >>
> >> - It requires a very big amount of work in ovs-vswitchd:
> >>    - We'd need to decide who creates the port, if it's the user via
> >> Openflow and
> >> it's exposed as a port in a bridge or if it's a hidden vport created
> >> by the dpif
> >> layer.
> >>    - Controls need to be established to limit OFP actions to send traffic 
> >> to
> >> this port or it receiving traffic.
> >>    - DPDK datapath would probably require a new netdev_class as well.
> >>
> >> - tc offload is more complicated. If we want to use act_sample (I
> >> cannot think
> >> of a way that doesn't involve act_sample), we'd need to track
> >> "sample" actions
> >> and "set_observability_metadata) to be able to build an act_sample for 
> >> this.
>
> TC support would likely need its own separate discussion anyway, because
> sample action (the way OVS module uses it) doesn't directly map in TC,
> IIUC.
>
> >> Ilya, Aaron, Eelco, did I miss something?
> >>
> >> Thoughts? Preferences?
> >
> > I think, option C is a little too heavy for the end functionality for the 
> > user.
> > And I'm not sure if we'll be able to reuse this infrastructure for
> > anything else.
>
> I guess this is somewhat of a precedent setting feature for sampling to
> different backends.  For example, if tomorrow there is another sample
> interface for, ex. dpdk.  If this is implemented as emit_sample()
> action, we can implement this per-datapath.  But maybe something else
> comes that would be datapath agnostic and we need some new action for
> that as well.  So I was thinking perhaps it makes sense to use an analog
> to mirror that looks "natural."
>
> BUT, that may never happen and if it doesn't, then as you note, we do
> lots of infra work for a feature that only has one user.
>
> I'm not convinced that there won't be a user, because new interfaces pop
> up all the time.  But I also don't feel so strongly that it needs to be
> done this way.  The downside of implementing B is that the second one
> that comes will require more actions to be implemented.  But we will
> still need more vport types to be implemented instead if option C.  So I
> don't see too much benefit there.
>
> From a users perspective, debugging with interfaces is maybe a bit
> easier.  For example, we can track some statistics and support some
> additional kinds of controls (like up/down of the interface to turn
> samples on/off without needing to rewrite the flows in kernel).  But
> probably the real end user of the solution would be using something else
> like retis or some other fancy UI that hides all the details anyway.
>
> > I'm not a fan of the design breaking part of the A and duplication of
> > attributes.
>
> +1
>
> > So, I'd prefer the B here, but I'm a little biased.
>
> I'm okay with either B or C approach.
>

Thanks you all very for your feedback.

Option B appears as a viable way forward. I'll continue working on that
direction.

--
Adrián

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to