On Wed, May 22, 2024 at 03:26:05PM GMT, Aaron Conole wrote: > Ilya Maximets <[email protected]> writes: > > > On 5/16/24 19:03, Adrian Moreno wrote: > >> > >> > >> On 4/24/24 9:53 PM, Adrian Moreno wrote: > >>> This is the userspace counterpart of the work being done in the kernel > >>> [1]. Sending it as RFC to get some early feedback on the overall > >>> solution. > >>> > >>> ** Problem description ** > >>> Currently, OVS supports several observability features, such as > >>> per-bridge IPFIX, per-flow IPFIX and sFlow. However, given the > >>> complexity of OVN-generated pipelines, a single sample is typically not > >>> enough to troubleshoot what is OVN/OVS is doing with the packet. We need > >>> highler level metadata alongside the packet sample. > >>> > >>> This can be achieved by the means of per-flow IPFIX sampling and > >>> NXAST_SAMPLE action, through which OVN can add two arbitrary 32-bit > >>> values (observation_domain_id and observation_point_id) that are sent > >>> alongside the packet information in the IPFIX frames. > >>> > >>> However, there is a fundamental limitation of the existing sampling > >>> infrastructure, specially in the kernel datapath: samples are handled by > >>> ovs-vswitchd, forcing the need for an upcall (userspace action). The > >>> fact that samples share the same netlink sockets and handler thread cpu > >>> time with actual packet misse, can easily cause packet drops making this > >>> solution unfit for use in highly loaded production systems. > >>> > >>> Users are left with little option than guessing what sampling rate will > >>> be OK for their traffic pattern and dealing with the lost accuracy. > >>> > >>> ** Feature Description ** > >>> In order to solve this situation and enable this feature to be safely > >>> turned on in production, this RFC uses the psample support proposed in > >>> [1] to send NXAST_SAMPLE samples to psample adding the observability > >>> domain and point information as metadata. > >>> > >>> ~~ API ~~ > >>> The API is simply a new field called "psample_group" in the > >>> Flow_Sample_Collector_Set (FSCS) table. Configuring this value is > >>> orthogonal to also associating the FSCS entry to an entry in the IPFIX > >>> table. > >>> > >>> Apart from that configuration, the controller needs to add NXAST_SAMPLE > >>> actions that refer the entry's id. > >>> > >>> ~~ HW Offload ~~ > >>> psample is already supported by tc using the act_sample action. The > >>> metadata is taken from the actions' cookie. > >>> This series also adds support for offloading the odp sample action, > >>> only when it includes psample information but not nested actions. > >>> > >>> A bit of special care has to be taken in the tc layer to not store the > >>> ufid in the sample's cookie as it's used to carry action-specific data. > >>> > >>> ~~ Datapath support ~~ > >>> This new behavior of the datapth sample action is only supported on the > >>> kernel datapath. A more detailed analysis is needed (and planned) to > >>> find a way to also optimize the userspace datapath. However, although > >>> potentially inefficient, there is not that much risk of dropping packets > >>> with the existing IPFIX infrastructure. > >>> > >>> ~~ Testing ~~ > >>> The series includes an utility program called "ovs-psample" that listens > >>> to packets multicasted by the psample module and prints them (also > >>> printing the obs_domain and obs_point ids). In addition the kernel > >>> series includes a tracepoint for easy testing and troubleshooting. > >>> > >>> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=847473 > >>> > >>> > >> > >> Hi all, > >> > >> I had an offline meeting with Eelco, Ilya and Aaron about this topic and > >> wanted > >> to put out what was discussed and hopefully get more feedback. > >> > >> In general, 3 options were considered: > >> > >> Option A: Reusing the sample action > >> =================================== > >> This is essentially what this proposal (and it's kernel counterpart) > >> consists > >> in. The datapath action would look like this: > >> > >> sample(probability=50%, actions(...), group=10, cookie=0x123) > >> > >> Pros > >> ~~~~ > >> - In userspace, it fits nicely with the existing per-flow sampling > >> infrastructure as this RFC exemplifies. > >> > >> - The probability is present in the context of sending the packet to > >> psample so > >> it's easily carried to psample's rate, making the consumer aware of the > >> accuracy > >> of the sample. > >> > >> - Relatively easy to implement in tc as a act_sample exists with similar > >> semantics. > >> > >> Cons > >> ~~~~ > >> - It breaks the original design of the "sample" action. The "sample" action > >> means: The packet is sampled (a probability is calculated) and, if the > >> result is > >> positive, a set of nested actions is executed. This follows the > >> "building-blocks" approach of datapath action. This option breaks this > >> assumption by adding more semantics and behavior to the "sample" action. > >> > >> - If we want to add "trunc" to this sampling, the result would probably > >> work but > >> is not very nice because we need it outside of the sample() action, e.g: > >> trunc(100),sample(probability=50%, actions(...), group=10, > >> cookie=0x123),trunc(0) > >> > >> - Makes the uAPI a bit clunky by adding more attributes to an existing, > >> simple > >> action. Also, new attributes and existing nested actions are completely > >> orthogonal so code needs to exist in both userspace and kernel to properly > >> split > >> this behavior. > >> > >> - When "trunc" is used, psample will report the original skb length, > >> regardless > >> of whether the "trunc" is associated to the sample action or not. Not sure > >> this > >> is a huge problem nor if it's easy or worth doing it better. > >> > >> Option B: Creating a new action: emit_action > >> ============================================ > >> Credits to Ilya. Creating a new action to implement this would fit into > >> the rest > >> of the actions yielding a flow such: > >> > >> sample(probability=50%, > >> actions(trunc(100),emit_sample(group=10,cookie=0x123))) > >> > >> Pros > >> ~~~~ > >> - It better aligns with datapath action design by reusing existing > >> building blocks. > >> > >> - Code additions are probably cleaner and easier to review as it's adding > >> a new > >> action instead of changing the behavior of an existing one. > >> > >> - Fit with existing per-flow IPFIX sampling is equally good since a > >> combined > >> IPFIX/psample action would just add both "userspace" and "emit_sample" > >> actions > >> will be nested inside "sample". > >> > >> - "trunc" can now be inside the "sample" nested actions. It should not > >> modify > >> the packet so, in theory, this should not cause a packet clone. > >> > >> Cons > >> ~~~~ > >> - The probability is now longer available directly from the context of > >> "emit_sample". We would need to carry the probability as metadata (in > >> private > >> skb section) from the outer "sample" to the inner "emit_sample". This > >> mechanism > >> is not using explicit actions so it should be very clearly documented. > >> > >> - offloading to tc might be a bit more complicated as we need the > >> probability > >> for act_sample. We would need to introspect the "sample" actions to see if > >> it > >> _only_ contains "emit_sample" and only then replace it with act_sample. > > > > This is not a real con, because TC doesn't have anything like a generic > > sample() action with an action list. And offload of sample() action is > > not supported today. We will always have to introspect the content of > > the sample() action's action list, because we'll be able to offload only > > a very narrow subset of these inner actions. Basically, emit_sample() > > will be an only offloadable option, since sFlow-via-psample approach > > doesn't seem to be possible. > > > >> > >> > >> Option C: Using a special vport > >> =============================== > >> Credits to Aaron: Adding a special vport type that sends packets to > >> psample. The > >> action would look like this: > >> sample(probability=50%, actions(trunc(100),output(pX)) > >> > >> Pros > >> ~~~~ > >> - It also aligns well with the datapath building blocks. It is very clear > >> semantically: "send it to observation" > >> > >> - From ovs-vswitchd pov, it resembles per-flow mirroring which is > >> well integrated. > > > > Note: there is no such thing as per-flow mirroring today. But this would > > be an implementation of it. > > > >> > >> - Statistics are generated in the kernel as other vports. > >> > >> Cons > >> ~~~~ > >> - The probability is not present in the context of output, nor the output > >> has > >> this concept currently. We would need to carry it over in the skb > >> private area, > >> same as for Option B. > >> However, documenting this out-of-bands carry over of the sampling > >> probability > >> and its interaction with other actions can be cleaner if this mechanism > >> only > >> affects a new action than an existing one (specially if it depends on the > >> underlying vport type). > >> > >> - The group and cookie is not on the output action either so we need new > >> actions, lets call it SET_OBSERVABILITY_METADATA and a new > >> non-masked key field > >> that keeps that metadata around for the action to use it. > > Well that, or we could just attach those details in the OVS_CB on > sample() call. >
That's right. BTW, do you know if these fields survive a pass through conntrack? I guess they do but conntrack could rewrite that area and I don't see we're saving & restoring it. > >> - Currently, "trunc" is implemented (i.e: packet actually gets > >> trimmed) before > >> the per-vport-type code is reached, causing the packet to be cloned > >> inside the > >> "sample" action. The plan for direct psample execution was to add an > >> optimization that would memory copies if there were are no listeners in the > >> multicast group. This nulls-out this potential optimization. > > This is somewhat buried here, but also I don't know that it is so bad. > It will need some measurement, but I don't think we do a deep clone, so > it shouldn't actually have too much performance hit (but skb allocations > are expensive compared to not allocating them). > That's right, it should be a shallow copy. > >> - Although not super-expensive (if skb is not modified), it _always_ > >> requires an > >> skb_clone (even without "trunc"). > >> > >> - It requires a very big amount of work in ovs-vswitchd: > >> - We'd need to decide who creates the port, if it's the user via > >> Openflow and > >> it's exposed as a port in a bridge or if it's a hidden vport created > >> by the dpif > >> layer. > >> - Controls need to be established to limit OFP actions to send traffic > >> to > >> this port or it receiving traffic. > >> - DPDK datapath would probably require a new netdev_class as well. > >> > >> - tc offload is more complicated. If we want to use act_sample (I > >> cannot think > >> of a way that doesn't involve act_sample), we'd need to track > >> "sample" actions > >> and "set_observability_metadata) to be able to build an act_sample for > >> this. > > TC support would likely need its own separate discussion anyway, because > sample action (the way OVS module uses it) doesn't directly map in TC, > IIUC. > > >> Ilya, Aaron, Eelco, did I miss something? > >> > >> Thoughts? Preferences? > > > > I think, option C is a little too heavy for the end functionality for the > > user. > > And I'm not sure if we'll be able to reuse this infrastructure for > > anything else. > > I guess this is somewhat of a precedent setting feature for sampling to > different backends. For example, if tomorrow there is another sample > interface for, ex. dpdk. If this is implemented as emit_sample() > action, we can implement this per-datapath. But maybe something else > comes that would be datapath agnostic and we need some new action for > that as well. So I was thinking perhaps it makes sense to use an analog > to mirror that looks "natural." > > BUT, that may never happen and if it doesn't, then as you note, we do > lots of infra work for a feature that only has one user. > > I'm not convinced that there won't be a user, because new interfaces pop > up all the time. But I also don't feel so strongly that it needs to be > done this way. The downside of implementing B is that the second one > that comes will require more actions to be implemented. But we will > still need more vport types to be implemented instead if option C. So I > don't see too much benefit there. > > From a users perspective, debugging with interfaces is maybe a bit > easier. For example, we can track some statistics and support some > additional kinds of controls (like up/down of the interface to turn > samples on/off without needing to rewrite the flows in kernel). But > probably the real end user of the solution would be using something else > like retis or some other fancy UI that hides all the details anyway. > > > I'm not a fan of the design breaking part of the A and duplication of > > attributes. > > +1 > > > So, I'd prefer the B here, but I'm a little biased. > > I'm okay with either B or C approach. > Thanks you all very for your feedback. Option B appears as a viable way forward. I'll continue working on that direction. -- Adrián _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
