On 5/16/24 19:03, Adrian Moreno wrote: > > > On 4/24/24 9:53 PM, Adrian Moreno wrote: >> This is the userspace counterpart of the work being done in the kernel >> [1]. Sending it as RFC to get some early feedback on the overall >> solution. >> >> ** Problem description ** >> Currently, OVS supports several observability features, such as >> per-bridge IPFIX, per-flow IPFIX and sFlow. However, given the >> complexity of OVN-generated pipelines, a single sample is typically not >> enough to troubleshoot what is OVN/OVS is doing with the packet. We need >> highler level metadata alongside the packet sample. >> >> This can be achieved by the means of per-flow IPFIX sampling and >> NXAST_SAMPLE action, through which OVN can add two arbitrary 32-bit >> values (observation_domain_id and observation_point_id) that are sent >> alongside the packet information in the IPFIX frames. >> >> However, there is a fundamental limitation of the existing sampling >> infrastructure, specially in the kernel datapath: samples are handled by >> ovs-vswitchd, forcing the need for an upcall (userspace action). The >> fact that samples share the same netlink sockets and handler thread cpu >> time with actual packet misse, can easily cause packet drops making this >> solution unfit for use in highly loaded production systems. >> >> Users are left with little option than guessing what sampling rate will >> be OK for their traffic pattern and dealing with the lost accuracy. >> >> ** Feature Description ** >> In order to solve this situation and enable this feature to be safely >> turned on in production, this RFC uses the psample support proposed in >> [1] to send NXAST_SAMPLE samples to psample adding the observability >> domain and point information as metadata. >> >> ~~ API ~~ >> The API is simply a new field called "psample_group" in the >> Flow_Sample_Collector_Set (FSCS) table. Configuring this value is >> orthogonal to also associating the FSCS entry to an entry in the IPFIX >> table. >> >> Apart from that configuration, the controller needs to add NXAST_SAMPLE >> actions that refer the entry's id. >> >> ~~ HW Offload ~~ >> psample is already supported by tc using the act_sample action. The >> metadata is taken from the actions' cookie. >> This series also adds support for offloading the odp sample action, >> only when it includes psample information but not nested actions. >> >> A bit of special care has to be taken in the tc layer to not store the >> ufid in the sample's cookie as it's used to carry action-specific data. >> >> ~~ Datapath support ~~ >> This new behavior of the datapth sample action is only supported on the >> kernel datapath. A more detailed analysis is needed (and planned) to >> find a way to also optimize the userspace datapath. However, although >> potentially inefficient, there is not that much risk of dropping packets >> with the existing IPFIX infrastructure. >> >> ~~ Testing ~~ >> The series includes an utility program called "ovs-psample" that listens >> to packets multicasted by the psample module and prints them (also >> printing the obs_domain and obs_point ids). In addition the kernel >> series includes a tracepoint for easy testing and troubleshooting. >> >> [1] https://patchwork.kernel.org/project/netdevbpf/list/?series=847473 >> >> > > Hi all, > > I had an offline meeting with Eelco, Ilya and Aaron about this topic and > wanted > to put out what was discussed and hopefully get more feedback. > > In general, 3 options were considered: > > Option A: Reusing the sample action > =================================== > This is essentially what this proposal (and it's kernel counterpart) consists > in. The datapath action would look like this: > > sample(probability=50%, actions(...), group=10, cookie=0x123) > > Pros > ~~~~ > - In userspace, it fits nicely with the existing per-flow sampling > infrastructure as this RFC exemplifies. > > - The probability is present in the context of sending the packet to psample > so > it's easily carried to psample's rate, making the consumer aware of the > accuracy > of the sample. > > - Relatively easy to implement in tc as a act_sample exists with similar > semantics. > > Cons > ~~~~ > - It breaks the original design of the "sample" action. The "sample" action > means: The packet is sampled (a probability is calculated) and, if the result > is > positive, a set of nested actions is executed. This follows the > "building-blocks" approach of datapath action. This option breaks this > assumption by adding more semantics and behavior to the "sample" action. > > - If we want to add "trunc" to this sampling, the result would probably work > but > is not very nice because we need it outside of the sample() action, e.g: > trunc(100),sample(probability=50%, actions(...), group=10, > cookie=0x123),trunc(0) > > - Makes the uAPI a bit clunky by adding more attributes to an existing, > simple > action. Also, new attributes and existing nested actions are completely > orthogonal so code needs to exist in both userspace and kernel to properly > split > this behavior. > > - When "trunc" is used, psample will report the original skb length, > regardless > of whether the "trunc" is associated to the sample action or not. Not sure > this > is a huge problem nor if it's easy or worth doing it better. > > Option B: Creating a new action: emit_action > ============================================ > Credits to Ilya. Creating a new action to implement this would fit into the > rest > of the actions yielding a flow such: > > sample(probability=50%, > actions(trunc(100),emit_sample(group=10,cookie=0x123))) > > Pros > ~~~~ > - It better aligns with datapath action design by reusing existing building > blocks. > > - Code additions are probably cleaner and easier to review as it's adding a > new > action instead of changing the behavior of an existing one. > > - Fit with existing per-flow IPFIX sampling is equally good since a combined > IPFIX/psample action would just add both "userspace" and "emit_sample" > actions > will be nested inside "sample". > > - "trunc" can now be inside the "sample" nested actions. It should not modify > the packet so, in theory, this should not cause a packet clone. > > Cons > ~~~~ > - The probability is now longer available directly from the context of > "emit_sample". We would need to carry the probability as metadata (in private > skb section) from the outer "sample" to the inner "emit_sample". This > mechanism > is not using explicit actions so it should be very clearly documented. > > - offloading to tc might be a bit more complicated as we need the probability > for act_sample. We would need to introspect the "sample" actions to see if it > _only_ contains "emit_sample" and only then replace it with act_sample.
This is not a real con, because TC doesn't have anything like a generic sample() action with an action list. And offload of sample() action is not supported today. We will always have to introspect the content of the sample() action's action list, because we'll be able to offload only a very narrow subset of these inner actions. Basically, emit_sample() will be an only offloadable option, since sFlow-via-psample approach doesn't seem to be possible. > > > Option C: Using a special vport > =============================== > Credits to Aaron: Adding a special vport type that sends packets to psample. > The > action would look like this: > sample(probability=50%, actions(trunc(100),output(pX)) > > Pros > ~~~~ > - It also aligns well with the datapath building blocks. It is very clear > semantically: "send it to observation" > > - From ovs-vswitchd pov, it resembles per-flow mirroring which is well > integrated. Note: there is no such thing as per-flow mirroring today. But this would be an implementation of it. > > - Statistics are generated in the kernel as other vports. > > Cons > ~~~~ > - The probability is not present in the context of output, nor the output has > this concept currently. We would need to carry it over in the skb private > area, > same as for Option B. > However, documenting this out-of-bands carry over of the sampling probability > and its interaction with other actions can be cleaner if this mechanism only > affects a new action than an existing one (specially if it depends on the > underlying vport type). > > - The group and cookie is not on the output action either so we need new > actions, lets call it SET_OBSERVABILITY_METADATA and a new non-masked key > field > that keeps that metadata around for the action to use it. > > - Currently, "trunc" is implemented (i.e: packet actually gets trimmed) > before > the per-vport-type code is reached, causing the packet to be cloned inside > the > "sample" action. The plan for direct psample execution was to add an > optimization that would memory copies if there were are no listeners in the > multicast group. This nulls-out this potential optimization. > > - Although not super-expensive (if skb is not modified), it _always_ requires > an > skb_clone (even without "trunc"). > > - It requires a very big amount of work in ovs-vswitchd: > - We'd need to decide who creates the port, if it's the user via Openflow > and > it's exposed as a port in a bridge or if it's a hidden vport created by the > dpif > layer. > - Controls need to be established to limit OFP actions to send traffic to > this port or it receiving traffic. > - DPDK datapath would probably require a new netdev_class as well. > > - tc offload is more complicated. If we want to use act_sample (I cannot > think > of a way that doesn't involve act_sample), we'd need to track "sample" > actions > and "set_observability_metadata) to be able to build an act_sample for this. > > Ilya, Aaron, Eelco, did I miss something? > > Thoughts? Preferences? I think, option C is a little too heavy for the end functionality for the user. And I'm not sure if we'll be able to reuse this infrastructure for anything else. I'm not a fan of the design breaking part of the A and duplication of attributes. So, I'd prefer the B here, but I'm a little biased. Best regards, Ilya Maximets. _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
