Re: [ovs-discuss] [ovn-controller] Possible duplicated sampling in OpenFlow flows causing IPFIX duplication

Adrián Moreno via discuss Tue, 20 May 2025 01:56:24 -0700

On Tue, May 20, 2025 at 09:34:28AM +0700, Trọng Đạt Trần wrote:
> Hi Dumitru,
>
> Thanks again for confirming the behavior of the sampling rate.
> If the probability is applied per flow, then the current sample_collector
> design is indeed sufficient — we can use metadata to separate ACL sampling
> domains between different teams, which addresses my concern about traffic
> imbalance.
>
> Regarding psample, I’d like to confirm whether there's a minimum required
> kernel version for it to work properly.
>
> From the Q&A section in your OVSCON’24 presentation, I understood that
> kernel *6.11* might be required — but I also saw that psample was
> introduced upstream as early as *4.11*.
>
> My current setup is:
>
>    -
>
>    *OVS Library version*: 3.4.0
>    -
>
>    *Linux Kernel*: 5.15.0-134-generic (Ubuntu)
>
> So far, I haven’t observed any psample activity. Is there a specific kernel
> OVS configuration or module that must be enabled for psample flows to be
> generated?
> If this is something I should explore further on my own, I completely
> understand — just wanted to double-check before diving deeper.


Hi Oscar.

I'm afraid you need a more up-to-date kernel. The feature was merged in
6.11 upstream [1]. It's not a feature in psample (although it had to be
sligthly modified), but a feature in OVS to be able to use psample.

First thing to verify is that psample and OVS support the feature. The
easiest way is to look for "Datapath supports psample action" in
ovs-vswitchd logs.

Then, you need to enable it by configuring the
`Flow_Sample_Collector_Set` entry with an integer value in
`local_group_id` instead of an IPFIX target.

When you do that, you should see the datapath flows contain an action
called `psample` instead of the `flow_sample` action which you are
already familiar with.

After making sure `psample` is being called, your next step is to
consume the samples. psample will create a netlink multicast group, so
you can create a program that attaches to it and reads the samples. An
example is available here [2]. Alternatively, if you want to read
samples even faster, you could sniff the sample directly from
`psample_sample_packet` using eBPF.

Either way, you'll get both the packet content as well as a "cookie"
(SAMPLE_ATTR_USER_COOKIE is the name of the nl attribute). The cookie is
a 64bit value result of concatenating the observation domain and
observation point ids (that you'd get in IPFIX) in network byte order.

Happy hacking.
Adrián

[1] 
https://github.com/torvalds/linux/commit/aae0b82b46cb5004bdf82a000c004d69a0885c33
[2] 
https://github.com/ovn-kubernetes/ovn-kubernetes/blob/75fe04ca6b6df3dc5fcce8c5e81b125c3692ccb9/go-controller/observability-lib/parse_sample.go


>
> Thanks again for all the support and insight you've shared.
>
> Best regards,
> *Oscar*
> On Mon, May 19, 2025 at 3:46 PM Dumitru Ceara <dce...@redhat.com> wrote:
>
> > On 5/19/25 9:01 AM, Trọng Đạt Trần wrote:
> > > Hi Dumitru,
> > >
> >
> > Hi Oscar,
> >
> > > I’d like to verify my understanding of how sampling behaves under traffic
> > > imbalance, specifically when multiple ACLs use the *same
> > sample_collector*.
> > > ------------------------------
> > > 🔧 Simplified Scenario
> > >
> > >    -
> > >
> > >    *ACL_A* (Team A) is configured with:
> > >    -
> > >
> > >       sample action → metadata: 100
> > >       -
> > >
> > >       uses sample_collector_share
> > >       -
> > >
> > >    *ACL_B* (Team B) is configured with:
> > >    -
> > >
> > >       sample action → metadata: 200
> > >       -
> > >
> > >       uses the *same* sample_collector_share
> > >       -
> > >
> > >    sample_collector_share is configured with:
> > >    -
> > >
> > >       probability = 6553 (10%)
> > >
> > > Now assume the following:
> > >
> > >    -
> > >
> > >    90 packets match *ACL_A*
> > >    -
> > >
> > >    10 packets match *ACL_B*
> > >
> > > ------------------------------
> > > ❓Question
> > >
> > > Which of the two behaviors should I expect?
> > >
> > > *(1)* A total of *10 packets randomly sampled* from the full 100 packets,
> > > regardless of metadata (since the sample configuration share the same
> > > sample_collector);
> > > *or*
> > > *(2)* A *proportional sampling* outcome:
> > >
> > >    -
> > >
> > >    9 packets sampled from ACL_A (90 × 10%)
> > >    -
> > >
> > >    1 packet sampled from ACL_B (10 × 10%)
> > >
> > > ------------------------------
> > > 📖 Documentation vs. OpenFlow Action
> > >
> > > The OVN NB schema documentation under Sample_Collector suggests the
> > *first*
> > > interpretation:
> > >
> > > “Probability: Sampling probability for this collector.”
> > >
> > > However, based on your earlier explanation and the OpenFlow action:
> > >
> > > flow_sample(probability=65535, collector_set_id=2, obs_domain_id=...,
> > > obs_point_id=...)
> > >
> > > ... I’m inclined to believe the *second* interpretation is correct, since
> > > each sample action is independently applied with its own metadata
> > > (obs_point_id), even if they point to the same sample_collector.
> > > ------------------------------
> > >
> > > Could you kindly confirm which interpretation is correct?
> > > ------------------------------
> >
> > You're right, the second interpretation is correct.  The probability is
> > associated with the OVS sample() actions the resulting openflows will
> > have.  So, with your example, for the two ACLs we'll have two different
> > openflow rules with different sample actions so the probability will be
> > applied per flow.
> >
> > Maybe we should update the OVN documentation to make it clearer.
> >
> > > 📖 Sample Performance
> > >
> > > Thank you for pointing me to your OVSCON'24 presentation — I had missed
> > it
> > > earlier. It was very informative and gave me a much better understanding
> > of
> > > the potential performance bottlenecks in the current sampling design.
> > >
> > > I'll make sure to explore those aspects further in my upcoming tests.
> > >
> >
> > Ack.
> >
> > > Regarding *psample*, I’d be happy to evaluate its performance when it’s
> > > ready or when support becomes stable in OVN environments. It seems like a
> > > promising direction to offload sampling and reduce vswitchd overhead.
> > >
> >
> > Yes, that was the goal, reduce latency and reduce load on vswitchd.
> >
> > The psample support is stable (part of OVS release 3.4.0 - August 2024).
> >  The only requirement is that the running kernel also supports the
> > datapath action.
> >
> > Regards,
> > Dumitru
> >
> > > Best regards,
> > >
> > > *Oscar*
> > >
> > > On Fri, May 16, 2025 at 8:03 PM Dumitru Ceara <dce...@redhat.com> wrote:
> > >
> > >> On 5/16/25 6:07 AM, Trọng Đạt Trần wrote:
> > >>> Dear Dumitru,
> > >>>
> > >>
> > >> Hi Oscar,
> > >>
> > >>> Thank you for confirming the bug — I’m happy to help however I can.
> > >>> ------------------------------
> > >>> I. Temporary Workaround & Feedback
> > >>>
> > >>> To work around the IPFIX duplication issue in the meantime, I’ve
> > >>> implemented a post-processing filter that divides duplicate samples by
> > >> two.
> > >>> The logic relies on two elements:
> > >>>
> > >>>    1.
> > >>>
> > >>>    *Source and destination MAC addresses* to detect reply traffic from
> > >> VM →
> > >>>    router port.
> > >>>    2.
> > >>>
> > >>>    *Sample metadata* (from the sample entry) to ensure that the match
> > >> comes
> > >>>    from a to-lport ACL.
> > >>>
> > >>> This combination seems to reliably identify duplicated samples. I've
> > >> tested
> > >>> this across multiple scenarios and it works well so far.
> > >>>
> > >>> *Do you foresee any edge cases where this workaround might break down
> > or
> > >>> behave incorrectly?*
> > >>
> > >> At a first glance this seems OK to me.
> > >>
> > >>> ------------------------------
> > >>> II. Questions Regarding OVN Sampling 1. *Sample Collector Table Limits*
> > >>>
> > >>> In my deployment, multiple teams share the network, but generate highly
> > >>> imbalanced traffic. For example:
> > >>>
> > >>>    -
> > >>>
> > >>>    Team A sends 90% of total traffic.
> > >>>    -
> > >>>
> > >>>    Team B sends only 10%.
> > >>>
> > >>> If I configure a shared sample_collector with probability = 6553
> > (≈10%),
> > >>> there’s a chance Team A may generate most or all samples while Team B’s
> > >>> traffic may not be captured at all.
> > >>>
> > >>
> > >> Is traffic from Team A and Team B hitting the same ACLs?  Can't the ACLs
> > >> be partitioned (different port groups) per team?  Then you'd be able to
> > >> use different Sample.metadata for different teams.
> > >>
> > >>> Furthermore, the IPFIX table in the ovsdb would set cache_max_flows
> > >> limits
> > >>> causing team A and B could not be configured on the same set_id.
> > >>>
> > >>> To solve this, I configure one sample_collector per team (different
> > >> set_ids),
> > >>> so each has independent sampling:
> > >>>
> > >>> sample_collector "team_a": id=2, set_id=2
> > >>> sample_collector "team_b": id=1, set_id=1
> > >>>
> > >>> This setup works, but it introduces a potential limitation:
> > >>>
> > >>>    -
> > >>>
> > >>>    Since set_id is limited to 256 values, we can only support up to 256
> > >>>    teams (or Tenants).
> > >>>    -
> > >>>
> > >>>    In multi-tenant environments, this ceiling may be too low.
> > >>>
> > >>> Would it make sense to consider increasing this limit?
> > >>
> > >> Actually, the set_id shouldn't be limited to 8bits, it can be any 32-bit
> > >> value according to the schema:
> > >>
> > >> "set_id": {"type": {"key": {
> > >>     "type": "integer",
> > >>     "minInteger": 1,
> > >>     "maxInteger": 4294967295}}},
> > >>
> > >> As a side thing, now that you mention this, we only use the 8 LSB as
> > >> set_id in the flows we generate.  I think that's a bug and we should
> > >> fix it.  I posted a patch here:
> > >>
> > >> https://mail.openvswitch.org/pipermail/ovs-dev/2025-May/423409.html
> > >>
> > >> However, there is indeed a limit that allows at _most_ 255 unique
> > >> Sample_Collector NB records:
> > >>
> > >> "Sample_Collector": {
> > >>     "columns": {
> > >>         "id": {"type": {"key": {
> > >>             "type": "integer",
> > >>             "minInteger": 1,
> > >>             "maxInteger": 255}}},
> > >>
> > >> That's because we need to store the NB Sample_Collector ID in the
> > >> conntrack mark of the session we're sampling.  CT mark is a 32bit
> > >> value and we use some bits in it for other features:
> > >>
> > >>     expr_symtab_add_subfield_scoped(symtab, "ct_mark.obs_collector_id",
> > >> NULL,
> > >>                                     "ct_mark[16..23]", WR_CT_COMMIT);
> > >>
> > >> Looking at the current code I _think_ we have 8 more bits
> > >> available.  However, expanding the ct_mark.obs_collector_id to use
> > >> the whole remainder of ct_mark (64K values) seems "risky" because
> > >> we don't know before hand if we'll need more bits for other features
> > >> in the future.
> > >>
> > >> Do you have a suggestion of reasonable maximum limit for the number
> > >> of teams (users) in your use case?
> > >>
> > >>> 2. *Sampling Performance Considerations*
> > >>>
> > >>> Here is my current understanding — I’d appreciate confirmation or
> > >>> corrections:
> > >>>
> > >>>    -
> > >>>
> > >>>    Sampling performance is not heavily dependent on ovn-northd or
> > >>>    ovn-controller, since the generation of the sampling flow is
> > >>>    insignificant compared to many other features.
> > >>>    -
> > >>>
> > >>>    In ovs-vswitchd, both memory and CPU usage scale roughly linearly
> > with
> > >>>    the number of active OpenFlow rules using sample(...) actions and
> > the
> > >>>    rate at which those samples are triggered and exported.
> > >>>    -
> > >>>
> > >>>    Under high load, performance can be tuned using the
> > >> cache_active_timeout
> > >>>    and cache_max_flows fields in the IPFIX table. These parameters
> > >> control
> > >>>    export frequency and the size of the flow cache, allowing a balance
> > >> between
> > >>>    monitoring fidelity and resource efficiency.
> > >>>
> > >>> Is this an accurate summary? Or are there other scaling or bottleneck
> > >>> factors I should consider?
> > >>
> > >> I'm not sure if you're aware but OVS (with the kernel netlink datapath
> > and
> > >> on relatively new kernels) supports a different way of sampling,
> > psample.
> > >>
> > >> https://github.com/openvswitch/ovs/commit/1a3bd96
> > >>
> > >> This avoids sending packets all together to vswitchd and allows better
> > >> sampling performance.
> > >>
> > >> This might give more insights, a presentation from OVSCON'24 with an
> > end to
> > >> end solution for sampling network policies (ACLs) with psample in
> > >> ovn-kubernetes:
> > >>
> > >> https://www.youtube.com/watch?v=gLwDsaiUuN4&t=2s
> > >>
> > >>> 3. *Separate Bug Regarding ACL Tier and Sampling*
> > >>>
> > >>> I’ve also observed an issue related to sampling and ACL tier
> > >> interactions.
> > >>> Would you prefer I continue in this thread or open a new one?
> > >>>
> > >>
> > >> It might be better to start a new thread.  Thanks again for trying this
> > >> new feature out!
> > >>
> > >>> Happy to follow your preferred workflow.
> > >>> ------------------------------
> > >>>
> > >>> Thanks again for your time and support.
> > >>>
> > >>> Best regards,
> > >>> *Oscar*
> > >>>
> > >>
> > >> Best regards,
> > >> Dumitru
> > >>
> > >>> On Wed, May 14, 2025 at 5:10 PM Dumitru Ceara <dce...@redhat.com>
> > wrote:
> > >>>
> > >>>> Hi Oscar,
> > >>>>
> > >>>> On 5/13/25 1:04 PM, Dumitru Ceara wrote:
> > >>>>> On 5/13/25 11:06 AM, Trọng Đạt Trần wrote:
> > >>>>>> Dear Dumitru,
> > >>>>>>
> > >>>>>
> > >>>>> Hi Oscar,
> > >>>>>
> > >>>>>> In the previous days, I’ve performed additional tests to gain better
> > >>>>>> understanding around the issue before giving you the details.
> > >>>>>>
> > >>>>>> Thank you for your earlier explanation, it clarified how conntrack
> > and
> > >>>>>> sampling work in the simple "|vm1 --- ls --- vm2"| topology.
> > However,
> > >> I
> > >>>>>> believe my original observations still hold in router related
> > >>>> topologies.
> > >>>>>>
> > >>>>>>
> > >> ------------------------------------------------------------------------
> > >>>>>>
> > >>>>>>
> > >>>>>>       Setup Recap
> > >>>>>>
> > >>>>>> *Topology*: vm_a(10.2.1.5) --- ls1 --- router --- ls2 --- vm_b
> > >>>> (10.2.3.5)
> > >>>>>>
> > >>>>>> ACLs applied to a shared Port Group (|pg_d559...|):
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     *ACL A*: |from-lport| – allow-related IPv4 (sample_est =
> > >> |2000000|)
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     *ACL B*: |to-lport| – allow-related ICMP (sample_est =
> > |1000000|)
> > >>>>>>
> > >>>>>> *Sample configuration*:
> > >>>>>>
> > >>>>>>   * ACL A: direction=from-lport, match="inport == @pg && ip4",
> > >>>>>>     sample_est=2000000
> > >>>>>>   * ACL B: direction=to-lport, match="outport == @pg && ip4 &&
> > icmp4",
> > >>>>>>     sample_est=1000000
> > >>>>>>
> > >>>>>>     # ovn-nbctl acl-list pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc
> > >>>>>>       from-lport  1002 (inport ==
> > >>>>>>     @pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc && ip4) allow-related
> > >>>>>>       to-lport  1002 (outport ==
> > >>>>>>     @pg_d559bf91_b95f_49c0_8e4a_bf35f15e1dcc && ip4 && ip4.src ==
> > >>>>>>     0.0.0.0/0 <http://0.0.0.0/0> && icmp4) allow-related
> > >>>>>>
> > >>>>>> |
> > >>>>>>
> > >> ------------------------------------------------------------------------
> > >>>>>>
> > >>>>>>
> > >>>>>>       Expected Behavior (based on your explanation)
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     *First ICMP request*: no sample (ct=new).
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     *First ICMP reply*:
> > >>>>>>
> > >>>>>>       o
> > >>>>>>
> > >>>>>>         One sample from *ingress pipeline* (sample_est = |1000000|)
> > >>>>>>
> > >>>>>>       o
> > >>>>>>
> > >>>>>>         One sample from *egress pipeline* (sample_est = |2000000|)
> > >>>>>>         → *Total: 2 samples* for reply --> True
> > >>>>>>
> > >>>>>>
> > >> ------------------------------------------------------------------------
> > >>>>>>
> > >>>>>>
> > >>>>>>       Actual Behavior Observed
> > >>>>>>
> > >>>>>> On the *first ICMP reply*, I see:
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     *3 samples total*:
> > >>>>>>
> > >>>>>>       o
> > >>>>>>
> > >>>>>>         *2 samples* in the *ingress pipeline*, both with |
> > >>>>>>         obs_point_id=1000000|
> > >>>>>>
> > >>>>>>       o
> > >>>>>>
> > >>>>>>         *1 sample* in the egress pipeline, with
> > |obs_point_id=2000000|
> > >>>>>>
> > >>>>>> This results in *duplicated sampling actions for a single logical
> > >>>>>> datapath flow* within the ingress pipeline.
> > >>>>>>
> > >>>>>> Evidence:
> > >>>>>>
> > >>>>>> # ovs-dpctl dump-flows | grep 10.2.1.5
> > >>>>>> recirc_id(0x1d5),in_port(6),ct_state(-new+est-rel+rpl-
> > >>>>>>
> > >>>>
> > >>
> > inv+trk),ct_mark(0x20020/0xff0031),ct_label(0xf4240000000000000000000000000),eth(src=fa:16:3e:6b:42:8e,dst=fa:16:3e:dd:02:c0),eth_type(0x0800),ipv4(src=10.2.1.5,dst=10.2.3.5,proto=1,ttl=64,frag=no),
> > >>>> packets:299, bytes:29302, used:0.376s,
> > >>>>
> > >>
> > actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554437,obs_point_id=1000000,output_port=4294967295)),userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554437,obs_point_id=1000000,output_port=4294967295)),ct_clear,set(eth(src=fa:16:3e:d5:7b:d1,dst=fa:16:3e:f8:af:7d)),set(ipv4(ttl=63)),ct(zone=21),recirc(0x1d6)
> > >>>>>> |# recirc_id(0x1d5): two flow_sample(...) actions with same metadata
> > >>>>>> (1000000)
> > >>>>>> recirc_id(0x1d6),in_port(6),ct_state(-new+est-rel+rpl-
> > >>>>>>
> > >>>>
> > >>
> > inv+trk),ct_mark(0x20000/0xff0031),ct_label(0x1e8480000000000000000000000000),eth(dst=fa:16:3e:f8:af:7d),eth_type(0x0800),ipv4(dst=10.2.3.5,frag=no),
> > >>>> packets:299, bytes:29302, used:0.376s,
> > >>>>
> > >>
> > actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554439,obs_point_id=2000000,output_port=4294967295)),9
> > >>>>>> |
> > >>>>>> |# plus one flow_sample(...) later in the pipeline with metadata
> > >>>> (2000000)|
> > >>>>>>
> > >>>>>> Also confirmed via IPFIX stats:
> > >>>>>>
> > >>>>>> # IPFIX before ping
> > >>>>>> |sampled pkts: 192758 # After a single ping sampled pkts: 192761 →
> > Δ =
> > >>>> 3|
> > >>>>>>
> > >>>>>>
> > >>>>>>       Additional Findings
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     The issue *only occurs* when VMs are on *separate logical
> > switches
> > >>>>>>     connected by a router*.
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     If both VMs are on the *same logical switch*, IPFIX is correctly
> > >>>>>>     sampled only once per ACL.
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     The duplicated sampling occurs *even if ACL A (IPv4) and ACL C
> > >>>>>>     (IPv6) are unrelated*, as long as both have |sample_est| and
> > >> belong
> > >>>>>>     to the same Port Group.
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     The error can be reproduced *even when only vm_a's Port Group
> > has
> > >>>>>>     the sampling ACLs*. vm_b does not require any sampling
> > >> configuration
> > >>>>>>     for the issue to occur.
> > >>>>>>
> > >>>>>
> > >>>>> Thanks a lot for the follow up!  You're right, this is indeed a bug.
> > >>>>> And that's because we don't clear the packet's ct_state (well all
> > >>>>> conntrack related information) when advancing to the egress pipeline
> > of
> > >>>>> a switch when the outport is one connected to a router.
> > >>>>>
> > >>>>> That's due to https://github.com/ovn-org/ovn/commit/d17ece7 where we
> > >>>>> chose to skip ct_clear if the switch has stateful (allow-related)
> > ACLs:
> > >>>>>
> > >>>>> "Also, this patch does not change the behavior for ACLs such as
> > >>>>> allow-related: packets are still sent to conntrack, even for router
> > >>>>> ports. While this does not work if router ports are distributed,
> > >>>>> allow-related ACLs work today on router ports when those ports are
> > >>>>> handled on the same chassis for ingress and egress traffic. This
> > patch
> > >>>>> does not change that behavior."
> > >>>>>
> > >>>>> On a second look, the above reasoning seems wrong.  It doesn't sound
> > OK
> > >>>>> to rely on conntrack state retrieved from a CT zone that's not
> > assigned
> > >>>>> to the logical port we're processing the packet on.
> > >>>>>
> > >>>>> I'm going to think about the right way to fix this issue and come
> > back
> > >>>>> to this thread once it's figured out.
> > >>>>>
> > >>>>
> > >>>> It turns out the fix is not necessarily that straight forward.  There
> > >>>> are a few different ways to address this though.  As we (Red Hat) are
> > >>>> also using this feature, I opened a ticket in our internal tracking
> > >>>> system so that we analyze it in more depth.
> > >>>>
> > >>>> https://issues.redhat.com/browse/FDP-1408
> > >>>>
> > >>>> However, if the OVN community in general is willing to look at fixing
> > >>>> this bug that would be great too.
> > >>>>
> > >>>> Regards,
> > >>>> Dumitru
> > >>>>
> > >>>>> Thanks again for the bug report!
> > >>>>>
> > >>>>> Regards,
> > >>>>> Dumitru
> > >>>>>
> > >>>>>>
> > >> ------------------------------------------------------------------------
> > >>>>>>
> > >>>>>>
> > >>>>>>       Another Reproducible Scenario (Minimal)
> > >>>>>>
> > >>>>>> Port Group A on |vm_a| with:
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     ACL A: |from-lport| IP4 (sample_est or not)
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     ACL B: |to-lport| ICMP |sample_est=1000000|
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     ACL C: |from-lport| IP6 sample_est=2000000
> > >>>>>>
> > >>>>>> Port Group B on |vm_b|:
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     No sampling required
> > >>>>>>
> > >>>>>>   *
> > >>>>>>
> > >>>>>>     ACL to allow from-lport and to-lport traffic
> > >>>>>>
> > >>>>>> When pinging |vm_a| from |vm_b|, the ICMP reply still results in
> > *two
> > >>>>>> samples with |obs_point_id=1000000|*.
> > >>>>>>
> > >>>>>>
> > >> ------------------------------------------------------------------------
> > >>>>>>
> > >>>>>>
> > >>>>>>       📌 Key Takeaway
> > >>>>>>
> > >>>>>> I believe this confirms the IPFIX duplication issue is *not due to
> > >>>>>> conntrack behavior*, but rather due to *how multiple ACLs with
> > >>>>>> sample_est on the same Port Group (in different directions) result
> > in
> > >>>>>> twice |userspace(flow_sample(...))| actions* in the same flow.
> > >>>>>>
> > >>>>>>
> > >> ------------------------------------------------------------------------
> > >>>>>>
> > >>>>>>
> > >>>>>>       To avoid overloading the email, I’ve included more detailed
> > >> output
> > >>>>>>       and explanations in the attachment.
> > >>>>>>
> > >>>>>>
> > >>>>>>       This email uses formatting elements such as icons, headers,
> > and
> > >>>>>>       dividers for clarity. If you experience any display issues,
> > >> please
> > >>>>>>       let me know and I’ll avoid using them in future messages.
> > >>>>>>
> > >>>>>>
> > >>>>>>       Please tell me if I can run any additional traces. I’m happy
> > to
> > >>>>>>       assist further.
> > >>>>>>
> > >>>>>>
> > >>>>>>       Best regards,
> > >>>>>>
> > >>>>>>
> > >>>>>>       *Oscar*
> > >>>>>>
> > >>>>>> |
> > >>>>>>
> > >>>>>>
> > >>>>>> On Fri, May 9, 2025 at 7:16 PM Dumitru Ceara <dce...@redhat.com
> > >>>>>> <mailto:dce...@redhat.com>> wrote:
> > >>>>>>
> > >>>>>>     On 5/9/25 2:14 PM, Dumitru Ceara wrote:
> > >>>>>>     > On 5/9/25 5:38 AM, Trọng Đạt Trần wrote:
> > >>>>>>     >> Hi Dimitru,
> > >>>>>>     >>
> > >>>>>>     >
> > >>>>>>     > Hi Oscar,
> > >>>>>>     >
> > >>>>>>     >
> > >>>>>>     >> Thank you for pointing that out.
> > >>>>>>     >>
> > >>>>>>     >> To clarify: the terms “inbound” and “outbound” in my previous
> > >>>> message
> > >>>>>>     >> were used from the *VM’s perspective*.
> > >>>>>>     >>
> > >>>>>>     >>
> > >>>>>>     >>       Topology:
> > >>>>>>     >>
> > >>>>>>     >> |vm_a ---- network1 ---- router ---- network2 ---- vm_b |
> > >>>>>>     >>
> > >>>>>>     >>
> > >>>>>>     >>       ACLs:
> > >>>>>>     >>
> > >>>>>>     >>   *
> > >>>>>>     >>
> > >>>>>>     >>     *ACL A*: allow-related VMs to *send* IPv4 traffic (|
> > >>>>>>     direction=from-
> > >>>>>>     >>     lport|)
> > >>>>>>     >>
> > >>>>>>     >>   *
> > >>>>>>     >>
> > >>>>>>     >>     *ACL B*: allow-related VMs to *receive* ICMP traffic (|
> > >>>>>>     direction=to-
> > >>>>>>     >>     lport|)
> > >>>>>>     >>
> > >>>>>>     >> I’ve attached both the *Northbound and Southbound database
> > >>>> dumps* to
> > >>>>>>     >> ensure the full context is available.
> > >>>>>>     >>
> > >>>>>>     >
> > >>>>>>     > Thanks for the info, I tried locally with a simplified setup
> > >>>> where I
> > >>>>>>     > emulate your topology:
> > >>>>>>     >
> > >>>>>>     > switch c9c171ef-849c-436d-b3f9-73d83b9c4e5d (ls)
> > >>>>>>     >     port vm2
> > >>>>>>     >         addresses: ["00:00:00:00:00:02"]
> > >>>>>>     >     port vm1
> > >>>>>>     >         addresses: ["00:00:00:00:00:01"]
> > >>>>>>     >
> > >>>>>>     > Those two VIFs are in a port group:
> > >>>>>>     >
> > >>>>>>     > # ovn-nbctl list port_group
> > >>>>>>     > _uuid               : 7e7a96b9-e708-4eea-b380-018314f2435c
> > >>>>>>     > acls                : [1d0e7b71-ff03-4c78-ace4-2448bf237e11,
> > >>>>>>     > 7cb023e9-fee5-4576-a67d-ce1f5d98805b]
> > >>>>>>     > external_ids        : {}
> > >>>>>>     > name                : pg
> > >>>>>>     > ports               : [d991baa6-21b0-4d46-a15d-71b9e8d6708d,
> > >>>>>>     > f2c5679c-d891-4d34-8402-8bc2047fba61]
> > >>>>>>     >
> > >>>>>>     > With two ACLs applied:
> > >>>>>>     > # ovn-nbctl acl-list pg
> > >>>>>>     > from-lport   100 (inport==@pg && ip4) allow-related
> > >>>>>>     >   to-lport   200 (outport==@pg && ip4 && icmp4) allow-related
> > >>>>>>     >
> > >>>>>>     > Both ACLs have only sampling for established traffic
> > >> (sample_est)
> > >>>> set:
> > >>>>>>     > # ovn-nbctl list acl
> > >>>>>>     > _uuid               : 1d0e7b71-ff03-4c78-ace4-2448bf237e11
> > >>>>>>     > action              : allow-related
> > >>>>>>     > direction           : from-lport
> > >>>>>>     > match               : "inport==@pg && ip4"
> > >>>>>>     > priority            : 100
> > >>>>>>     > sample_est          : 23153fae-0a73-4f86-bdf2-137e76647da8
> > >>>>>>     > sample_new          : []
> > >>>>>>     >
> > >>>>>>     > _uuid               : 7cb023e9-fee5-4576-a67d-ce1f5d98805b
> > >>>>>>     > action              : allow-related
> > >>>>>>     > direction           : to-lport
> > >>>>>>     > match               : "outport==@pg && ip4 && icmp4"
> > >>>>>>     > priority            : 200
> > >>>>>>     > sample_est          : 42391c82-23d2-4f2b-a7b9-88afaa68282c
> > >>>>>>     > sample_new          : []
> > >>>>>>     >
> > >>>>>>     > # ovn-nbctl list sample
> > >>>>>>     > _uuid               : 23153fae-0a73-4f86-bdf2-137e76647da8
> > >>>>>>     > collectors          : [82540855-dcd4-44e4-8354-e08a972500cd]
> > >>>>>>     > metadata            : 2000000
> > >>>>>>     >
> > >>>>>>     > _uuid               : 42391c82-23d2-4f2b-a7b9-88afaa68282c
> > >>>>>>     > collectors          : [82540855-dcd4-44e4-8354-e08a972500cd]
> > >>>>>>     > metadata            : 1000000
> > >>>>>>     >
> > >>>>>>     > Then I send a single ICMP echo packet from vm2 towards vm1.
> > The
> > >>>> ICMP
> > >>>>>>     > echo hits both ACLs but because it's the packet initiating the
> > >>>> session
> > >>>>>>     > doesn't generate a sample (sample_new is not set in the ACLs).
> > >>>>>>     Instead
> > >>>>>>     > 2 conntrack entries are created for the ICMP session:
> > >>>>>>     >
> > >>>>>>     > - one in the CT zone of vm2 - here the from-lport ACL is hit
> > so
> > >>>> the
> > >>>>>>     > sample_est metadata of the from-lport ACL (200000) is stored
> > >>>> along in
> > >>>>>>     > the conntrack state
> > >>>>>>     >
> > >>>>>>     > - one in the CT zone of vm1 - here the tolport ACL is hit so
> > the
> > >>>>>>     > sample_est metadata of the to-lport ACL (100000) is stored
> > along
> > >>>>>>     in the
> > >>>>>>     > conntrack state
> > >>>>>>     >
> > >>>>>>     > The ICMP echo packet reaches vm1 which replies with ICMP ECHO
> > >>>> Reply.
> > >>>>>>     >
> > >>>>>>     > For the reply the CT zone of vm1 is first checked, we match
> > the
> > >>>>>>     existing
> > >>>>>>     > conntrack entry (its state moves to "established") and a
> > sample
> > >>>>>>     for the
> > >>>>>>     > stored metadata, 100000, is generated.  Then, in the egress
> > >>>> pipeline,
> > >>>>>>     > the CT zone of vm2 is checked, we match the other existing
> > >>>> conntrack
> > >>>>>>     > entry (its state also moves to "established") and a sample for
> > >> the
> > >>>>>>     > stored metadata, 200000, is generated.
> > >>>>>>     >
> > >>>>>>     > This seems correct to me.  Stats also seem to confirm that:
> > >>>>>>     > # ip netns exec vm2 ping 42.42.42.2 -c1
> > >>>>>>     > PING 42.42.42.2 (42.42.42.2) 56(84) bytes of data.
> > >>>>>>     > 64 bytes from 42.42.42.2 <http://42.42.42.2>: icmp_seq=1
> > ttl=64
> > >>>>>>     time=1.46 ms
> > >>>>>>     >
> > >>>>>>     > --- 42.42.42.2 ping statistics ---
> > >>>>>>     > 1 packets transmitted, 1 received, 0% packet loss, time 0ms
> > >>>>>>     > rtt min/avg/max/mdev = 1.455/1.455/1.455/0.000 ms
> > >>>>>>     >
> > >>>>>>     > # ovs-ofctl dump-ipfix-flow br-int
> > >>>>>>     > NXST_IPFIX_FLOW reply (xid=0x2): 1 ids
> > >>>>>>     >   id   2: flows=2, current flows=0, sampled pkts=2, ipv4 ok=2,
> > >>>> ipv6
> > >>>>>>     > ok=0, tx pkts=11
> > >>>>>>     >           pkts errs=0, ipv4 errs=0, ipv6 errs=0, tx errs=11
> > >>>>>>     >
> > >>>>>>     > But then, when I increase the number of packets things become
> > >> more
> > >>>>>>     > interesting.  ICMP echos also generate samples.  And while
> > that
> > >>>> might
> > >>>>>>     > seem like a bug, it's not. :)
> > >>>>>>     >
> > >>>>>>     > When ping sends multiple packets for a single invocation it
> > uses
> > >>>> the
> > >>>>>>     > same ICMP ID and just increments the ICMP seq, e.g.:
> > >>>>>>     >
> > >>>>>>     > 14:07:41.986618 00:00:00:00:00:02 > 00:00:00:00:00:01,
> > ethertype
> > >>>> IPv4
> > >>>>>>     > (0x0800), length 98: (tos 0x0, ttl 64, id 58647, offset 0,
> > flags
> > >>>> [DF],
> > >>>>>>     > proto ICMP (1), length 84)
> > >>>>>>     >     42.42.42.3 > 42.42.42.2 <http://42.42.42.2>: ICMP echo
> > >>>>>>     request, id 35717, seq 1, length 64
> > >>>>>>     >
> > >>>>>>     > 14:07:42.988077 00:00:00:00:00:02 > 00:00:00:00:00:01,
> > ethertype
> > >>>> IPv4
> > >>>>>>     > (0x0800), length 98: (tos 0x0, ttl 64, id 59085, offset 0,
> > flags
> > >>>> [DF],
> > >>>>>>     > proto ICMP (1), length 84)
> > >>>>>>     >     42.42.42.3 > 42.42.42.2 <http://42.42.42.2>: ICMP echo
> > >>>>>>     request, id 35717, seq 2, length 64
> > >>>>>>     >
> > >>>>>>     > But conntrack doesn't use the ICMP ID in the key for the
> > session
> > >>>> it
> > >>>>>>     > installs:
> > >>>>>>
> > >>>>>>     Sorry about the typo, I meant to say "conntrack doesn't use the
> > >>>> ICMP SEQ
> > >>>>>>     in the key for the session it installs, it only uses the ICMP
> > ID".
> > >>>>>>
> > >>>>>>     >
> > >>>>>>     > # ovs-appctl dpctl/dump-conntrack | grep 42.42.42
> > >>>>>>     >
> > >>>>>>
> > >>>>
> > >>
> > icmp,orig=(src=42.42.42.3,dst=42.42.42.2,id=35628,type=8,code=0),reply=(src=42.42.42.2,dst=42.42.42.3,id=35628,type=0,code=0),zone=4,mark=131104,labels=0xf4240000000000000000000000000
> > >>>>>>     >
> > >>>>>>
> > >>>>
> > >>
> > icmp,orig=(src=42.42.42.3,dst=42.42.42.2,id=35628,type=8,code=0),reply=(src=42.42.42.2,dst=42.42.42.3,id=35628,type=0,code=0),zone=6,mark=131072,labels=0x1e8480000000000000000000000000
> > >>>>>>     >
> > >>>>>>     > So, subsequent ICMP requests will match on these two existing
> > >>>>>>     > established entries and (because sampling_est) is configured
> > >>>>>>     samples are
> > >>>>>>     > generated for them too.
> > >>>>>>     >
> > >>>>>>     > That's also visible in the datapath flows that forward packets
> > >> in
> > >>>> the
> > >>>>>>     > "original" direction (ICMP ECHOs in our case):
> > >>>>>>     >
> > >>>>>>     > # ovs-appctl dpctl/dump-flows | grep sample | grep '\-rpl'
> > >>>>>>     > recirc_id(0x29),in_port(3),ct_state(-new+est-rel-rpl-
> > >>>>>>
> > >>>>
> > >>
> > inv+trk),ct_mark(0x20000/0xff0071),ct_label(0x1e8480000000000000000000000000),eth(src=00:00:00:00:00:02,dst=00:00:00:00:00:01),eth_type(0x0800),ipv4(proto=1,frag=no),
> > >>>>>>     > packets:8, bytes:784, used:2.342s,
> > >>>>>>     >
> > >>>>>>
> > >>>>
> > >>
> > actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554434,obs_point_id=2000000,output_port=4294967295)),ct(commit,zone=6,mark=0x20000/0xff0071,label=0x1e8480000000000000000000000000/0xffffffffffff00000000000000000000,nat(src)),ct(zone=4),recirc(0x2a)
> > >>>>>>     >
> > >>>>>>     > recirc_id(0x2a),in_port(3),ct_state(-new+est-rel-rpl-
> > >>>>>>
> > >>>>
> > >>
> > inv+trk),ct_mark(0x20020/0xff0071),ct_label(0xf4240000000000000000000000000),eth(src=00:00:00:00:00:02,dst=00:00:00:00:00:00/ff:ff:00:00:00:00),eth_type(0x0800),ipv4(proto=1,frag=no),
> > >>>>>>     > packets:8, bytes:784, used:2.342s,
> > >>>>>>     >
> > >>>>>>
> > >>>>
> > >>
> > actions:userspace(pid=4294967295,flow_sample(probability=65535,collector_set_id=2,obs_domain_id=33554434,obs_point_id=1000000,output_port=4294967295)),ct(commit,zone=4,mark=0x20020/0xff0071,label=0xf4240000000000000000000000000/0xffffffffffff00000000000000000000,nat(src)),1
> > >>>>>>     >
> > >>>>>>     > So, for a less complicated test, maybe you should try with
> > >> UDP/TCP
> > >>>>>>     instead.
> > >>>>>>     >
> > >>>>>>     > I hope that clarifies your doubts.
> > >>>>>>     >
> > >>>>>>     > Best regards,
> > >>>>>>     > Dumitru
> > >>>>>>     >
> > >>>>>>     >> Best regards,
> > >>>>>>     >>
> > >>>>>>     >> Oscar
> > >>>>>>     >>
> > >>>>>>     >>
> > >>>>>>     >> On Thu, May 8, 2025 at 8:11 PM Dumitru Ceara <
> > >> dce...@redhat.com
> > >>>>>>     <mailto:dce...@redhat.com>
> > >>>>>>     >> <mailto:dce...@redhat.com <mailto:dce...@redhat.com>>>
> > wrote:
> > >>>>>>     >>
> > >>>>>>     >>     Hi Oscar,
> > >>>>>>     >>
> > >>>>>>     >>     On 5/6/25 12:31 PM, Trọng Đạt Trần wrote:
> > >>>>>>     >>     > As requested, I’ve attached additional tracing
> > >> information
> > >>>>>>     related to
> > >>>>>>     >>     > the sampling duplication issue.
> > >>>>>>     >>     >
> > >>>>>>     >>     >   *
> > >>>>>>     >>     >
> > >>>>>>     >>     >     The file |ofproto_trace.log| contains the full
> > output
> > >>>>>>     of |ofproto/
> > >>>>>>     >>     >     trace| commands.
> > >>>>>>     >>     >
> > >>>>>>     >>     >   *
> > >>>>>>     >>     >
> > >>>>>>     >>     >     The archive |ovn-detrace.tar.gz| includes six
> > >> separate
> > >>>>>>     files, each
> > >>>>>>     >>     >     corresponding to an |ovn-detrace| output for a
> > flow I
> > >>>>>>     believe is
> > >>>>>>     >>     >     involved in the duplicated sampling.
> > >>>>>>     >>     >
> > >>>>>>     >>     > Since I’m not fully confident in how to use |--ct-next
> > >>>>>>     option|, I’ve
> > >>>>>>     >>     > included traces for all six related flows to ensure
> > >>>>>>     completeness.
> > >>>>>>     >>     >
> > >>>>>>     >>     > Please let me know if you need further details, or if I
> > >>>>>>     should re-run
> > >>>>>>     >>     > any commands with additional options.
> > >>>>>>     >>     >
> > >>>>>>     >>
> > >>>>>>     >>     This seems fairly easy to reproduce locally for
> > >>>>>>     investigation; I didn't
> > >>>>>>     >>     try yet though.  However, would you mind sharing your OVN
> > >> NB
> > >>>>>>     database
> > >>>>>>     >>     file (I'm assuming this is a test environment)?
> > >>>>>>     >>
> > >>>>>>     >>     I would like to make sure we don't have any
> > >> misunderstanding
> > >>>>>>     because the
> > >>>>>>     >>     terms you use below in your ACL description (e.g.,
> > >>>>>>     "outbound"/"inbound")
> > >>>>>>     >>     are not standard terms.  Having the actual ACL (and the
> > >> rest
> > >>>>>>     of the NB)
> > >>>>>>     >>     contents will make it easier to debug.
> > >>>>>>     >>
> > >>>>>>     >>     Thanks,
> > >>>>>>     >>     Dumitru
> > >>>>>>     >>
> > >>>>>>     >>     > Best regards,
> > >>>>>>     >>     >
> > >>>>>>     >>     > *Oscar*
> > >>>>>>     >>     >
> > >>>>>>     >>     >
> > >>>>>>     >>     > On Tue, May 6, 2025 at 4:15 PM Adrián Moreno
> > >>>>>>     <amore...@redhat.com <mailto:amore...@redhat.com>
> > >>>>>>     >>     <mailto:amore...@redhat.com <mailto:amore...@redhat.com
> > >>
> > >>>>>>     >>     > <mailto:amore...@redhat.com <mailto:
> > amore...@redhat.com>
> > >>>>>>     <mailto:amore...@redhat.com <mailto:amore...@redhat.com>>>>
> > >> wrote:
> > >>>>>>     >>     >
> > >>>>>>     >>     >     On Tue, May 06, 2025 at 11:48:07AM +0700, Trọng Đạt
> > >>>>>>     Trần wrote:
> > >>>>>>     >>     >     > Dear Adrián,
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     > Thank you for your response. I’ve applied your
> > >>>>>>     suggestion to use
> > >>>>>>     >>     >     separate
> > >>>>>>     >>     >     > sample entries for each ACL. However, I am still
> > >>>> seeing
> > >>>>>>     >>     unexpected
> > >>>>>>     >>     >     behavior
> > >>>>>>     >>     >     > in the IPFIX output that I’d like to clarify.
> > >>>>>>     >>     >     > Test Setup (Same as Before)
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     > vm_a ---- network1 ---- router ---- network2 ----
> > >>>> vm_b
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >    -
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >    Two ACLs:
> > >>>>>>     >>     >     >    -
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >       ACL A: allow-related *outbound* IPv4
> > >>>>>>     >>     >     >       -
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >       ACL B: allow-related *inbound* ICMP
> > >>>>>>     >>     >     >       -
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >    ACLs applied symmetrically to both VMs.
> > >>>>>>     >>     >     >    -
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >    Test traffic: ICMP request from vm_b to vm_a,
> > >> and
> > >>>>>>     reply from
> > >>>>>>     >>     >     vm_a to vm_b
> > >>>>>>     >>     >     >    .
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     > Key Problem Observed
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     > When sampling is enabled on *both* ACLs, the
> > IPFIX
> > >>>>>>     record for
> > >>>>>>     >>     >     *flow (3)*
> > >>>>>>     >>     >     > (the ICMP reply from vm_a → router) shows *120
> > >>>>>>     packets/min*.
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     > However:
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >    -
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >    If *only ACL B* (inbound ICMP) is sampled →
> > (3)
> > >> =
> > >>>> 60
> > >>>>>>     >>     packets/min
> > >>>>>>     >>     >     >    -
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >    If *only ACL A* (outbound IP4) is sampled →
> > (3)
> > >>>>>>     not present
> > >>>>>>     >>     >     >    -
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     >    If both are sampled → (3) = 120 packets/min
> > >>>>>>     >>     >     >
> > >>>>>>     >>     >     > This suggests that *flow (3) is being sampled
> > >> twice*
> > >>>>>>     — even
> > >>>>>>     >>     though it
> > >>>>>>     >>     >     > represents a *single logical flow and matches
> > only
> > >>>>>>     ACL B*.
> > >>>>>>     >>     >     > IPFIX Observations
> > >>>>>>     >>     >     > FlowDescriptionExpectedActual
> > >>>>>>     >>     >     > (1) vm_b → router (ICMP request) 60 pkt/m 60
> > >>>>>>     >>     >     > (2) router → vm_a (ICMP request) 60 pkt/m 60
> > >>>>>>     >>     >     > (3) vm_a → router (ICMP reply) 60 pkt/m 120 ⚠️
> > >>>>>>     >>     >     > (4) router → vm_b (ICMP reply) 60 pkt/m 60
> > >>>>>>     >>     >
> > >>>>>>     >>     >     This is not what I'd expect, maybe Dumitru knows?
> > >>>>>>     >>     >
> > >>>>>>     >>     >     Could you attach ofproto/trace and ovn-detrce
> > outputs
> > >>>>>>     from both
> > >>>>>>     >>     >     directions?
> > >>>>>>     >>     >
> > >>>>>>     >>     >     Thanks.
> > >>>>>>     >>     >     Adrián
> > >>>>>>     >>     >
> > >>>>>>     >>
> > >>>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >
> >
> >

_______________________________________________
discuss mailing list
disc...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-discuss

Re: [ovs-discuss] [ovn-controller] Possible duplicated sampling in OpenFlow flows causing IPFIX duplication

Reply via email to