On 21 Feb 2024, at 17:45, Adrian Moreno wrote:

> On 2/21/24 16:42, Eelco Chaudron wrote:
>>
>>
>> On 21 Feb 2024, at 16:03, Adrian Moreno wrote:
>>
>>> On 2/21/24 15:49, Aaron Conole wrote:
>>>> Adrian Moreno <amore...@redhat.com> writes:
>>>>
>>>>> On 2/20/24 19:06, Aaron Conole wrote:
>>>>>> Eelco Chaudron <echau...@redhat.com> writes:
>>>>>>
>>>>>>> On 19 Feb 2024, at 19:57, Aaron Conole wrote:
>>>>>>>
>>>>>>>> Eelco Chaudron <echau...@redhat.com> writes:
>>>>>>>>
>>>>>>>>> On 12 Feb 2024, at 15:15, Aaron Conole wrote:
>>>>>>>>>
>>>>>>>>>> Aaron Conole <acon...@redhat.com> writes:
>>>>>>>>>>
>>>>>>>>>>> Eelco Chaudron <echau...@redhat.com> writes:
>>>>>>>>>>>
>>>>>>>>>>>> On 2 Feb 2024, at 11:31, Adrian Moreno wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 2/1/24 10:02, Eelco Chaudron wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 31 Jan 2024, at 18:03, Aaron Conole wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Eelco Chaudron <echau...@redhat.com> writes:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 25 Jan 2024, at 21:55, Aaron Conole wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> From: Kevin Sprague <ksprague0...@gmail.com>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> During normal operations, it is useful to understand when a 
>>>>>>>>>>>>>>>>> particular flow
>>>>>>>>>>>>>>>>> gets removed from the system. This can be useful when 
>>>>>>>>>>>>>>>>> debugging performance
>>>>>>>>>>>>>>>>> issues tied to ofproto flow changes, trying to determine 
>>>>>>>>>>>>>>>>> deployed traffic
>>>>>>>>>>>>>>>>> patterns, or while debugging dynamic systems where ports come 
>>>>>>>>>>>>>>>>> and go.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Prior to this change, there was a lack of visibility around 
>>>>>>>>>>>>>>>>> flow expiration.
>>>>>>>>>>>>>>>>> The existing debugging infrastructure could tell us when a 
>>>>>>>>>>>>>>>>> flow was added to
>>>>>>>>>>>>>>>>> the datapath, but not when it was removed or why.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This change introduces a USDT probe at the point where the 
>>>>>>>>>>>>>>>>> revalidator
>>>>>>>>>>>>>>>>> determines that the flow should be removed.  Additionally, we 
>>>>>>>>>>>>>>>>> track the
>>>>>>>>>>>>>>>>> reason for the flow eviction and provide that information as 
>>>>>>>>>>>>>>>>> well.  With
>>>>>>>>>>>>>>>>> this change, we can track the complete flow lifecycle for the 
>>>>>>>>>>>>>>>>> netlink
>>>>>>>>>>>>>>>>> datapath by hooking the upcall tracepoint in kernel, the flow 
>>>>>>>>>>>>>>>>> put USDT, and
>>>>>>>>>>>>>>>>> the revaldiator USDT, letting us watch as flows are added and 
>>>>>>>>>>>>>>>>> removed from
>>>>>>>>>>>>>>>>> the kernel datapath.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> This change only enables this information via USDT probe, so 
>>>>>>>>>>>>>>>>> it won't be
>>>>>>>>>>>>>>>>> possible to access this information any other way (see:
>>>>>>>>>>>>>>>>> Documentation/topics/usdt-probes.rst).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Also included is a script 
>>>>>>>>>>>>>>>>> (utilities/usdt-scripts/flow_reval_monitor.py)
>>>>>>>>>>>>>>>>> which serves as a demonstration of how the new USDT probe 
>>>>>>>>>>>>>>>>> might be used
>>>>>>>>>>>>>>>>> going forward.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Signed-off-by: Kevin Sprague <ksprague0...@gmail.com>
>>>>>>>>>>>>>>>>> Co-authored-by: Aaron Conole <acon...@redhat.com>
>>>>>>>>>>>>>>>>> Signed-off-by: Aaron Conole <acon...@redhat.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks for following this up Aaron! See comments on this patch
>>>>>>>>>>>> below. I have no additional comments on patch 2.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Eelco
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>      Documentation/topics/usdt-probes.rst         |   1 +
>>>>>>>>>>>>>>>>>      ofproto/ofproto-dpif-upcall.c                |  42 +-
>>>>>>>>>>>>>>>>>      utilities/automake.mk                        |   3 +
>>>>>>>>>>>>>>>>>      utilities/usdt-scripts/flow_reval_monitor.py | 653 
>>>>>>>>>>>>>>>>> +++++++++++++++++++
>>>>>>>>>>>>>>>>>      4 files changed, 693 insertions(+), 6 deletions(-)
>>>>>>>>>>>>>>>>>      create mode 100755 
>>>>>>>>>>>>>>>>> utilities/usdt-scripts/flow_reval_monitor.py
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/Documentation/topics/usdt-probes.rst
>>>>> b/Documentation/topics/usdt-probes.rst
>>>>>>>>>>>>>>>>> index e527f43bab..a8da9bb1f7 100644
>>>>>>>>>>>>>>>>> --- a/Documentation/topics/usdt-probes.rst
>>>>>>>>>>>>>>>>> +++ b/Documentation/topics/usdt-probes.rst
>>>>>>>>>>>>>>>>> @@ -214,6 +214,7 @@ Available probes in ``ovs_vswitchd``:
>>>>>>>>>>>>>>>>>      - dpif_recv:recv_upcall
>>>>>>>>>>>>>>>>>      - main:poll_block
>>>>>>>>>>>>>>>>>      - main:run_start
>>>>>>>>>>>>>>>>> +- revalidate:flow_result
>>>>>>>>>>>>>>>>>      - revalidate_ukey\_\_:entry
>>>>>>>>>>>>>>>>>      - revalidate_ukey\_\_:exit
>>>>>>>>>>>>>>>>>      - udpif_revalidator:start_dump
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> You are missing the specific flow_result result
>>>>> section. This is from the previous patch:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> D'oh!  Thanks for catching it.  I'll re-add it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> @@ -358,6 +360,27 @@  See also the ``main:run_start`` probe 
>>>>>>>>>>>>>>>> above.
>>>>>>>>>>>>>>>>      - ``utilities/usdt-scripts/bridge_loop.bt``
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +probe revalidate:flow_result
>>>>>>>>>>>>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +**Description**:
>>>>>>>>>>>>>>>> +This probe is triggered when the revalidator decides whether 
>>>>>>>>>>>>>>>> or not to
>>>>>>>>>>>>>>>> +revalidate a flow. ``reason`` is an enum that denotes that 
>>>>>>>>>>>>>>>> either the flow
>>>>>>>>>>>>>>>> +is being kept, or the reason why the flow is being deleted. 
>>>>>>>>>>>>>>>> The
>>>>>>>>>>>>>>>> +``flow_reval_monitor.py`` script uses this probe to notify 
>>>>>>>>>>>>>>>> users when flows
>>>>>>>>>>>>>>>> +matching user-provided criteria are deleted.
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +**Arguments**:
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +- *arg0*: ``(enum flow_del_reason) reason``
>>>>>>>>>>>>>>>> +- *arg1*: ``(struct udpif *) udpif``
>>>>>>>>>>>>>>>> +- *arg2*: ``(struct udpif_key *) ukey``
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +**Script references**:
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +- ``utilities/usdt-scripts/flow_reval_monitor.py``
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>      Adding your own probes
>>>>>>>>>>>>>>>>      ----------------------
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c 
>>>>>>>>>>>>>>>>> b/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>>>>>>>> index b5cbeed878..97d75833f7 100644
>>>>>>>>>>>>>>>>> --- a/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>>>>>>>> +++ b/ofproto/ofproto-dpif-upcall.c
>>>>>>>>>>>>>>>>> @@ -269,6 +269,18 @@ enum ukey_state {
>>>>>>>>>>>>>>>>>      };
>>>>>>>>>>>>>>>>>      #define N_UKEY_STATES (UKEY_DELETED + 1)
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +enum flow_del_reason {
>>>>>>>>>>>>>>>>> +    FDR_REVALIDATE = 0,     /* The flow was revalidated. */
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It was called FDR_FLOW_LIVE before, which might make more
>>>>>>>>>>>> sense. As the flow is just NOT deleted. It might or might not have
>>>>>>>>>>>> been revalidated. Thoughts?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think it had to have been revalidated if we emit the reason, 
>>>>>>>>>>>>>>> because
>>>>>>>>>>>>>>> we only emit the reason code after revalidation.  IE: there are 
>>>>>>>>>>>>>>> many
>>>>>>>>>>>>>>> places where we skip revalidation but the flow stays live - and 
>>>>>>>>>>>>>>> we don't
>>>>>>>>>>>>>>> emit reasons in those cases.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So at least for this patch, it MUST have been revalidated.  But 
>>>>>>>>>>>>>>> maybe in
>>>>>>>>>>>>>>> the future, we would want to catch cases where the flow hasn't 
>>>>>>>>>>>>>>> been.  In
>>>>>>>>>>>>>>> that case, it makes sense to add the FDR_FLOW_LIVE at that time 
>>>>>>>>>>>>>>> - I
>>>>>>>>>>>>>>> think.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maybe you disagree?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well, it depends on how you define revalidation, it might only 
>>>>>>>>>>>>>> have
>>>>>>>>>>>> updated the counters. i.e. it all depends on ‘bool need_revalidate 
>>>>>>>>>>>> =
>>>>>>>>>>>> ukey->reval_seq != reval_seq;’ in revalidate_ukey(). That was why I
>>>>>>>>>>>> opted for a more general name.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +    FDR_FLOW_IDLE,          /* The flow went unused and was 
>>>>>>>>>>>>>>>>> deleted. */
>>>>>>>>>>>>>>>>> +    FDR_TOO_EXPENSIVE,      /* The flow was too expensive to 
>>>>>>>>>>>>>>>>> revalidate. */
>>>>>>>>>>>>>>>>> +    FDR_FLOW_WILDCARDED,    /* The flow needed a narrower 
>>>>>>>>>>>>>>>>> wildcard mask. */
>>>>>>>>>>>>>>>>> +    FDR_BAD_ODP_FIT,        /* The flow had a bad ODP flow 
>>>>>>>>>>>>>>>>> fit. */
>>>>>>>>>>>>>>>>> +    FDR_NO_OFPROTO,         /* The flow didn't have an 
>>>>>>>>>>>>>>>>> associated ofproto. */
>>>>>>>>>>>>>>>>> +    FDR_XLATION_ERROR,      /* There was an error 
>>>>>>>>>>>>>>>>> translating the flow. */
>>>>>>>>>>>>>>>>> +    FDR_AVOID_CACHING,      /* Flow deleted to avoid 
>>>>>>>>>>>>>>>>> caching. */
>>>>>>>>>>>>>>>>> +    FDR_FLOW_LIMIT,         /* All flows being killed. */
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Looking at the comment from Han on FDR_PURGE, and this patch
>>>>>>>>>>>> needing another spin, we should probably add it.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I can do that, sure.  In that case, we will need to have a new 
>>>>>>>>>>>>>>> flow op
>>>>>>>>>>>>>>> added to revalidator_sweep__ so that we can catch it.  But in 
>>>>>>>>>>>>>>> that case,
>>>>>>>>>>>>>>> it will be a different usdt probe, so I still don't know if we 
>>>>>>>>>>>>>>> need
>>>>>>>>>>>>>>> FDR_PURGE right?  WDYT?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In revalidator_sweep__() you have sort of the following:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                     if (purge || ukey_state == 
>>>>>>>>>>>>>> UKEY_INCONSISTENT) {
>>>>>>>>>>>>>>                         result = UKEY_DELETE;
>>>>>>>>>>>>>>                     } else if (!seq_mismatch) {
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> And I’m afraid that if we use this tool to debug we miss the
>>>>>>>>>>>> ukey_state == UKEY_INCONSISTENT when debugging and spent a long 
>>>>>>>>>>>> time
>>>>>>>>>>>> figuring this out.
>>>>>>>>>>>>>> Maybe add something general like this (did not give it a lot of
>>>>>>>>>>>> thought), and only take the FDR_PURGE : FDR_UPDATE_FAIL results in 
>>>>>>>>>>>> the
>>>>>>>>>>>> script?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>      /* 'udpif_key's are responsible for tracking the little bit 
>>>>>>>>>>>>>> of state udpif
>>>>>>>>>>>>>> @@ -2991,13 +2993,13 @@ revalidator_sweep__(struct
>>>>> revalidator *revalidator, bool purge)
>>>>>>>>>>>>>>              uint64_t odp_actions_stub[1024 / 8];
>>>>>>>>>>>>>>              struct ofpbuf odp_actions = 
>>>>>>>>>>>>>> OFPBUF_STUB_INITIALIZER(odp_actions_stub);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -        enum flow_del_reason reason = FDR_REVALIDATE;
>>>>>>>>>>>>>>              struct ukey_op ops[REVALIDATE_MAX_BATCH];
>>>>>>>>>>>>>>              struct udpif_key *ukey;
>>>>>>>>>>>>>>              struct umap *umap = &udpif->ukeys[i];
>>>>>>>>>>>>>>              size_t n_ops = 0;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>              CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
>>>>>>>>>>>>>> +            enum flow_del_reason reason = FDR_REVALIDATE;
>>>>>>>>>>>>>>                  enum ukey_state ukey_state;
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                  /* Handler threads could be holding a ukey lock 
>>>>>>>>>>>>>> while it installs a
>>>>>>>>>>>>>> @@ -3016,8 +3018,10 @@ revalidator_sweep__(struct
>>>>> revalidator *revalidator, bool purge)
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>                      if (purge || ukey_state == 
>>>>>>>>>>>>>> UKEY_INCONSISTENT) {
>>>>>>>>>>>>>>                          result = UKEY_DELETE;
>>>>>>>>>>>>>> +                    reason = purge ? FDR_PURGE : 
>>>>>>>>>>>>>> FDR_UPDATE_FAIL;
>>>>>>>>>>>>>>                      } else if (!seq_mismatch) {
>>>>>>>>>>>>>>                          result = UKEY_KEEP;
>>>>>>>>>>>>>> +                    reason = FDR_REVALIDATE; //_KEEP
>>>>>>>>>>>>>>                      } else {
>>>>>>>>>>>>>>                          struct dpif_flow_stats stats;
>>>>>>>>>>>>>>                          COVERAGE_INC(revalidate_missed_dp_flow);
>>>>>>>>>>>>>> @@ -3030,6 +3034,8 @@ revalidator_sweep__(struct revalidator 
>>>>>>>>>>>>>> *revalidator, bool purge)
>>>>>>>>>>>>>>                          reval_op_init(&ops[n_ops++], result, 
>>>>>>>>>>>>>> udpif, ukey, &recircs,
>>>>>>>>>>>>>>                                        &odp_actions);
>>>>>>>>>>>>>>                      }
>>>>>>>>>>>>>> +                OVS_USDT_PROBE(revalidator_sweep__, 
>>>>>>>>>>>>>> flow_result, result,
>>>>>>>>>>>>>> +                               reason, udpif, ukey);
>>>>>>>>>>>>>>                  }
>>>>>>>>>>>>>>                  ovs_mutex_unlock(&ukey->mutex);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In addition in revalidator_sweep__() should the “enum
>>>>>>>>>>>> flow_del_reason reason = FDR_REVALIDATE;” not be moved to the
>>>>>>>>>>>> CMAP_FOR_EACH() loop?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +};
>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>      /* 'udpif_key's are responsible for tracking the little 
>>>>>>>>>>>>>>>>> bit of state udpif
>>>>>>>>>>>>>>>>>       * needs to do flow expiration which can't be pulled 
>>>>>>>>>>>>>>>>> directly from the
>>>>>>>>>>>>>>>>>       * datapath.  They may be created by any handler or 
>>>>>>>>>>>>>>>>> revalidator thread at any
>>>>>>>>>>>>>>>>> @@ -2272,7 +2284,8 @@ populate_xcache(struct udpif *udpif, 
>>>>>>>>>>>>>>>>> struct udpif_key *ukey,
>>>>>>>>>>>>>>>>>      static enum reval_result
>>>>>>>>>>>>>>>>>      revalidate_ukey__(struct udpif *udpif, const struct 
>>>>>>>>>>>>>>>>> udpif_key *ukey,
>>>>>>>>>>>>>>>>>                        uint16_t tcp_flags, struct ofpbuf 
>>>>>>>>>>>>>>>>> *odp_actions,
>>>>>>>>>>>>>>>>> -                  struct recirc_refs *recircs, struct 
>>>>>>>>>>>>>>>>> xlate_cache *xcache)
>>>>>>>>>>>>>>>>> +                  struct recirc_refs *recircs, struct 
>>>>>>>>>>>>>>>>> xlate_cache *xcache,
>>>>>>>>>>>>>>>>> +                  enum flow_del_reason *reason)
>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>          struct xlate_out *xoutp;
>>>>>>>>>>>>>>>>>          struct netflow *netflow;
>>>>>>>>>>>>>>>>> @@ -2293,11 +2306,13 @@ revalidate_ukey__(struct udpif *udpif,
>>>>>>>>>>>> const struct udpif_key *ukey,
>>>>>>>>>>>>>>>>>          netflow = NULL;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>          if (xlate_ukey(udpif, ukey, tcp_flags, &ctx)) {
>>>>>>>>>>>>>>>>> +        *reason = FDR_XLATION_ERROR;
>>>>>>>>>>>>>>>>>              goto exit;
>>>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>>>          xoutp = &ctx.xout;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>          if (xoutp->avoid_caching) {
>>>>>>>>>>>>>>>>> +        *reason = FDR_AVOID_CACHING;
>>>>>>>>>>>>>>>>>              goto exit;
>>>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -2311,6 +2326,7 @@ revalidate_ukey__(struct udpif *udpif,
>>>>>>>>>>> const struct udpif_key *ukey,
>>>>>>>>>>>>>>>>>              ofpbuf_clear(odp_actions);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>              if (!ofproto) {
>>>>>>>>>>>>>>>>> +            *reason = FDR_NO_OFPROTO;
>>>>>>>>>>>>>>>>>                  goto exit;
>>>>>>>>>>>>>>>>>              }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -2322,6 +2338,7 @@ revalidate_ukey__(struct udpif *udpif,
>>>>>>>>>>> const struct udpif_key *ukey,
>>>>>>>>>>>>>>>>>          if (odp_flow_key_to_mask(ukey->mask, ukey->mask_len, 
>>>>>>>>>>>>>>>>> &dp_mask, &ctx.flow,
>>>>>>>>>>>>>>>>>                                   NULL)
>>>>>>>>>>>>>>>>>              == ODP_FIT_ERROR) {
>>>>>>>>>>>>>>>>> +        *reason = FDR_BAD_ODP_FIT;
>>>>>>>>>>>>>>>>>              goto exit;
>>>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -2331,6 +2348,7 @@ revalidate_ukey__(struct udpif *udpif,
>>>>>>>>>>> const struct udpif_key *ukey,
>>>>>>>>>>>>>>>>>           * down.  Note that we do not know if the datapath 
>>>>>>>>>>>>>>>>> has ignored any of the
>>>>>>>>>>>>>>>>>           * wildcarded bits, so we may be overly conservative 
>>>>>>>>>>>>>>>>> here. */
>>>>>>>>>>>>>>>>>          if (flow_wildcards_has_extra(&dp_mask, ctx.wc)) {
>>>>>>>>>>>>>>>>> +        *reason = FDR_FLOW_WILDCARDED;
>>>>>>>>>>>>>>>>>              goto exit;
>>>>>>>>>>>>>>>>>          }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -2400,7 +2418,7 @@ static enum reval_result
>>>>>>>>>>>>>>>>>      revalidate_ukey(struct udpif *udpif, struct udpif_key 
>>>>>>>>>>>>>>>>> *ukey,
>>>>>>>>>>>>>>>>>                      const struct dpif_flow_stats *stats,
>>>>>>>>>>>>>>>>>                      struct ofpbuf *odp_actions, uint64_t 
>>>>>>>>>>>>>>>>> reval_seq,
>>>>>>>>>>>>>>>>> -                struct recirc_refs *recircs)
>>>>>>>>>>>>>>>>> +                struct recirc_refs *recircs, enum 
>>>>>>>>>>>>>>>>> flow_del_reason *reason)
>>>>>>>>>>>>>>>>>          OVS_REQUIRES(ukey->mutex)
>>>>>>>>>>>>>>>>>      {
>>>>>>>>>>>>>>>>>          bool need_revalidate = ukey->reval_seq != reval_seq;
>>>>>>>>>>>>>>>>> @@ -2430,8 +2448,12 @@ revalidate_ukey(struct udpif *udpif, 
>>>>>>>>>>>>>>>>> struct udpif_key *ukey,
>>>>>>>>>>>>>>>>>                      xlate_cache_clear(ukey->xcache);
>>>>>>>>>>>>>>>>>                  }
>>>>>>>>>>>>>>>>>                  result = revalidate_ukey__(udpif, ukey, 
>>>>>>>>>>>>>>>>> push.tcp_flags,
>>>>>>>>>>>>>>>>> -                                       odp_actions, recircs, 
>>>>>>>>>>>>>>>>> ukey->xcache);
>>>>>>>>>>>>>>>>> -        } /* else delete; too expensive to revalidate */
>>>>>>>>>>>>>>>>> +                                       odp_actions, recircs, 
>>>>>>>>>>>>>>>>> ukey->xcache,
>>>>>>>>>>>>>>>>> +                                       reason);
>>>>>>>>>>>>>>>>> +        } else {
>>>>>>>>>>>>>>>>> +            /* delete; too expensive to revalidate */
>>>>>>>>>>>>>>>>> +            *reason = FDR_TOO_EXPENSIVE;
>>>>>>>>>>>>>>>>> +        }
>>>>>>>>>>>>>>>>>          } else if (!push.n_packets || ukey->xcache
>>>>>>>>>>>>>>>>>                     || !populate_xcache(udpif, ukey, 
>>>>>>>>>>>>>>>>> push.tcp_flags)) {
>>>>>>>>>>>>>>>>>              result = UKEY_KEEP;
>>>>>>>>>>>>>>>>> @@ -2831,6 +2853,7 @@ revalidate(struct revalidator 
>>>>>>>>>>>>>>>>> *revalidator)
>>>>>>>>>>>>>>>>>              for (f = flows; f < &flows[n_dumped]; f++) {
>>>>>>>>>>>>>>>>>                  long long int used = f->stats.used;
>>>>>>>>>>>>>>>>>                  struct recirc_refs recircs = 
>>>>>>>>>>>>>>>>> RECIRC_REFS_EMPTY_INITIALIZER;
>>>>>>>>>>>>>>>>> +            enum flow_del_reason reason = FDR_REVALIDATE;
>>>>>>>>>>>>>>>>>                  struct dpif_flow_stats stats = f->stats;
>>>>>>>>>>>>>>>>>                  enum reval_result result;
>>>>>>>>>>>>>>>>>                  struct udpif_key *ukey;
>>>>>>>>>>>>>>>>> @@ -2905,9 +2928,14 @@ revalidate(struct revalidator 
>>>>>>>>>>>>>>>>> *revalidator)
>>>>>>>>>>>>>>>>>                  }
>>>>>>>>>>>>>>>>>                  if (kill_them_all || (used && used < now - 
>>>>>>>>>>>>>>>>> max_idle)) {
>>>>>>>>>>>>>>>>>                      result = UKEY_DELETE;
>>>>>>>>>>>>>>>>> +                if (kill_them_all) {
>>>>>>>>>>>>>>>>> +                    reason = FDR_FLOW_LIMIT;
>>>>>>>>>>>>>>>>> +                } else {
>>>>>>>>>>>>>>>>> +                    reason = FDR_FLOW_IDLE;
>>>>>>>>>>>>>>>>> +                }
>>>>>>>>>>>>>>>>>                  } else {
>>>>>>>>>>>>>>>>>                      result = revalidate_ukey(udpif, ukey, 
>>>>>>>>>>>>>>>>> &stats, &odp_actions,
>>>>>>>>>>>>>>>>> -                                         reval_seq, 
>>>>>>>>>>>>>>>>> &recircs);
>>>>>>>>>>>>>>>>> +                                         reval_seq, 
>>>>>>>>>>>>>>>>> &recircs, &reason);
>>>>>>>>>>>>>>>>>                  }
>>>>>>>>>>>>>>>>>                  ukey->dump_seq = dump_seq;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @@ -2916,6 +2944,7 @@ revalidate(struct revalidator 
>>>>>>>>>>>>>>>>> *revalidator)
>>>>>>>>>>>>>>>>>                      udpif_update_flow_pps(udpif, ukey, f);
>>>>>>>>>>>>>>>>>                  }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> +            OVS_USDT_PROBE(revalidate, flow_result, reason, 
>>>>>>>>>>>>>>>>> udpif, ukey);
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have been experimenting with several upcall tracking techniques
>>>>>>>>>>>> that would make it easier to correlate upcalls with their 
>>>>>>>>>>>> subsequent
>>>>>>>>>>>> related events.
>>>>>>>>>>>>> To achieve that, we need (among other things) some easy-to-compare
>>>>>>>>>>>> unique value in the events. For revalidation events, I think a good
>>>>>>>>>>>> candidate would be "ukey->ufid" and so does the script in this 
>>>>>>>>>>>> patch.
>>>>>>>>>>>>>
>>>>>>>>>>>>> However, requiring all external tools to know the layout of 
>>>>>>>>>>>>> "struct
>>>>>>>>>>>> udpif_key" in order to get that value makes things quite 
>>>>>>>>>>>> complicated
>>>>>>>>>>>> for CORE tools (e.g: retis).
>>>>>>>>>>>>>
>>>>>>>>>>>>> With all this, would you consider adding the ufid to probe 
>>>>>>>>>>>>> payload directly?
>>>>>>>>>>>>
>>>>>>>>>>>> Dont see why we can not, but if we need anything else it would 
>>>>>>>>>>>> quickly
>>>>>>>>>>>> explode in too much arguments. I guess CORE needs a good solution 
>>>>>>>>>>>> for
>>>>>>>>>>>> userspace.
>>>>>>>>>>>
>>>>>>>>>>> I think having the ufid and udpif makes sense for now.  Actually, 
>>>>>>>>>>> maybe
>>>>>>>>>>> for the long term we can do something as an appctl command which 
>>>>>>>>>>> will
>>>>>>>>>>> dump the relevant datastructures.  That way a tool like this, which
>>>>>>>>>>> needs the pid anyway, can generate an appctl command against the 
>>>>>>>>>>> daemon
>>>>>>>>>>> and get the exact right header layout.
>>>>>>>>>>>
>>>>>>>>>>> WDYT?
>>>>>>>>>>
>>>>>>>>>> Update on this - I have it ready to go with the exception of this
>>>>>>>>>> particular change.  The USDT probe code generation seems to have a 
>>>>>>>>>> bug,
>>>>>>>>>> or maybe it just won't work using these kinds of data types.  For
>>>>>>>>>> example, the bcc tools gets confused and ends up generating code for 
>>>>>>>>>> the
>>>>>>>>>> probe like:
>>>>>>>>>>
>>>>>>>>>>        __attribute__((always_inline))
>>>>>>>>>> static __always_inline int _bpf_readarg_usdt__flow_result_3(struct
>>>>>>>>> pt_regs *ctx, void *dest, size_t len) {
>>>>>>>>>>          if (len != sizeof(d)) return -1;
>>>>>>>>>> { u64 __addr = ctx->r13 + 40; __asm__ __volatile__("": : :"memory");
>>>>>>>>> uint128_t __res = 0x0; bpf_probe_read_user(&__res, sizeof(__res),
>>>>>>>>> (void *)__addr); *((d *)dest) = __res; }
>>>>>>>>>>          return 0;
>>>>>>>>>>        }
>>>>>>>>>>
>>>>>>>>>> This code isn't valid - the (d) variable isn't properly represented
>>>>>>>>>> here.  I guess maybe there's some issue with the way they use ctypes
>>>>>>>>>> internally?  It isn't clear to me.
>>>>>>>>>
>>>>>>>>> I miss some context, but is it because you point to ufid, which is a
>>>>>>>>> union (typedef union ovs_u128)? What if you point to the u32 member of
>>>>>>>>> the union?
>>>>>>>>
>>>>>>>> I'm not sure, but that doesn't seem like the best approach to me, tbh.
>>>>>>>>
>>>>>>>>>> We could work around this by making the ufid always pass as a void *,
>>>>>>>>>> and passing a length that tells the size.  However, if the underlying
>>>>>>>>>> data type ever changes, it will be totally incompatible.  I doubt it
>>>>>>>>>> would ever change, but at least with something stronger typed, we 
>>>>>>>>>> could
>>>>>>>>>> get a better exception that bubbled up.
>>>>>>>>>
>>>>>>>>> I do not like the pointer + length, as this adds another argument. I
>>>>>>>>> can’t remember which, but some tools had problems with 5 or more
>>>>>>>>> arguments.
>>>>>>>>
>>>>>>>> I do agree.  I also want to test the latest version of bcc-tools also,
>>>>>>>> to see if it really is just related to my f38 version of bcc-tools.
>>>>>>>> That also raises a different issue, but it can be something we solve in
>>>>>>>> the interrim with a version check.
>>>>>>>>
>>>>>>>>>> I'm okay with doing it this way, but just wanted to make sure that
>>>>>>>>>> retis, et. al. would be okay with interpreting it this way.
>>>>>>>>>>
>>>>>>>>>> Alternatively, there's definitely a bug there - we could omit the
>>>>>>>>>> argument entirely and rely on pahole or even pyelftools to help 
>>>>>>>>>> extract
>>>>>>>>>> the struct layout and get the ufid from the udpif_key struct.  The 
>>>>>>>>>> data
>>>>>>>>>> is already being sent, so it seems like duplication a bit to send it,
>>>>>>>>>> and becomes difficult in light of this bcc "bug?"
>>>>>>>>>
>>>>>>>>> I agree it’s overloading arguments with data already present, but I
>>>>>>>>> guess CORE can not yet read debug data from userspace
>>>>>>>>> applications. Adrian, I thought this was being worked on with the
>>>>>>>>> Pahole guys. If this is close we could just leave out this extra
>>>>>>>>> argument for now.
>>>>>>>>
>>>>>>>> That should be something doable, I think.  For example, there are 
>>>>>>>> crates
>>>>>>>> like gimli that parse DWARF data (but not sure if it is parsing v5,
>>>>>>>> yet).
>>>>>>>>
>>>>>>>>> Alternatively, we could ask the BCC guys if this is a known issue,
>>>>>>>>> with maybe a potential workaround. Just post it here
>>>>>>>>> https://github.com/iovisor/bcc/issues the respond quickly.
>>>>>>>>
>>>>>>>> They will tell to run with the latest version, I think.  The version 
>>>>>>>> I'm
>>>>>>>> using is what ships with Fedora 38, and seems quite out of date.
>>>>>>>
>>>>>>> ACK, will wait for your conclusion and the v9.
>>>>>> After tying to get it working with latest bcc-tools, the behavior is
>>>>>> actually much worse.  Whereas before the flow_reval_monitor script would
>>>>>> fail during the bpf compilation (with a clear error message), the latest
>>>>>> BCC tools will crash the python interpreter.
>>>>>> I'm going to omit the ufid details.  I think CO-RE will have support
>>>>>> anyway for this kinds of output, so v9 I will keep the args as-is and
>>>>>> when the support gets added to decode the debuginfo packages to pull
>>>>>> those structures, they will get the support.
>>>>>>
>>>>>
>>>>> Sorry, maybe I missed something. What support will CO-RE have?
>>>>
>>>> I guess, my understanding was the pahole work would be the thing CO-RE
>>>> needs, but after reading your comments, maybe that isn't true.
>>>>
>>>
>>> Doing CO-RE with some BTF generated by pahole from a debug file is just 
>>> possible in theory. It looks doable by using libbpf directly and even 
>>> there, CO-RE was meant to be used on the vmlinux (or module) BTF. I haven't 
>>> tried it yet.
>>> Common ebpf frameworks such as aya, cillim/bpf, etc don't seem to expose 
>>> that
>>> functionality AFAICS.
>>>
>>> So, not putting the ufid makes things difficult for CO-RE tools for the 
>>> forseable future.
>>>
>>> Also, Eelco, could you expand on the 5 argument limitation? I see 
>>> revalidate_ukey__:entry has 6 arguments and seems to work.
>>
>> The only thing I could remember is that I had problems with some tooling, 
>> but can’t recall which. So it might have been related to the tool, and not 
>> the actual tracepoint insertion part.
>>
>>> OTOH, although the ufid is variable-length, could we assume that at most 
>>> it's 128 bits? That's what the uAPI states. If so, could we send a raw 
>>> pointer and assume that length? I don't know any reason why OVS will at 
>>> some point reduce the size but if it does, it's likely that a 128bit buffer 
>>> will still be used (istead of using dynamic memory).
>>
>> I feel like duplicating arguments to make it easier for external tools, 
>> sounds like a bad design principle.
>>
>
> If this was just another function, or even an API, I'd agree.
> But for a tracepoint, exposing all the data that a user might possibly need 
> without requiring them to do struct introspection is not that bad. The 
> usdt:recv_upcall and both kernel tracepoints are some examples.

As it is a debug hook, which could introduce extra code/ache line misses, I 
prefer the need for extra struct introspection at the debugging script. And yes 
the usdt:recv_upcall (and some others) were my own learning mistakes ;) But the 
problem is that once we add them, we should try (we do not guarantee this) to 
keep them stable.

>>> Sorry for insisting, it's just that I would really like to consume this 
>>> from the retis side.
>>
>> How far away are we from dwarf to BTF, which then can be used by CO-RE? 
>> Maybe we should try this first, and if this is a dead-end we can re-visit 
>> and add the extra argument. If we add the argument now, we can not remove it 
>> later.
>>
>
> I'd say we're pretty far. Even if possible, it would only be doable for 
> programs that use libbpf directly, which are very rare (most use cillim/ebpf 
> or some other framework).
>
> Even for retis, I don't think that calling pahole would be an option (other 
> than for a PoC), we would have to cook our own BTF encoder and use gimli to 
> extract DWARF information.

Doesn’t CO-RE require libbpf anyway due to its need for BTF?

However, if we keep making these changes because CO-RE is missing a major 
feature it might never get solved. I think we should figure out how CO-RE could 
solve this for USDTs. I think pahole uses the dwarves libraries, so maybe part 
of the functionality can be moved there.

Maybe we can get the CO-RE people involved and get them to solve this?

Cheers,

Eelco

_______________________________________________
dev mailing list
d...@openvswitch.org
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to