On 19 Feb 2024, at 19:57, Aaron Conole wrote:
> Eelco Chaudron <echau...@redhat.com> writes: > >> On 12 Feb 2024, at 15:15, Aaron Conole wrote: >> >>> Aaron Conole <acon...@redhat.com> writes: >>> >>>> Eelco Chaudron <echau...@redhat.com> writes: >>>> >>>>> On 2 Feb 2024, at 11:31, Adrian Moreno wrote: >>>>> >>>>>> On 2/1/24 10:02, Eelco Chaudron wrote: >>>>>>> >>>>>>> >>>>>>> On 31 Jan 2024, at 18:03, Aaron Conole wrote: >>>>>>> >>>>>>>> Eelco Chaudron <echau...@redhat.com> writes: >>>>>>>> >>>>>>>>> On 25 Jan 2024, at 21:55, Aaron Conole wrote: >>>>>>>>> >>>>>>>>>> From: Kevin Sprague <ksprague0...@gmail.com> >>>>>>>>>> >>>>>>>>>> During normal operations, it is useful to understand when a >>>>>>>>>> particular flow >>>>>>>>>> gets removed from the system. This can be useful when debugging >>>>>>>>>> performance >>>>>>>>>> issues tied to ofproto flow changes, trying to determine deployed >>>>>>>>>> traffic >>>>>>>>>> patterns, or while debugging dynamic systems where ports come and go. >>>>>>>>>> >>>>>>>>>> Prior to this change, there was a lack of visibility around flow >>>>>>>>>> expiration. >>>>>>>>>> The existing debugging infrastructure could tell us when a flow was >>>>>>>>>> added to >>>>>>>>>> the datapath, but not when it was removed or why. >>>>>>>>>> >>>>>>>>>> This change introduces a USDT probe at the point where the >>>>>>>>>> revalidator >>>>>>>>>> determines that the flow should be removed. Additionally, we track >>>>>>>>>> the >>>>>>>>>> reason for the flow eviction and provide that information as well. >>>>>>>>>> With >>>>>>>>>> this change, we can track the complete flow lifecycle for the netlink >>>>>>>>>> datapath by hooking the upcall tracepoint in kernel, the flow put >>>>>>>>>> USDT, and >>>>>>>>>> the revaldiator USDT, letting us watch as flows are added and >>>>>>>>>> removed from >>>>>>>>>> the kernel datapath. >>>>>>>>>> >>>>>>>>>> This change only enables this information via USDT probe, so it >>>>>>>>>> won't be >>>>>>>>>> possible to access this information any other way (see: >>>>>>>>>> Documentation/topics/usdt-probes.rst). >>>>>>>>>> >>>>>>>>>> Also included is a script >>>>>>>>>> (utilities/usdt-scripts/flow_reval_monitor.py) >>>>>>>>>> which serves as a demonstration of how the new USDT probe might be >>>>>>>>>> used >>>>>>>>>> going forward. >>>>>>>>>> >>>>>>>>>> Signed-off-by: Kevin Sprague <ksprague0...@gmail.com> >>>>>>>>>> Co-authored-by: Aaron Conole <acon...@redhat.com> >>>>>>>>>> Signed-off-by: Aaron Conole <acon...@redhat.com> >>>>>>>>> >>>>>>>>> Thanks for following this up Aaron! See comments on this patch >>>>> below. I have no additional comments on patch 2. >>>>>>>>> >>>>>>>>> Cheers, >>>>>>>>> >>>>>>>>> Eelco >>>>>>>>> >>>>>>>>> >>>>>>>>>> --- >>>>>>>>>> Documentation/topics/usdt-probes.rst | 1 + >>>>>>>>>> ofproto/ofproto-dpif-upcall.c | 42 +- >>>>>>>>>> utilities/automake.mk | 3 + >>>>>>>>>> utilities/usdt-scripts/flow_reval_monitor.py | 653 >>>>>>>>>> +++++++++++++++++++ >>>>>>>>>> 4 files changed, 693 insertions(+), 6 deletions(-) >>>>>>>>>> create mode 100755 utilities/usdt-scripts/flow_reval_monitor.py >>>>>>>>>> >>>>>>>>>> diff --git a/Documentation/topics/usdt-probes.rst >>>>>>>>>> b/Documentation/topics/usdt-probes.rst >>>>>>>>>> index e527f43bab..a8da9bb1f7 100644 >>>>>>>>>> --- a/Documentation/topics/usdt-probes.rst >>>>>>>>>> +++ b/Documentation/topics/usdt-probes.rst >>>>>>>>>> @@ -214,6 +214,7 @@ Available probes in ``ovs_vswitchd``: >>>>>>>>>> - dpif_recv:recv_upcall >>>>>>>>>> - main:poll_block >>>>>>>>>> - main:run_start >>>>>>>>>> +- revalidate:flow_result >>>>>>>>>> - revalidate_ukey\_\_:entry >>>>>>>>>> - revalidate_ukey\_\_:exit >>>>>>>>>> - udpif_revalidator:start_dump >>>>>>>>> >>>>>>>>> You are missing the specific flow_result result section. This is from >>>>>>>>> the previous patch: >>>>>>>> >>>>>>>> D'oh! Thanks for catching it. I'll re-add it. >>>>>>>> >>>>>>>>> @@ -358,6 +360,27 @@ See also the ``main:run_start`` probe above. >>>>>>>>> - ``utilities/usdt-scripts/bridge_loop.bt`` >>>>>>>>> >>>>>>>>> >>>>>>>>> +probe revalidate:flow_result >>>>>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>>>>>>> + >>>>>>>>> +**Description**: >>>>>>>>> +This probe is triggered when the revalidator decides whether or not >>>>>>>>> to >>>>>>>>> +revalidate a flow. ``reason`` is an enum that denotes that either >>>>>>>>> the flow >>>>>>>>> +is being kept, or the reason why the flow is being deleted. The >>>>>>>>> +``flow_reval_monitor.py`` script uses this probe to notify users >>>>>>>>> when flows >>>>>>>>> +matching user-provided criteria are deleted. >>>>>>>>> + >>>>>>>>> +**Arguments**: >>>>>>>>> + >>>>>>>>> +- *arg0*: ``(enum flow_del_reason) reason`` >>>>>>>>> +- *arg1*: ``(struct udpif *) udpif`` >>>>>>>>> +- *arg2*: ``(struct udpif_key *) ukey`` >>>>>>>>> + >>>>>>>>> +**Script references**: >>>>>>>>> + >>>>>>>>> +- ``utilities/usdt-scripts/flow_reval_monitor.py`` >>>>>>>>> + >>>>>>>>> + >>>>>>>>> Adding your own probes >>>>>>>>> ---------------------- >>>>>>>>> >>>>>>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c >>>>>>>>>> b/ofproto/ofproto-dpif-upcall.c >>>>>>>>>> index b5cbeed878..97d75833f7 100644 >>>>>>>>>> --- a/ofproto/ofproto-dpif-upcall.c >>>>>>>>>> +++ b/ofproto/ofproto-dpif-upcall.c >>>>>>>>>> @@ -269,6 +269,18 @@ enum ukey_state { >>>>>>>>>> }; >>>>>>>>>> #define N_UKEY_STATES (UKEY_DELETED + 1) >>>>>>>>>> >>>>>>>>>> +enum flow_del_reason { >>>>>>>>>> + FDR_REVALIDATE = 0, /* The flow was revalidated. */ >>>>>>>>> >>>>>>>>> It was called FDR_FLOW_LIVE before, which might make more >>>>> sense. As the flow is just NOT deleted. It might or might not have >>>>> been revalidated. Thoughts? >>>>>>>> >>>>>>>> I think it had to have been revalidated if we emit the reason, because >>>>>>>> we only emit the reason code after revalidation. IE: there are many >>>>>>>> places where we skip revalidation but the flow stays live - and we >>>>>>>> don't >>>>>>>> emit reasons in those cases. >>>>>>>> >>>>>>>> So at least for this patch, it MUST have been revalidated. But maybe >>>>>>>> in >>>>>>>> the future, we would want to catch cases where the flow hasn't been. >>>>>>>> In >>>>>>>> that case, it makes sense to add the FDR_FLOW_LIVE at that time - I >>>>>>>> think. >>>>>>>> >>>>>>>> Maybe you disagree? >>>>>>> >>>>>>> Well, it depends on how you define revalidation, it might only have >>>>> updated the counters. i.e. it all depends on ‘bool need_revalidate = >>>>> ukey->reval_seq != reval_seq;’ in revalidate_ukey(). That was why I >>>>> opted for a more general name. >>>>>>> >>>>>>>>>> + FDR_FLOW_IDLE, /* The flow went unused and was >>>>>>>>>> deleted. */ >>>>>>>>>> + FDR_TOO_EXPENSIVE, /* The flow was too expensive to >>>>>>>>>> revalidate. */ >>>>>>>>>> + FDR_FLOW_WILDCARDED, /* The flow needed a narrower wildcard >>>>>>>>>> mask. */ >>>>>>>>>> + FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow fit. */ >>>>>>>>>> + FDR_NO_OFPROTO, /* The flow didn't have an associated >>>>>>>>>> ofproto. */ >>>>>>>>>> + FDR_XLATION_ERROR, /* There was an error translating the >>>>>>>>>> flow. */ >>>>>>>>>> + FDR_AVOID_CACHING, /* Flow deleted to avoid caching. */ >>>>>>>>>> + FDR_FLOW_LIMIT, /* All flows being killed. */ >>>>>>>>> >>>>>>>>> Looking at the comment from Han on FDR_PURGE, and this patch >>>>> needing another spin, we should probably add it. >>>>>>>> >>>>>>>> I can do that, sure. In that case, we will need to have a new flow op >>>>>>>> added to revalidator_sweep__ so that we can catch it. But in that >>>>>>>> case, >>>>>>>> it will be a different usdt probe, so I still don't know if we need >>>>>>>> FDR_PURGE right? WDYT? >>>>>>> >>>>>>> In revalidator_sweep__() you have sort of the following: >>>>>>> >>>>>>> if (purge || ukey_state == UKEY_INCONSISTENT) { >>>>>>> result = UKEY_DELETE; >>>>>>> } else if (!seq_mismatch) { >>>>>>> >>>>>>> And I’m afraid that if we use this tool to debug we miss the >>>>> ukey_state == UKEY_INCONSISTENT when debugging and spent a long time >>>>> figuring this out. >>>>>>> Maybe add something general like this (did not give it a lot of >>>>> thought), and only take the FDR_PURGE : FDR_UPDATE_FAIL results in the >>>>> script? >>>>>>> >>>>>>> /* 'udpif_key's are responsible for tracking the little bit of state >>>>>>> udpif >>>>>>> @@ -2991,13 +2993,13 @@ revalidator_sweep__(struct revalidator >>>>>>> *revalidator, bool purge) >>>>>>> uint64_t odp_actions_stub[1024 / 8]; >>>>>>> struct ofpbuf odp_actions = >>>>>>> OFPBUF_STUB_INITIALIZER(odp_actions_stub); >>>>>>> >>>>>>> - enum flow_del_reason reason = FDR_REVALIDATE; >>>>>>> struct ukey_op ops[REVALIDATE_MAX_BATCH]; >>>>>>> struct udpif_key *ukey; >>>>>>> struct umap *umap = &udpif->ukeys[i]; >>>>>>> size_t n_ops = 0; >>>>>>> >>>>>>> CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) { >>>>>>> + enum flow_del_reason reason = FDR_REVALIDATE; >>>>>>> enum ukey_state ukey_state; >>>>>>> >>>>>>> /* Handler threads could be holding a ukey lock while it >>>>>>> installs a >>>>>>> @@ -3016,8 +3018,10 @@ revalidator_sweep__(struct revalidator >>>>>>> *revalidator, bool purge) >>>>>>> >>>>>>> if (purge || ukey_state == UKEY_INCONSISTENT) { >>>>>>> result = UKEY_DELETE; >>>>>>> + reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL; >>>>>>> } else if (!seq_mismatch) { >>>>>>> result = UKEY_KEEP; >>>>>>> + reason = FDR_REVALIDATE; //_KEEP >>>>>>> } else { >>>>>>> struct dpif_flow_stats stats; >>>>>>> COVERAGE_INC(revalidate_missed_dp_flow); >>>>>>> @@ -3030,6 +3034,8 @@ revalidator_sweep__(struct revalidator >>>>>>> *revalidator, bool purge) >>>>>>> reval_op_init(&ops[n_ops++], result, udpif, ukey, >>>>>>> &recircs, >>>>>>> &odp_actions); >>>>>>> } >>>>>>> + OVS_USDT_PROBE(revalidator_sweep__, flow_result, >>>>>>> result, >>>>>>> + reason, udpif, ukey); >>>>>>> } >>>>>>> ovs_mutex_unlock(&ukey->mutex); >>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> In addition in revalidator_sweep__() should the “enum >>>>> flow_del_reason reason = FDR_REVALIDATE;” not be moved to the >>>>> CMAP_FOR_EACH() loop? >>>>>>> >>>>>>> >>>>>>>>>> +}; >>>>>>>>>> + >>>>>>>>>> /* 'udpif_key's are responsible for tracking the little bit of >>>>>>>>>> state udpif >>>>>>>>>> * needs to do flow expiration which can't be pulled directly from >>>>>>>>>> the >>>>>>>>>> * datapath. They may be created by any handler or revalidator >>>>>>>>>> thread at any >>>>>>>>>> @@ -2272,7 +2284,8 @@ populate_xcache(struct udpif *udpif, struct >>>>>>>>>> udpif_key *ukey, >>>>>>>>>> static enum reval_result >>>>>>>>>> revalidate_ukey__(struct udpif *udpif, const struct udpif_key >>>>>>>>>> *ukey, >>>>>>>>>> uint16_t tcp_flags, struct ofpbuf *odp_actions, >>>>>>>>>> - struct recirc_refs *recircs, struct xlate_cache >>>>>>>>>> *xcache) >>>>>>>>>> + struct recirc_refs *recircs, struct xlate_cache >>>>>>>>>> *xcache, >>>>>>>>>> + enum flow_del_reason *reason) >>>>>>>>>> { >>>>>>>>>> struct xlate_out *xoutp; >>>>>>>>>> struct netflow *netflow; >>>>>>>>>> @@ -2293,11 +2306,13 @@ revalidate_ukey__(struct udpif *udpif, >>>>> const struct udpif_key *ukey, >>>>>>>>>> netflow = NULL; >>>>>>>>>> >>>>>>>>>> if (xlate_ukey(udpif, ukey, tcp_flags, &ctx)) { >>>>>>>>>> + *reason = FDR_XLATION_ERROR; >>>>>>>>>> goto exit; >>>>>>>>>> } >>>>>>>>>> xoutp = &ctx.xout; >>>>>>>>>> >>>>>>>>>> if (xoutp->avoid_caching) { >>>>>>>>>> + *reason = FDR_AVOID_CACHING; >>>>>>>>>> goto exit; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> @@ -2311,6 +2326,7 @@ revalidate_ukey__(struct udpif *udpif, >>>> const struct udpif_key *ukey, >>>>>>>>>> ofpbuf_clear(odp_actions); >>>>>>>>>> >>>>>>>>>> if (!ofproto) { >>>>>>>>>> + *reason = FDR_NO_OFPROTO; >>>>>>>>>> goto exit; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> @@ -2322,6 +2338,7 @@ revalidate_ukey__(struct udpif *udpif, >>>> const struct udpif_key *ukey, >>>>>>>>>> if (odp_flow_key_to_mask(ukey->mask, ukey->mask_len, &dp_mask, >>>>>>>>>> &ctx.flow, >>>>>>>>>> NULL) >>>>>>>>>> == ODP_FIT_ERROR) { >>>>>>>>>> + *reason = FDR_BAD_ODP_FIT; >>>>>>>>>> goto exit; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> @@ -2331,6 +2348,7 @@ revalidate_ukey__(struct udpif *udpif, >>>> const struct udpif_key *ukey, >>>>>>>>>> * down. Note that we do not know if the datapath has ignored >>>>>>>>>> any of the >>>>>>>>>> * wildcarded bits, so we may be overly conservative here. */ >>>>>>>>>> if (flow_wildcards_has_extra(&dp_mask, ctx.wc)) { >>>>>>>>>> + *reason = FDR_FLOW_WILDCARDED; >>>>>>>>>> goto exit; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> @@ -2400,7 +2418,7 @@ static enum reval_result >>>>>>>>>> revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey, >>>>>>>>>> const struct dpif_flow_stats *stats, >>>>>>>>>> struct ofpbuf *odp_actions, uint64_t reval_seq, >>>>>>>>>> - struct recirc_refs *recircs) >>>>>>>>>> + struct recirc_refs *recircs, enum flow_del_reason >>>>>>>>>> *reason) >>>>>>>>>> OVS_REQUIRES(ukey->mutex) >>>>>>>>>> { >>>>>>>>>> bool need_revalidate = ukey->reval_seq != reval_seq; >>>>>>>>>> @@ -2430,8 +2448,12 @@ revalidate_ukey(struct udpif *udpif, struct >>>>>>>>>> udpif_key *ukey, >>>>>>>>>> xlate_cache_clear(ukey->xcache); >>>>>>>>>> } >>>>>>>>>> result = revalidate_ukey__(udpif, ukey, push.tcp_flags, >>>>>>>>>> - odp_actions, recircs, >>>>>>>>>> ukey->xcache); >>>>>>>>>> - } /* else delete; too expensive to revalidate */ >>>>>>>>>> + odp_actions, recircs, >>>>>>>>>> ukey->xcache, >>>>>>>>>> + reason); >>>>>>>>>> + } else { >>>>>>>>>> + /* delete; too expensive to revalidate */ >>>>>>>>>> + *reason = FDR_TOO_EXPENSIVE; >>>>>>>>>> + } >>>>>>>>>> } else if (!push.n_packets || ukey->xcache >>>>>>>>>> || !populate_xcache(udpif, ukey, push.tcp_flags)) { >>>>>>>>>> result = UKEY_KEEP; >>>>>>>>>> @@ -2831,6 +2853,7 @@ revalidate(struct revalidator *revalidator) >>>>>>>>>> for (f = flows; f < &flows[n_dumped]; f++) { >>>>>>>>>> long long int used = f->stats.used; >>>>>>>>>> struct recirc_refs recircs = >>>>>>>>>> RECIRC_REFS_EMPTY_INITIALIZER; >>>>>>>>>> + enum flow_del_reason reason = FDR_REVALIDATE; >>>>>>>>>> struct dpif_flow_stats stats = f->stats; >>>>>>>>>> enum reval_result result; >>>>>>>>>> struct udpif_key *ukey; >>>>>>>>>> @@ -2905,9 +2928,14 @@ revalidate(struct revalidator *revalidator) >>>>>>>>>> } >>>>>>>>>> if (kill_them_all || (used && used < now - max_idle)) { >>>>>>>>>> result = UKEY_DELETE; >>>>>>>>>> + if (kill_them_all) { >>>>>>>>>> + reason = FDR_FLOW_LIMIT; >>>>>>>>>> + } else { >>>>>>>>>> + reason = FDR_FLOW_IDLE; >>>>>>>>>> + } >>>>>>>>>> } else { >>>>>>>>>> result = revalidate_ukey(udpif, ukey, &stats, >>>>>>>>>> &odp_actions, >>>>>>>>>> - reval_seq, &recircs); >>>>>>>>>> + reval_seq, &recircs, >>>>>>>>>> &reason); >>>>>>>>>> } >>>>>>>>>> ukey->dump_seq = dump_seq; >>>>>>>>>> >>>>>>>>>> @@ -2916,6 +2944,7 @@ revalidate(struct revalidator *revalidator) >>>>>>>>>> udpif_update_flow_pps(udpif, ukey, f); >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> + OVS_USDT_PROBE(revalidate, flow_result, reason, udpif, >>>>>>>>>> ukey); >>>>>> >>>>>> I have been experimenting with several upcall tracking techniques >>>>> that would make it easier to correlate upcalls with their subsequent >>>>> related events. >>>>>> To achieve that, we need (among other things) some easy-to-compare >>>>> unique value in the events. For revalidation events, I think a good >>>>> candidate would be "ukey->ufid" and so does the script in this patch. >>>>>> >>>>>> However, requiring all external tools to know the layout of "struct >>>>> udpif_key" in order to get that value makes things quite complicated >>>>> for CORE tools (e.g: retis). >>>>>> >>>>>> With all this, would you consider adding the ufid to probe payload >>>>>> directly? >>>>> >>>>> Dont see why we can not, but if we need anything else it would quickly >>>>> explode in too much arguments. I guess CORE needs a good solution for >>>>> userspace. >>>> >>>> I think having the ufid and udpif makes sense for now. Actually, maybe >>>> for the long term we can do something as an appctl command which will >>>> dump the relevant datastructures. That way a tool like this, which >>>> needs the pid anyway, can generate an appctl command against the daemon >>>> and get the exact right header layout. >>>> >>>> WDYT? >>> >>> Update on this - I have it ready to go with the exception of this >>> particular change. The USDT probe code generation seems to have a bug, >>> or maybe it just won't work using these kinds of data types. For >>> example, the bcc tools gets confused and ends up generating code for the >>> probe like: >>> >>> __attribute__((always_inline)) >>> static __always_inline int _bpf_readarg_usdt__flow_result_3(struct >> pt_regs *ctx, void *dest, size_t len) { >>> if (len != sizeof(d)) return -1; >>> { u64 __addr = ctx->r13 + 40; __asm__ __volatile__("": : :"memory"); >> uint128_t __res = 0x0; bpf_probe_read_user(&__res, sizeof(__res), >> (void *)__addr); *((d *)dest) = __res; } >>> return 0; >>> } >>> >>> This code isn't valid - the (d) variable isn't properly represented >>> here. I guess maybe there's some issue with the way they use ctypes >>> internally? It isn't clear to me. >> >> I miss some context, but is it because you point to ufid, which is a >> union (typedef union ovs_u128)? What if you point to the u32 member of >> the union? > > I'm not sure, but that doesn't seem like the best approach to me, tbh. > >>> We could work around this by making the ufid always pass as a void *, >>> and passing a length that tells the size. However, if the underlying >>> data type ever changes, it will be totally incompatible. I doubt it >>> would ever change, but at least with something stronger typed, we could >>> get a better exception that bubbled up. >> >> I do not like the pointer + length, as this adds another argument. I >> can’t remember which, but some tools had problems with 5 or more >> arguments. > > I do agree. I also want to test the latest version of bcc-tools also, > to see if it really is just related to my f38 version of bcc-tools. > That also raises a different issue, but it can be something we solve in > the interrim with a version check. > >>> I'm okay with doing it this way, but just wanted to make sure that >>> retis, et. al. would be okay with interpreting it this way. >>> >>> Alternatively, there's definitely a bug there - we could omit the >>> argument entirely and rely on pahole or even pyelftools to help extract >>> the struct layout and get the ufid from the udpif_key struct. The data >>> is already being sent, so it seems like duplication a bit to send it, >>> and becomes difficult in light of this bcc "bug?" >> >> I agree it’s overloading arguments with data already present, but I >> guess CORE can not yet read debug data from userspace >> applications. Adrian, I thought this was being worked on with the >> Pahole guys. If this is close we could just leave out this extra >> argument for now. > > That should be something doable, I think. For example, there are crates > like gimli that parse DWARF data (but not sure if it is parsing v5, > yet). > >> Alternatively, we could ask the BCC guys if this is a known issue, >> with maybe a potential workaround. Just post it here >> https://github.com/iovisor/bcc/issues the respond quickly. > > They will tell to run with the latest version, I think. The version I'm > using is what ships with Fedora 38, and seems quite out of date. ACK, will wait for your conclusion and the v9. //Eelco >> <SNIP> _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev