Eelco Chaudron <[email protected]> writes: > On 12 Feb 2024, at 15:15, Aaron Conole wrote: > >> Aaron Conole <[email protected]> writes: >> >>> Eelco Chaudron <[email protected]> writes: >>> >>>> On 2 Feb 2024, at 11:31, Adrian Moreno wrote: >>>> >>>>> On 2/1/24 10:02, Eelco Chaudron wrote: >>>>>> >>>>>> >>>>>> On 31 Jan 2024, at 18:03, Aaron Conole wrote: >>>>>> >>>>>>> Eelco Chaudron <[email protected]> writes: >>>>>>> >>>>>>>> On 25 Jan 2024, at 21:55, Aaron Conole wrote: >>>>>>>> >>>>>>>>> From: Kevin Sprague <[email protected]> >>>>>>>>> >>>>>>>>> During normal operations, it is useful to understand when a >>>>>>>>> particular flow >>>>>>>>> gets removed from the system. This can be useful when debugging >>>>>>>>> performance >>>>>>>>> issues tied to ofproto flow changes, trying to determine deployed >>>>>>>>> traffic >>>>>>>>> patterns, or while debugging dynamic systems where ports come and go. >>>>>>>>> >>>>>>>>> Prior to this change, there was a lack of visibility around flow >>>>>>>>> expiration. >>>>>>>>> The existing debugging infrastructure could tell us when a flow was >>>>>>>>> added to >>>>>>>>> the datapath, but not when it was removed or why. >>>>>>>>> >>>>>>>>> This change introduces a USDT probe at the point where the revalidator >>>>>>>>> determines that the flow should be removed. Additionally, we track >>>>>>>>> the >>>>>>>>> reason for the flow eviction and provide that information as well. >>>>>>>>> With >>>>>>>>> this change, we can track the complete flow lifecycle for the netlink >>>>>>>>> datapath by hooking the upcall tracepoint in kernel, the flow put >>>>>>>>> USDT, and >>>>>>>>> the revaldiator USDT, letting us watch as flows are added and removed >>>>>>>>> from >>>>>>>>> the kernel datapath. >>>>>>>>> >>>>>>>>> This change only enables this information via USDT probe, so it won't >>>>>>>>> be >>>>>>>>> possible to access this information any other way (see: >>>>>>>>> Documentation/topics/usdt-probes.rst). >>>>>>>>> >>>>>>>>> Also included is a script >>>>>>>>> (utilities/usdt-scripts/flow_reval_monitor.py) >>>>>>>>> which serves as a demonstration of how the new USDT probe might be >>>>>>>>> used >>>>>>>>> going forward. >>>>>>>>> >>>>>>>>> Signed-off-by: Kevin Sprague <[email protected]> >>>>>>>>> Co-authored-by: Aaron Conole <[email protected]> >>>>>>>>> Signed-off-by: Aaron Conole <[email protected]> >>>>>>>> >>>>>>>> Thanks for following this up Aaron! See comments on this patch >>>> below. I have no additional comments on patch 2. >>>>>>>> >>>>>>>> Cheers, >>>>>>>> >>>>>>>> Eelco >>>>>>>> >>>>>>>> >>>>>>>>> --- >>>>>>>>> Documentation/topics/usdt-probes.rst | 1 + >>>>>>>>> ofproto/ofproto-dpif-upcall.c | 42 +- >>>>>>>>> utilities/automake.mk | 3 + >>>>>>>>> utilities/usdt-scripts/flow_reval_monitor.py | 653 >>>>>>>>> +++++++++++++++++++ >>>>>>>>> 4 files changed, 693 insertions(+), 6 deletions(-) >>>>>>>>> create mode 100755 utilities/usdt-scripts/flow_reval_monitor.py >>>>>>>>> >>>>>>>>> diff --git a/Documentation/topics/usdt-probes.rst >>>>>>>>> b/Documentation/topics/usdt-probes.rst >>>>>>>>> index e527f43bab..a8da9bb1f7 100644 >>>>>>>>> --- a/Documentation/topics/usdt-probes.rst >>>>>>>>> +++ b/Documentation/topics/usdt-probes.rst >>>>>>>>> @@ -214,6 +214,7 @@ Available probes in ``ovs_vswitchd``: >>>>>>>>> - dpif_recv:recv_upcall >>>>>>>>> - main:poll_block >>>>>>>>> - main:run_start >>>>>>>>> +- revalidate:flow_result >>>>>>>>> - revalidate_ukey\_\_:entry >>>>>>>>> - revalidate_ukey\_\_:exit >>>>>>>>> - udpif_revalidator:start_dump >>>>>>>> >>>>>>>> You are missing the specific flow_result result section. This is from >>>>>>>> the previous patch: >>>>>>> >>>>>>> D'oh! Thanks for catching it. I'll re-add it. >>>>>>> >>>>>>>> @@ -358,6 +360,27 @@ See also the ``main:run_start`` probe above. >>>>>>>> - ``utilities/usdt-scripts/bridge_loop.bt`` >>>>>>>> >>>>>>>> >>>>>>>> +probe revalidate:flow_result >>>>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>>>>>> + >>>>>>>> +**Description**: >>>>>>>> +This probe is triggered when the revalidator decides whether or not to >>>>>>>> +revalidate a flow. ``reason`` is an enum that denotes that either the >>>>>>>> flow >>>>>>>> +is being kept, or the reason why the flow is being deleted. The >>>>>>>> +``flow_reval_monitor.py`` script uses this probe to notify users when >>>>>>>> flows >>>>>>>> +matching user-provided criteria are deleted. >>>>>>>> + >>>>>>>> +**Arguments**: >>>>>>>> + >>>>>>>> +- *arg0*: ``(enum flow_del_reason) reason`` >>>>>>>> +- *arg1*: ``(struct udpif *) udpif`` >>>>>>>> +- *arg2*: ``(struct udpif_key *) ukey`` >>>>>>>> + >>>>>>>> +**Script references**: >>>>>>>> + >>>>>>>> +- ``utilities/usdt-scripts/flow_reval_monitor.py`` >>>>>>>> + >>>>>>>> + >>>>>>>> Adding your own probes >>>>>>>> ---------------------- >>>>>>>> >>>>>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c >>>>>>>>> b/ofproto/ofproto-dpif-upcall.c >>>>>>>>> index b5cbeed878..97d75833f7 100644 >>>>>>>>> --- a/ofproto/ofproto-dpif-upcall.c >>>>>>>>> +++ b/ofproto/ofproto-dpif-upcall.c >>>>>>>>> @@ -269,6 +269,18 @@ enum ukey_state { >>>>>>>>> }; >>>>>>>>> #define N_UKEY_STATES (UKEY_DELETED + 1) >>>>>>>>> >>>>>>>>> +enum flow_del_reason { >>>>>>>>> + FDR_REVALIDATE = 0, /* The flow was revalidated. */ >>>>>>>> >>>>>>>> It was called FDR_FLOW_LIVE before, which might make more >>>> sense. As the flow is just NOT deleted. It might or might not have >>>> been revalidated. Thoughts? >>>>>>> >>>>>>> I think it had to have been revalidated if we emit the reason, because >>>>>>> we only emit the reason code after revalidation. IE: there are many >>>>>>> places where we skip revalidation but the flow stays live - and we don't >>>>>>> emit reasons in those cases. >>>>>>> >>>>>>> So at least for this patch, it MUST have been revalidated. But maybe in >>>>>>> the future, we would want to catch cases where the flow hasn't been. In >>>>>>> that case, it makes sense to add the FDR_FLOW_LIVE at that time - I >>>>>>> think. >>>>>>> >>>>>>> Maybe you disagree? >>>>>> >>>>>> Well, it depends on how you define revalidation, it might only have >>>> updated the counters. i.e. it all depends on ‘bool need_revalidate = >>>> ukey->reval_seq != reval_seq;’ in revalidate_ukey(). That was why I >>>> opted for a more general name. >>>>>> >>>>>>>>> + FDR_FLOW_IDLE, /* The flow went unused and was deleted. >>>>>>>>> */ >>>>>>>>> + FDR_TOO_EXPENSIVE, /* The flow was too expensive to >>>>>>>>> revalidate. */ >>>>>>>>> + FDR_FLOW_WILDCARDED, /* The flow needed a narrower wildcard >>>>>>>>> mask. */ >>>>>>>>> + FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow fit. */ >>>>>>>>> + FDR_NO_OFPROTO, /* The flow didn't have an associated >>>>>>>>> ofproto. */ >>>>>>>>> + FDR_XLATION_ERROR, /* There was an error translating the >>>>>>>>> flow. */ >>>>>>>>> + FDR_AVOID_CACHING, /* Flow deleted to avoid caching. */ >>>>>>>>> + FDR_FLOW_LIMIT, /* All flows being killed. */ >>>>>>>> >>>>>>>> Looking at the comment from Han on FDR_PURGE, and this patch >>>> needing another spin, we should probably add it. >>>>>>> >>>>>>> I can do that, sure. In that case, we will need to have a new flow op >>>>>>> added to revalidator_sweep__ so that we can catch it. But in that case, >>>>>>> it will be a different usdt probe, so I still don't know if we need >>>>>>> FDR_PURGE right? WDYT? >>>>>> >>>>>> In revalidator_sweep__() you have sort of the following: >>>>>> >>>>>> if (purge || ukey_state == UKEY_INCONSISTENT) { >>>>>> result = UKEY_DELETE; >>>>>> } else if (!seq_mismatch) { >>>>>> >>>>>> And I’m afraid that if we use this tool to debug we miss the >>>> ukey_state == UKEY_INCONSISTENT when debugging and spent a long time >>>> figuring this out. >>>>>> Maybe add something general like this (did not give it a lot of >>>> thought), and only take the FDR_PURGE : FDR_UPDATE_FAIL results in the >>>> script? >>>>>> >>>>>> /* 'udpif_key's are responsible for tracking the little bit of state >>>>>> udpif >>>>>> @@ -2991,13 +2993,13 @@ revalidator_sweep__(struct revalidator >>>>>> *revalidator, bool purge) >>>>>> uint64_t odp_actions_stub[1024 / 8]; >>>>>> struct ofpbuf odp_actions = >>>>>> OFPBUF_STUB_INITIALIZER(odp_actions_stub); >>>>>> >>>>>> - enum flow_del_reason reason = FDR_REVALIDATE; >>>>>> struct ukey_op ops[REVALIDATE_MAX_BATCH]; >>>>>> struct udpif_key *ukey; >>>>>> struct umap *umap = &udpif->ukeys[i]; >>>>>> size_t n_ops = 0; >>>>>> >>>>>> CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) { >>>>>> + enum flow_del_reason reason = FDR_REVALIDATE; >>>>>> enum ukey_state ukey_state; >>>>>> >>>>>> /* Handler threads could be holding a ukey lock while it >>>>>> installs a >>>>>> @@ -3016,8 +3018,10 @@ revalidator_sweep__(struct revalidator >>>>>> *revalidator, bool purge) >>>>>> >>>>>> if (purge || ukey_state == UKEY_INCONSISTENT) { >>>>>> result = UKEY_DELETE; >>>>>> + reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL; >>>>>> } else if (!seq_mismatch) { >>>>>> result = UKEY_KEEP; >>>>>> + reason = FDR_REVALIDATE; //_KEEP >>>>>> } else { >>>>>> struct dpif_flow_stats stats; >>>>>> COVERAGE_INC(revalidate_missed_dp_flow); >>>>>> @@ -3030,6 +3034,8 @@ revalidator_sweep__(struct revalidator >>>>>> *revalidator, bool purge) >>>>>> reval_op_init(&ops[n_ops++], result, udpif, ukey, >>>>>> &recircs, >>>>>> &odp_actions); >>>>>> } >>>>>> + OVS_USDT_PROBE(revalidator_sweep__, flow_result, result, >>>>>> + reason, udpif, ukey); >>>>>> } >>>>>> ovs_mutex_unlock(&ukey->mutex); >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> In addition in revalidator_sweep__() should the “enum >>>> flow_del_reason reason = FDR_REVALIDATE;” not be moved to the >>>> CMAP_FOR_EACH() loop? >>>>>> >>>>>> >>>>>>>>> +}; >>>>>>>>> + >>>>>>>>> /* 'udpif_key's are responsible for tracking the little bit of >>>>>>>>> state udpif >>>>>>>>> * needs to do flow expiration which can't be pulled directly from >>>>>>>>> the >>>>>>>>> * datapath. They may be created by any handler or revalidator >>>>>>>>> thread at any >>>>>>>>> @@ -2272,7 +2284,8 @@ populate_xcache(struct udpif *udpif, struct >>>>>>>>> udpif_key *ukey, >>>>>>>>> static enum reval_result >>>>>>>>> revalidate_ukey__(struct udpif *udpif, const struct udpif_key *ukey, >>>>>>>>> uint16_t tcp_flags, struct ofpbuf *odp_actions, >>>>>>>>> - struct recirc_refs *recircs, struct xlate_cache >>>>>>>>> *xcache) >>>>>>>>> + struct recirc_refs *recircs, struct xlate_cache >>>>>>>>> *xcache, >>>>>>>>> + enum flow_del_reason *reason) >>>>>>>>> { >>>>>>>>> struct xlate_out *xoutp; >>>>>>>>> struct netflow *netflow; >>>>>>>>> @@ -2293,11 +2306,13 @@ revalidate_ukey__(struct udpif *udpif, >>>> const struct udpif_key *ukey, >>>>>>>>> netflow = NULL; >>>>>>>>> >>>>>>>>> if (xlate_ukey(udpif, ukey, tcp_flags, &ctx)) { >>>>>>>>> + *reason = FDR_XLATION_ERROR; >>>>>>>>> goto exit; >>>>>>>>> } >>>>>>>>> xoutp = &ctx.xout; >>>>>>>>> >>>>>>>>> if (xoutp->avoid_caching) { >>>>>>>>> + *reason = FDR_AVOID_CACHING; >>>>>>>>> goto exit; >>>>>>>>> } >>>>>>>>> >>>>>>>>> @@ -2311,6 +2326,7 @@ revalidate_ukey__(struct udpif *udpif, >>> const struct udpif_key *ukey, >>>>>>>>> ofpbuf_clear(odp_actions); >>>>>>>>> >>>>>>>>> if (!ofproto) { >>>>>>>>> + *reason = FDR_NO_OFPROTO; >>>>>>>>> goto exit; >>>>>>>>> } >>>>>>>>> >>>>>>>>> @@ -2322,6 +2338,7 @@ revalidate_ukey__(struct udpif *udpif, >>> const struct udpif_key *ukey, >>>>>>>>> if (odp_flow_key_to_mask(ukey->mask, ukey->mask_len, &dp_mask, >>>>>>>>> &ctx.flow, >>>>>>>>> NULL) >>>>>>>>> == ODP_FIT_ERROR) { >>>>>>>>> + *reason = FDR_BAD_ODP_FIT; >>>>>>>>> goto exit; >>>>>>>>> } >>>>>>>>> >>>>>>>>> @@ -2331,6 +2348,7 @@ revalidate_ukey__(struct udpif *udpif, >>> const struct udpif_key *ukey, >>>>>>>>> * down. Note that we do not know if the datapath has ignored >>>>>>>>> any of the >>>>>>>>> * wildcarded bits, so we may be overly conservative here. */ >>>>>>>>> if (flow_wildcards_has_extra(&dp_mask, ctx.wc)) { >>>>>>>>> + *reason = FDR_FLOW_WILDCARDED; >>>>>>>>> goto exit; >>>>>>>>> } >>>>>>>>> >>>>>>>>> @@ -2400,7 +2418,7 @@ static enum reval_result >>>>>>>>> revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey, >>>>>>>>> const struct dpif_flow_stats *stats, >>>>>>>>> struct ofpbuf *odp_actions, uint64_t reval_seq, >>>>>>>>> - struct recirc_refs *recircs) >>>>>>>>> + struct recirc_refs *recircs, enum flow_del_reason >>>>>>>>> *reason) >>>>>>>>> OVS_REQUIRES(ukey->mutex) >>>>>>>>> { >>>>>>>>> bool need_revalidate = ukey->reval_seq != reval_seq; >>>>>>>>> @@ -2430,8 +2448,12 @@ revalidate_ukey(struct udpif *udpif, struct >>>>>>>>> udpif_key *ukey, >>>>>>>>> xlate_cache_clear(ukey->xcache); >>>>>>>>> } >>>>>>>>> result = revalidate_ukey__(udpif, ukey, push.tcp_flags, >>>>>>>>> - odp_actions, recircs, >>>>>>>>> ukey->xcache); >>>>>>>>> - } /* else delete; too expensive to revalidate */ >>>>>>>>> + odp_actions, recircs, >>>>>>>>> ukey->xcache, >>>>>>>>> + reason); >>>>>>>>> + } else { >>>>>>>>> + /* delete; too expensive to revalidate */ >>>>>>>>> + *reason = FDR_TOO_EXPENSIVE; >>>>>>>>> + } >>>>>>>>> } else if (!push.n_packets || ukey->xcache >>>>>>>>> || !populate_xcache(udpif, ukey, push.tcp_flags)) { >>>>>>>>> result = UKEY_KEEP; >>>>>>>>> @@ -2831,6 +2853,7 @@ revalidate(struct revalidator *revalidator) >>>>>>>>> for (f = flows; f < &flows[n_dumped]; f++) { >>>>>>>>> long long int used = f->stats.used; >>>>>>>>> struct recirc_refs recircs = >>>>>>>>> RECIRC_REFS_EMPTY_INITIALIZER; >>>>>>>>> + enum flow_del_reason reason = FDR_REVALIDATE; >>>>>>>>> struct dpif_flow_stats stats = f->stats; >>>>>>>>> enum reval_result result; >>>>>>>>> struct udpif_key *ukey; >>>>>>>>> @@ -2905,9 +2928,14 @@ revalidate(struct revalidator *revalidator) >>>>>>>>> } >>>>>>>>> if (kill_them_all || (used && used < now - max_idle)) { >>>>>>>>> result = UKEY_DELETE; >>>>>>>>> + if (kill_them_all) { >>>>>>>>> + reason = FDR_FLOW_LIMIT; >>>>>>>>> + } else { >>>>>>>>> + reason = FDR_FLOW_IDLE; >>>>>>>>> + } >>>>>>>>> } else { >>>>>>>>> result = revalidate_ukey(udpif, ukey, &stats, >>>>>>>>> &odp_actions, >>>>>>>>> - reval_seq, &recircs); >>>>>>>>> + reval_seq, &recircs, >>>>>>>>> &reason); >>>>>>>>> } >>>>>>>>> ukey->dump_seq = dump_seq; >>>>>>>>> >>>>>>>>> @@ -2916,6 +2944,7 @@ revalidate(struct revalidator *revalidator) >>>>>>>>> udpif_update_flow_pps(udpif, ukey, f); >>>>>>>>> } >>>>>>>>> >>>>>>>>> + OVS_USDT_PROBE(revalidate, flow_result, reason, udpif, >>>>>>>>> ukey); >>>>> >>>>> I have been experimenting with several upcall tracking techniques >>>> that would make it easier to correlate upcalls with their subsequent >>>> related events. >>>>> To achieve that, we need (among other things) some easy-to-compare >>>> unique value in the events. For revalidation events, I think a good >>>> candidate would be "ukey->ufid" and so does the script in this patch. >>>>> >>>>> However, requiring all external tools to know the layout of "struct >>>> udpif_key" in order to get that value makes things quite complicated >>>> for CORE tools (e.g: retis). >>>>> >>>>> With all this, would you consider adding the ufid to probe payload >>>>> directly? >>>> >>>> Dont see why we can not, but if we need anything else it would quickly >>>> explode in too much arguments. I guess CORE needs a good solution for >>>> userspace. >>> >>> I think having the ufid and udpif makes sense for now. Actually, maybe >>> for the long term we can do something as an appctl command which will >>> dump the relevant datastructures. That way a tool like this, which >>> needs the pid anyway, can generate an appctl command against the daemon >>> and get the exact right header layout. >>> >>> WDYT? >> >> Update on this - I have it ready to go with the exception of this >> particular change. The USDT probe code generation seems to have a bug, >> or maybe it just won't work using these kinds of data types. For >> example, the bcc tools gets confused and ends up generating code for the >> probe like: >> >> __attribute__((always_inline)) >> static __always_inline int _bpf_readarg_usdt__flow_result_3(struct > pt_regs *ctx, void *dest, size_t len) { >> if (len != sizeof(d)) return -1; >> { u64 __addr = ctx->r13 + 40; __asm__ __volatile__("": : :"memory"); > uint128_t __res = 0x0; bpf_probe_read_user(&__res, sizeof(__res), > (void *)__addr); *((d *)dest) = __res; } >> return 0; >> } >> >> This code isn't valid - the (d) variable isn't properly represented >> here. I guess maybe there's some issue with the way they use ctypes >> internally? It isn't clear to me. > > I miss some context, but is it because you point to ufid, which is a > union (typedef union ovs_u128)? What if you point to the u32 member of > the union?
I'm not sure, but that doesn't seem like the best approach to me, tbh. >> We could work around this by making the ufid always pass as a void *, >> and passing a length that tells the size. However, if the underlying >> data type ever changes, it will be totally incompatible. I doubt it >> would ever change, but at least with something stronger typed, we could >> get a better exception that bubbled up. > > I do not like the pointer + length, as this adds another argument. I > can’t remember which, but some tools had problems with 5 or more > arguments. I do agree. I also want to test the latest version of bcc-tools also, to see if it really is just related to my f38 version of bcc-tools. That also raises a different issue, but it can be something we solve in the interrim with a version check. >> I'm okay with doing it this way, but just wanted to make sure that >> retis, et. al. would be okay with interpreting it this way. >> >> Alternatively, there's definitely a bug there - we could omit the >> argument entirely and rely on pahole or even pyelftools to help extract >> the struct layout and get the ufid from the udpif_key struct. The data >> is already being sent, so it seems like duplication a bit to send it, >> and becomes difficult in light of this bcc "bug?" > > I agree it’s overloading arguments with data already present, but I > guess CORE can not yet read debug data from userspace > applications. Adrian, I thought this was being worked on with the > Pahole guys. If this is close we could just leave out this extra > argument for now. That should be something doable, I think. For example, there are crates like gimli that parse DWARF data (but not sure if it is parsing v5, yet). > Alternatively, we could ask the BCC guys if this is a known issue, > with maybe a potential workaround. Just post it here > https://github.com/iovisor/bcc/issues the respond quickly. They will tell to run with the latest version, I think. The version I'm using is what ships with Fedora 38, and seems quite out of date. > <SNIP> _______________________________________________ dev mailing list [email protected] https://mail.openvswitch.org/mailman/listinfo/ovs-dev
