On 21 Feb 2024, at 17:45, Adrian Moreno wrote:
> On 2/21/24 16:42, Eelco Chaudron wrote: >> >> >> On 21 Feb 2024, at 16:03, Adrian Moreno wrote: >> >>> On 2/21/24 15:49, Aaron Conole wrote: >>>> Adrian Moreno <amore...@redhat.com> writes: >>>> >>>>> On 2/20/24 19:06, Aaron Conole wrote: >>>>>> Eelco Chaudron <echau...@redhat.com> writes: >>>>>> >>>>>>> On 19 Feb 2024, at 19:57, Aaron Conole wrote: >>>>>>> >>>>>>>> Eelco Chaudron <echau...@redhat.com> writes: >>>>>>>> >>>>>>>>> On 12 Feb 2024, at 15:15, Aaron Conole wrote: >>>>>>>>> >>>>>>>>>> Aaron Conole <acon...@redhat.com> writes: >>>>>>>>>> >>>>>>>>>>> Eelco Chaudron <echau...@redhat.com> writes: >>>>>>>>>>> >>>>>>>>>>>> On 2 Feb 2024, at 11:31, Adrian Moreno wrote: >>>>>>>>>>>> >>>>>>>>>>>>> On 2/1/24 10:02, Eelco Chaudron wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 31 Jan 2024, at 18:03, Aaron Conole wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>>> Eelco Chaudron <echau...@redhat.com> writes: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 25 Jan 2024, at 21:55, Aaron Conole wrote: >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> From: Kevin Sprague <ksprague0...@gmail.com> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> During normal operations, it is useful to understand when a >>>>>>>>>>>>>>>>> particular flow >>>>>>>>>>>>>>>>> gets removed from the system. This can be useful when >>>>>>>>>>>>>>>>> debugging performance >>>>>>>>>>>>>>>>> issues tied to ofproto flow changes, trying to determine >>>>>>>>>>>>>>>>> deployed traffic >>>>>>>>>>>>>>>>> patterns, or while debugging dynamic systems where ports come >>>>>>>>>>>>>>>>> and go. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Prior to this change, there was a lack of visibility around >>>>>>>>>>>>>>>>> flow expiration. >>>>>>>>>>>>>>>>> The existing debugging infrastructure could tell us when a >>>>>>>>>>>>>>>>> flow was added to >>>>>>>>>>>>>>>>> the datapath, but not when it was removed or why. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This change introduces a USDT probe at the point where the >>>>>>>>>>>>>>>>> revalidator >>>>>>>>>>>>>>>>> determines that the flow should be removed. Additionally, we >>>>>>>>>>>>>>>>> track the >>>>>>>>>>>>>>>>> reason for the flow eviction and provide that information as >>>>>>>>>>>>>>>>> well. With >>>>>>>>>>>>>>>>> this change, we can track the complete flow lifecycle for the >>>>>>>>>>>>>>>>> netlink >>>>>>>>>>>>>>>>> datapath by hooking the upcall tracepoint in kernel, the flow >>>>>>>>>>>>>>>>> put USDT, and >>>>>>>>>>>>>>>>> the revaldiator USDT, letting us watch as flows are added and >>>>>>>>>>>>>>>>> removed from >>>>>>>>>>>>>>>>> the kernel datapath. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> This change only enables this information via USDT probe, so >>>>>>>>>>>>>>>>> it won't be >>>>>>>>>>>>>>>>> possible to access this information any other way (see: >>>>>>>>>>>>>>>>> Documentation/topics/usdt-probes.rst). >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Also included is a script >>>>>>>>>>>>>>>>> (utilities/usdt-scripts/flow_reval_monitor.py) >>>>>>>>>>>>>>>>> which serves as a demonstration of how the new USDT probe >>>>>>>>>>>>>>>>> might be used >>>>>>>>>>>>>>>>> going forward. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Signed-off-by: Kevin Sprague <ksprague0...@gmail.com> >>>>>>>>>>>>>>>>> Co-authored-by: Aaron Conole <acon...@redhat.com> >>>>>>>>>>>>>>>>> Signed-off-by: Aaron Conole <acon...@redhat.com> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Thanks for following this up Aaron! See comments on this patch >>>>>>>>>>>> below. I have no additional comments on patch 2. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Cheers, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Eelco >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>> Documentation/topics/usdt-probes.rst | 1 + >>>>>>>>>>>>>>>>> ofproto/ofproto-dpif-upcall.c | 42 +- >>>>>>>>>>>>>>>>> utilities/automake.mk | 3 + >>>>>>>>>>>>>>>>> utilities/usdt-scripts/flow_reval_monitor.py | 653 >>>>>>>>>>>>>>>>> +++++++++++++++++++ >>>>>>>>>>>>>>>>> 4 files changed, 693 insertions(+), 6 deletions(-) >>>>>>>>>>>>>>>>> create mode 100755 >>>>>>>>>>>>>>>>> utilities/usdt-scripts/flow_reval_monitor.py >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> diff --git a/Documentation/topics/usdt-probes.rst >>>>> b/Documentation/topics/usdt-probes.rst >>>>>>>>>>>>>>>>> index e527f43bab..a8da9bb1f7 100644 >>>>>>>>>>>>>>>>> --- a/Documentation/topics/usdt-probes.rst >>>>>>>>>>>>>>>>> +++ b/Documentation/topics/usdt-probes.rst >>>>>>>>>>>>>>>>> @@ -214,6 +214,7 @@ Available probes in ``ovs_vswitchd``: >>>>>>>>>>>>>>>>> - dpif_recv:recv_upcall >>>>>>>>>>>>>>>>> - main:poll_block >>>>>>>>>>>>>>>>> - main:run_start >>>>>>>>>>>>>>>>> +- revalidate:flow_result >>>>>>>>>>>>>>>>> - revalidate_ukey\_\_:entry >>>>>>>>>>>>>>>>> - revalidate_ukey\_\_:exit >>>>>>>>>>>>>>>>> - udpif_revalidator:start_dump >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> You are missing the specific flow_result result >>>>> section. This is from the previous patch: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> D'oh! Thanks for catching it. I'll re-add it. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> @@ -358,6 +360,27 @@ See also the ``main:run_start`` probe >>>>>>>>>>>>>>>> above. >>>>>>>>>>>>>>>> - ``utilities/usdt-scripts/bridge_loop.bt`` >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> +probe revalidate:flow_result >>>>>>>>>>>>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> +**Description**: >>>>>>>>>>>>>>>> +This probe is triggered when the revalidator decides whether >>>>>>>>>>>>>>>> or not to >>>>>>>>>>>>>>>> +revalidate a flow. ``reason`` is an enum that denotes that >>>>>>>>>>>>>>>> either the flow >>>>>>>>>>>>>>>> +is being kept, or the reason why the flow is being deleted. >>>>>>>>>>>>>>>> The >>>>>>>>>>>>>>>> +``flow_reval_monitor.py`` script uses this probe to notify >>>>>>>>>>>>>>>> users when flows >>>>>>>>>>>>>>>> +matching user-provided criteria are deleted. >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> +**Arguments**: >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> +- *arg0*: ``(enum flow_del_reason) reason`` >>>>>>>>>>>>>>>> +- *arg1*: ``(struct udpif *) udpif`` >>>>>>>>>>>>>>>> +- *arg2*: ``(struct udpif_key *) ukey`` >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> +**Script references**: >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> +- ``utilities/usdt-scripts/flow_reval_monitor.py`` >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>> Adding your own probes >>>>>>>>>>>>>>>> ---------------------- >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c >>>>>>>>>>>>>>>>> b/ofproto/ofproto-dpif-upcall.c >>>>>>>>>>>>>>>>> index b5cbeed878..97d75833f7 100644 >>>>>>>>>>>>>>>>> --- a/ofproto/ofproto-dpif-upcall.c >>>>>>>>>>>>>>>>> +++ b/ofproto/ofproto-dpif-upcall.c >>>>>>>>>>>>>>>>> @@ -269,6 +269,18 @@ enum ukey_state { >>>>>>>>>>>>>>>>> }; >>>>>>>>>>>>>>>>> #define N_UKEY_STATES (UKEY_DELETED + 1) >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> +enum flow_del_reason { >>>>>>>>>>>>>>>>> + FDR_REVALIDATE = 0, /* The flow was revalidated. */ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> It was called FDR_FLOW_LIVE before, which might make more >>>>>>>>>>>> sense. As the flow is just NOT deleted. It might or might not have >>>>>>>>>>>> been revalidated. Thoughts? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I think it had to have been revalidated if we emit the reason, >>>>>>>>>>>>>>> because >>>>>>>>>>>>>>> we only emit the reason code after revalidation. IE: there are >>>>>>>>>>>>>>> many >>>>>>>>>>>>>>> places where we skip revalidation but the flow stays live - and >>>>>>>>>>>>>>> we don't >>>>>>>>>>>>>>> emit reasons in those cases. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> So at least for this patch, it MUST have been revalidated. But >>>>>>>>>>>>>>> maybe in >>>>>>>>>>>>>>> the future, we would want to catch cases where the flow hasn't >>>>>>>>>>>>>>> been. In >>>>>>>>>>>>>>> that case, it makes sense to add the FDR_FLOW_LIVE at that time >>>>>>>>>>>>>>> - I >>>>>>>>>>>>>>> think. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Maybe you disagree? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Well, it depends on how you define revalidation, it might only >>>>>>>>>>>>>> have >>>>>>>>>>>> updated the counters. i.e. it all depends on ‘bool need_revalidate >>>>>>>>>>>> = >>>>>>>>>>>> ukey->reval_seq != reval_seq;’ in revalidate_ukey(). That was why I >>>>>>>>>>>> opted for a more general name. >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> + FDR_FLOW_IDLE, /* The flow went unused and was >>>>>>>>>>>>>>>>> deleted. */ >>>>>>>>>>>>>>>>> + FDR_TOO_EXPENSIVE, /* The flow was too expensive to >>>>>>>>>>>>>>>>> revalidate. */ >>>>>>>>>>>>>>>>> + FDR_FLOW_WILDCARDED, /* The flow needed a narrower >>>>>>>>>>>>>>>>> wildcard mask. */ >>>>>>>>>>>>>>>>> + FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow >>>>>>>>>>>>>>>>> fit. */ >>>>>>>>>>>>>>>>> + FDR_NO_OFPROTO, /* The flow didn't have an >>>>>>>>>>>>>>>>> associated ofproto. */ >>>>>>>>>>>>>>>>> + FDR_XLATION_ERROR, /* There was an error >>>>>>>>>>>>>>>>> translating the flow. */ >>>>>>>>>>>>>>>>> + FDR_AVOID_CACHING, /* Flow deleted to avoid >>>>>>>>>>>>>>>>> caching. */ >>>>>>>>>>>>>>>>> + FDR_FLOW_LIMIT, /* All flows being killed. */ >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Looking at the comment from Han on FDR_PURGE, and this patch >>>>>>>>>>>> needing another spin, we should probably add it. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> I can do that, sure. In that case, we will need to have a new >>>>>>>>>>>>>>> flow op >>>>>>>>>>>>>>> added to revalidator_sweep__ so that we can catch it. But in >>>>>>>>>>>>>>> that case, >>>>>>>>>>>>>>> it will be a different usdt probe, so I still don't know if we >>>>>>>>>>>>>>> need >>>>>>>>>>>>>>> FDR_PURGE right? WDYT? >>>>>>>>>>>>>> >>>>>>>>>>>>>> In revalidator_sweep__() you have sort of the following: >>>>>>>>>>>>>> >>>>>>>>>>>>>> if (purge || ukey_state == >>>>>>>>>>>>>> UKEY_INCONSISTENT) { >>>>>>>>>>>>>> result = UKEY_DELETE; >>>>>>>>>>>>>> } else if (!seq_mismatch) { >>>>>>>>>>>>>> >>>>>>>>>>>>>> And I’m afraid that if we use this tool to debug we miss the >>>>>>>>>>>> ukey_state == UKEY_INCONSISTENT when debugging and spent a long >>>>>>>>>>>> time >>>>>>>>>>>> figuring this out. >>>>>>>>>>>>>> Maybe add something general like this (did not give it a lot of >>>>>>>>>>>> thought), and only take the FDR_PURGE : FDR_UPDATE_FAIL results in >>>>>>>>>>>> the >>>>>>>>>>>> script? >>>>>>>>>>>>>> >>>>>>>>>>>>>> /* 'udpif_key's are responsible for tracking the little bit >>>>>>>>>>>>>> of state udpif >>>>>>>>>>>>>> @@ -2991,13 +2993,13 @@ revalidator_sweep__(struct >>>>> revalidator *revalidator, bool purge) >>>>>>>>>>>>>> uint64_t odp_actions_stub[1024 / 8]; >>>>>>>>>>>>>> struct ofpbuf odp_actions = >>>>>>>>>>>>>> OFPBUF_STUB_INITIALIZER(odp_actions_stub); >>>>>>>>>>>>>> >>>>>>>>>>>>>> - enum flow_del_reason reason = FDR_REVALIDATE; >>>>>>>>>>>>>> struct ukey_op ops[REVALIDATE_MAX_BATCH]; >>>>>>>>>>>>>> struct udpif_key *ukey; >>>>>>>>>>>>>> struct umap *umap = &udpif->ukeys[i]; >>>>>>>>>>>>>> size_t n_ops = 0; >>>>>>>>>>>>>> >>>>>>>>>>>>>> CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) { >>>>>>>>>>>>>> + enum flow_del_reason reason = FDR_REVALIDATE; >>>>>>>>>>>>>> enum ukey_state ukey_state; >>>>>>>>>>>>>> >>>>>>>>>>>>>> /* Handler threads could be holding a ukey lock >>>>>>>>>>>>>> while it installs a >>>>>>>>>>>>>> @@ -3016,8 +3018,10 @@ revalidator_sweep__(struct >>>>> revalidator *revalidator, bool purge) >>>>>>>>>>>>>> >>>>>>>>>>>>>> if (purge || ukey_state == >>>>>>>>>>>>>> UKEY_INCONSISTENT) { >>>>>>>>>>>>>> result = UKEY_DELETE; >>>>>>>>>>>>>> + reason = purge ? FDR_PURGE : >>>>>>>>>>>>>> FDR_UPDATE_FAIL; >>>>>>>>>>>>>> } else if (!seq_mismatch) { >>>>>>>>>>>>>> result = UKEY_KEEP; >>>>>>>>>>>>>> + reason = FDR_REVALIDATE; //_KEEP >>>>>>>>>>>>>> } else { >>>>>>>>>>>>>> struct dpif_flow_stats stats; >>>>>>>>>>>>>> COVERAGE_INC(revalidate_missed_dp_flow); >>>>>>>>>>>>>> @@ -3030,6 +3034,8 @@ revalidator_sweep__(struct revalidator >>>>>>>>>>>>>> *revalidator, bool purge) >>>>>>>>>>>>>> reval_op_init(&ops[n_ops++], result, >>>>>>>>>>>>>> udpif, ukey, &recircs, >>>>>>>>>>>>>> &odp_actions); >>>>>>>>>>>>>> } >>>>>>>>>>>>>> + OVS_USDT_PROBE(revalidator_sweep__, >>>>>>>>>>>>>> flow_result, result, >>>>>>>>>>>>>> + reason, udpif, ukey); >>>>>>>>>>>>>> } >>>>>>>>>>>>>> ovs_mutex_unlock(&ukey->mutex); >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> In addition in revalidator_sweep__() should the “enum >>>>>>>>>>>> flow_del_reason reason = FDR_REVALIDATE;” not be moved to the >>>>>>>>>>>> CMAP_FOR_EACH() loop? >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> +}; >>>>>>>>>>>>>>>>> + >>>>>>>>>>>>>>>>> /* 'udpif_key's are responsible for tracking the little >>>>>>>>>>>>>>>>> bit of state udpif >>>>>>>>>>>>>>>>> * needs to do flow expiration which can't be pulled >>>>>>>>>>>>>>>>> directly from the >>>>>>>>>>>>>>>>> * datapath. They may be created by any handler or >>>>>>>>>>>>>>>>> revalidator thread at any >>>>>>>>>>>>>>>>> @@ -2272,7 +2284,8 @@ populate_xcache(struct udpif *udpif, >>>>>>>>>>>>>>>>> struct udpif_key *ukey, >>>>>>>>>>>>>>>>> static enum reval_result >>>>>>>>>>>>>>>>> revalidate_ukey__(struct udpif *udpif, const struct >>>>>>>>>>>>>>>>> udpif_key *ukey, >>>>>>>>>>>>>>>>> uint16_t tcp_flags, struct ofpbuf >>>>>>>>>>>>>>>>> *odp_actions, >>>>>>>>>>>>>>>>> - struct recirc_refs *recircs, struct >>>>>>>>>>>>>>>>> xlate_cache *xcache) >>>>>>>>>>>>>>>>> + struct recirc_refs *recircs, struct >>>>>>>>>>>>>>>>> xlate_cache *xcache, >>>>>>>>>>>>>>>>> + enum flow_del_reason *reason) >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> struct xlate_out *xoutp; >>>>>>>>>>>>>>>>> struct netflow *netflow; >>>>>>>>>>>>>>>>> @@ -2293,11 +2306,13 @@ revalidate_ukey__(struct udpif *udpif, >>>>>>>>>>>> const struct udpif_key *ukey, >>>>>>>>>>>>>>>>> netflow = NULL; >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> if (xlate_ukey(udpif, ukey, tcp_flags, &ctx)) { >>>>>>>>>>>>>>>>> + *reason = FDR_XLATION_ERROR; >>>>>>>>>>>>>>>>> goto exit; >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> xoutp = &ctx.xout; >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> if (xoutp->avoid_caching) { >>>>>>>>>>>>>>>>> + *reason = FDR_AVOID_CACHING; >>>>>>>>>>>>>>>>> goto exit; >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> @@ -2311,6 +2326,7 @@ revalidate_ukey__(struct udpif *udpif, >>>>>>>>>>> const struct udpif_key *ukey, >>>>>>>>>>>>>>>>> ofpbuf_clear(odp_actions); >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> if (!ofproto) { >>>>>>>>>>>>>>>>> + *reason = FDR_NO_OFPROTO; >>>>>>>>>>>>>>>>> goto exit; >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> @@ -2322,6 +2338,7 @@ revalidate_ukey__(struct udpif *udpif, >>>>>>>>>>> const struct udpif_key *ukey, >>>>>>>>>>>>>>>>> if (odp_flow_key_to_mask(ukey->mask, ukey->mask_len, >>>>>>>>>>>>>>>>> &dp_mask, &ctx.flow, >>>>>>>>>>>>>>>>> NULL) >>>>>>>>>>>>>>>>> == ODP_FIT_ERROR) { >>>>>>>>>>>>>>>>> + *reason = FDR_BAD_ODP_FIT; >>>>>>>>>>>>>>>>> goto exit; >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> @@ -2331,6 +2348,7 @@ revalidate_ukey__(struct udpif *udpif, >>>>>>>>>>> const struct udpif_key *ukey, >>>>>>>>>>>>>>>>> * down. Note that we do not know if the datapath >>>>>>>>>>>>>>>>> has ignored any of the >>>>>>>>>>>>>>>>> * wildcarded bits, so we may be overly conservative >>>>>>>>>>>>>>>>> here. */ >>>>>>>>>>>>>>>>> if (flow_wildcards_has_extra(&dp_mask, ctx.wc)) { >>>>>>>>>>>>>>>>> + *reason = FDR_FLOW_WILDCARDED; >>>>>>>>>>>>>>>>> goto exit; >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> @@ -2400,7 +2418,7 @@ static enum reval_result >>>>>>>>>>>>>>>>> revalidate_ukey(struct udpif *udpif, struct udpif_key >>>>>>>>>>>>>>>>> *ukey, >>>>>>>>>>>>>>>>> const struct dpif_flow_stats *stats, >>>>>>>>>>>>>>>>> struct ofpbuf *odp_actions, uint64_t >>>>>>>>>>>>>>>>> reval_seq, >>>>>>>>>>>>>>>>> - struct recirc_refs *recircs) >>>>>>>>>>>>>>>>> + struct recirc_refs *recircs, enum >>>>>>>>>>>>>>>>> flow_del_reason *reason) >>>>>>>>>>>>>>>>> OVS_REQUIRES(ukey->mutex) >>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>> bool need_revalidate = ukey->reval_seq != reval_seq; >>>>>>>>>>>>>>>>> @@ -2430,8 +2448,12 @@ revalidate_ukey(struct udpif *udpif, >>>>>>>>>>>>>>>>> struct udpif_key *ukey, >>>>>>>>>>>>>>>>> xlate_cache_clear(ukey->xcache); >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> result = revalidate_ukey__(udpif, ukey, >>>>>>>>>>>>>>>>> push.tcp_flags, >>>>>>>>>>>>>>>>> - odp_actions, recircs, >>>>>>>>>>>>>>>>> ukey->xcache); >>>>>>>>>>>>>>>>> - } /* else delete; too expensive to revalidate */ >>>>>>>>>>>>>>>>> + odp_actions, recircs, >>>>>>>>>>>>>>>>> ukey->xcache, >>>>>>>>>>>>>>>>> + reason); >>>>>>>>>>>>>>>>> + } else { >>>>>>>>>>>>>>>>> + /* delete; too expensive to revalidate */ >>>>>>>>>>>>>>>>> + *reason = FDR_TOO_EXPENSIVE; >>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> } else if (!push.n_packets || ukey->xcache >>>>>>>>>>>>>>>>> || !populate_xcache(udpif, ukey, >>>>>>>>>>>>>>>>> push.tcp_flags)) { >>>>>>>>>>>>>>>>> result = UKEY_KEEP; >>>>>>>>>>>>>>>>> @@ -2831,6 +2853,7 @@ revalidate(struct revalidator >>>>>>>>>>>>>>>>> *revalidator) >>>>>>>>>>>>>>>>> for (f = flows; f < &flows[n_dumped]; f++) { >>>>>>>>>>>>>>>>> long long int used = f->stats.used; >>>>>>>>>>>>>>>>> struct recirc_refs recircs = >>>>>>>>>>>>>>>>> RECIRC_REFS_EMPTY_INITIALIZER; >>>>>>>>>>>>>>>>> + enum flow_del_reason reason = FDR_REVALIDATE; >>>>>>>>>>>>>>>>> struct dpif_flow_stats stats = f->stats; >>>>>>>>>>>>>>>>> enum reval_result result; >>>>>>>>>>>>>>>>> struct udpif_key *ukey; >>>>>>>>>>>>>>>>> @@ -2905,9 +2928,14 @@ revalidate(struct revalidator >>>>>>>>>>>>>>>>> *revalidator) >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> if (kill_them_all || (used && used < now - >>>>>>>>>>>>>>>>> max_idle)) { >>>>>>>>>>>>>>>>> result = UKEY_DELETE; >>>>>>>>>>>>>>>>> + if (kill_them_all) { >>>>>>>>>>>>>>>>> + reason = FDR_FLOW_LIMIT; >>>>>>>>>>>>>>>>> + } else { >>>>>>>>>>>>>>>>> + reason = FDR_FLOW_IDLE; >>>>>>>>>>>>>>>>> + } >>>>>>>>>>>>>>>>> } else { >>>>>>>>>>>>>>>>> result = revalidate_ukey(udpif, ukey, >>>>>>>>>>>>>>>>> &stats, &odp_actions, >>>>>>>>>>>>>>>>> - reval_seq, >>>>>>>>>>>>>>>>> &recircs); >>>>>>>>>>>>>>>>> + reval_seq, >>>>>>>>>>>>>>>>> &recircs, &reason); >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> ukey->dump_seq = dump_seq; >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> @@ -2916,6 +2944,7 @@ revalidate(struct revalidator >>>>>>>>>>>>>>>>> *revalidator) >>>>>>>>>>>>>>>>> udpif_update_flow_pps(udpif, ukey, f); >>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> + OVS_USDT_PROBE(revalidate, flow_result, reason, >>>>>>>>>>>>>>>>> udpif, ukey); >>>>>>>>>>>>> >>>>>>>>>>>>> I have been experimenting with several upcall tracking techniques >>>>>>>>>>>> that would make it easier to correlate upcalls with their >>>>>>>>>>>> subsequent >>>>>>>>>>>> related events. >>>>>>>>>>>>> To achieve that, we need (among other things) some easy-to-compare >>>>>>>>>>>> unique value in the events. For revalidation events, I think a good >>>>>>>>>>>> candidate would be "ukey->ufid" and so does the script in this >>>>>>>>>>>> patch. >>>>>>>>>>>>> >>>>>>>>>>>>> However, requiring all external tools to know the layout of >>>>>>>>>>>>> "struct >>>>>>>>>>>> udpif_key" in order to get that value makes things quite >>>>>>>>>>>> complicated >>>>>>>>>>>> for CORE tools (e.g: retis). >>>>>>>>>>>>> >>>>>>>>>>>>> With all this, would you consider adding the ufid to probe >>>>>>>>>>>>> payload directly? >>>>>>>>>>>> >>>>>>>>>>>> Dont see why we can not, but if we need anything else it would >>>>>>>>>>>> quickly >>>>>>>>>>>> explode in too much arguments. I guess CORE needs a good solution >>>>>>>>>>>> for >>>>>>>>>>>> userspace. >>>>>>>>>>> >>>>>>>>>>> I think having the ufid and udpif makes sense for now. Actually, >>>>>>>>>>> maybe >>>>>>>>>>> for the long term we can do something as an appctl command which >>>>>>>>>>> will >>>>>>>>>>> dump the relevant datastructures. That way a tool like this, which >>>>>>>>>>> needs the pid anyway, can generate an appctl command against the >>>>>>>>>>> daemon >>>>>>>>>>> and get the exact right header layout. >>>>>>>>>>> >>>>>>>>>>> WDYT? >>>>>>>>>> >>>>>>>>>> Update on this - I have it ready to go with the exception of this >>>>>>>>>> particular change. The USDT probe code generation seems to have a >>>>>>>>>> bug, >>>>>>>>>> or maybe it just won't work using these kinds of data types. For >>>>>>>>>> example, the bcc tools gets confused and ends up generating code for >>>>>>>>>> the >>>>>>>>>> probe like: >>>>>>>>>> >>>>>>>>>> __attribute__((always_inline)) >>>>>>>>>> static __always_inline int _bpf_readarg_usdt__flow_result_3(struct >>>>>>>>> pt_regs *ctx, void *dest, size_t len) { >>>>>>>>>> if (len != sizeof(d)) return -1; >>>>>>>>>> { u64 __addr = ctx->r13 + 40; __asm__ __volatile__("": : :"memory"); >>>>>>>>> uint128_t __res = 0x0; bpf_probe_read_user(&__res, sizeof(__res), >>>>>>>>> (void *)__addr); *((d *)dest) = __res; } >>>>>>>>>> return 0; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> This code isn't valid - the (d) variable isn't properly represented >>>>>>>>>> here. I guess maybe there's some issue with the way they use ctypes >>>>>>>>>> internally? It isn't clear to me. >>>>>>>>> >>>>>>>>> I miss some context, but is it because you point to ufid, which is a >>>>>>>>> union (typedef union ovs_u128)? What if you point to the u32 member of >>>>>>>>> the union? >>>>>>>> >>>>>>>> I'm not sure, but that doesn't seem like the best approach to me, tbh. >>>>>>>> >>>>>>>>>> We could work around this by making the ufid always pass as a void *, >>>>>>>>>> and passing a length that tells the size. However, if the underlying >>>>>>>>>> data type ever changes, it will be totally incompatible. I doubt it >>>>>>>>>> would ever change, but at least with something stronger typed, we >>>>>>>>>> could >>>>>>>>>> get a better exception that bubbled up. >>>>>>>>> >>>>>>>>> I do not like the pointer + length, as this adds another argument. I >>>>>>>>> can’t remember which, but some tools had problems with 5 or more >>>>>>>>> arguments. >>>>>>>> >>>>>>>> I do agree. I also want to test the latest version of bcc-tools also, >>>>>>>> to see if it really is just related to my f38 version of bcc-tools. >>>>>>>> That also raises a different issue, but it can be something we solve in >>>>>>>> the interrim with a version check. >>>>>>>> >>>>>>>>>> I'm okay with doing it this way, but just wanted to make sure that >>>>>>>>>> retis, et. al. would be okay with interpreting it this way. >>>>>>>>>> >>>>>>>>>> Alternatively, there's definitely a bug there - we could omit the >>>>>>>>>> argument entirely and rely on pahole or even pyelftools to help >>>>>>>>>> extract >>>>>>>>>> the struct layout and get the ufid from the udpif_key struct. The >>>>>>>>>> data >>>>>>>>>> is already being sent, so it seems like duplication a bit to send it, >>>>>>>>>> and becomes difficult in light of this bcc "bug?" >>>>>>>>> >>>>>>>>> I agree it’s overloading arguments with data already present, but I >>>>>>>>> guess CORE can not yet read debug data from userspace >>>>>>>>> applications. Adrian, I thought this was being worked on with the >>>>>>>>> Pahole guys. If this is close we could just leave out this extra >>>>>>>>> argument for now. >>>>>>>> >>>>>>>> That should be something doable, I think. For example, there are >>>>>>>> crates >>>>>>>> like gimli that parse DWARF data (but not sure if it is parsing v5, >>>>>>>> yet). >>>>>>>> >>>>>>>>> Alternatively, we could ask the BCC guys if this is a known issue, >>>>>>>>> with maybe a potential workaround. Just post it here >>>>>>>>> https://github.com/iovisor/bcc/issues the respond quickly. >>>>>>>> >>>>>>>> They will tell to run with the latest version, I think. The version >>>>>>>> I'm >>>>>>>> using is what ships with Fedora 38, and seems quite out of date. >>>>>>> >>>>>>> ACK, will wait for your conclusion and the v9. >>>>>> After tying to get it working with latest bcc-tools, the behavior is >>>>>> actually much worse. Whereas before the flow_reval_monitor script would >>>>>> fail during the bpf compilation (with a clear error message), the latest >>>>>> BCC tools will crash the python interpreter. >>>>>> I'm going to omit the ufid details. I think CO-RE will have support >>>>>> anyway for this kinds of output, so v9 I will keep the args as-is and >>>>>> when the support gets added to decode the debuginfo packages to pull >>>>>> those structures, they will get the support. >>>>>> >>>>> >>>>> Sorry, maybe I missed something. What support will CO-RE have? >>>> >>>> I guess, my understanding was the pahole work would be the thing CO-RE >>>> needs, but after reading your comments, maybe that isn't true. >>>> >>> >>> Doing CO-RE with some BTF generated by pahole from a debug file is just >>> possible in theory. It looks doable by using libbpf directly and even >>> there, CO-RE was meant to be used on the vmlinux (or module) BTF. I haven't >>> tried it yet. >>> Common ebpf frameworks such as aya, cillim/bpf, etc don't seem to expose >>> that >>> functionality AFAICS. >>> >>> So, not putting the ufid makes things difficult for CO-RE tools for the >>> forseable future. >>> >>> Also, Eelco, could you expand on the 5 argument limitation? I see >>> revalidate_ukey__:entry has 6 arguments and seems to work. >> >> The only thing I could remember is that I had problems with some tooling, >> but can’t recall which. So it might have been related to the tool, and not >> the actual tracepoint insertion part. >> >>> OTOH, although the ufid is variable-length, could we assume that at most >>> it's 128 bits? That's what the uAPI states. If so, could we send a raw >>> pointer and assume that length? I don't know any reason why OVS will at >>> some point reduce the size but if it does, it's likely that a 128bit buffer >>> will still be used (istead of using dynamic memory). >> >> I feel like duplicating arguments to make it easier for external tools, >> sounds like a bad design principle. >> > > If this was just another function, or even an API, I'd agree. > But for a tracepoint, exposing all the data that a user might possibly need > without requiring them to do struct introspection is not that bad. The > usdt:recv_upcall and both kernel tracepoints are some examples. As it is a debug hook, which could introduce extra code/ache line misses, I prefer the need for extra struct introspection at the debugging script. And yes the usdt:recv_upcall (and some others) were my own learning mistakes ;) But the problem is that once we add them, we should try (we do not guarantee this) to keep them stable. >>> Sorry for insisting, it's just that I would really like to consume this >>> from the retis side. >> >> How far away are we from dwarf to BTF, which then can be used by CO-RE? >> Maybe we should try this first, and if this is a dead-end we can re-visit >> and add the extra argument. If we add the argument now, we can not remove it >> later. >> > > I'd say we're pretty far. Even if possible, it would only be doable for > programs that use libbpf directly, which are very rare (most use cillim/ebpf > or some other framework). > > Even for retis, I don't think that calling pahole would be an option (other > than for a PoC), we would have to cook our own BTF encoder and use gimli to > extract DWARF information. Doesn’t CO-RE require libbpf anyway due to its need for BTF? However, if we keep making these changes because CO-RE is missing a major feature it might never get solved. I think we should figure out how CO-RE could solve this for USDTs. I think pahole uses the dwarves libraries, so maybe part of the functionality can be moved there. Maybe we can get the CO-RE people involved and get them to solve this? Cheers, Eelco _______________________________________________ dev mailing list d...@openvswitch.org https://mail.openvswitch.org/mailman/listinfo/ovs-dev