Eelco Chaudron <[email protected]> writes:
> On 1 Feb 2024, at 18:28, Aaron Conole wrote:
>
>> Eelco Chaudron <[email protected]> writes:
>>
>>> On 31 Jan 2024, at 18:03, Aaron Conole wrote:
>>>
>>>> Eelco Chaudron <[email protected]> writes:
>>>>
>>>>> On 25 Jan 2024, at 21:55, Aaron Conole wrote:
>>>>>
>>>>>> From: Kevin Sprague <[email protected]>
>>>>>>
>>>>>> During normal operations, it is useful to understand when a particular
>>>>>> flow
>>>>>> gets removed from the system. This can be useful when debugging
>>>>>> performance
>>>>>> issues tied to ofproto flow changes, trying to determine deployed traffic
>>>>>> patterns, or while debugging dynamic systems where ports come and go.
>>>>>>
>>>>>> Prior to this change, there was a lack of visibility around flow
>>>>>> expiration.
>>>>>> The existing debugging infrastructure could tell us when a flow was
>>>>>> added to
>>>>>> the datapath, but not when it was removed or why.
>>>>>>
>>>>>> This change introduces a USDT probe at the point where the revalidator
>>>>>> determines that the flow should be removed. Additionally, we track the
>>>>>> reason for the flow eviction and provide that information as well. With
>>>>>> this change, we can track the complete flow lifecycle for the netlink
>>>>>> datapath by hooking the upcall tracepoint in kernel, the flow put USDT,
>>>>>> and
>>>>>> the revaldiator USDT, letting us watch as flows are added and removed
>>>>>> from
>>>>>> the kernel datapath.
>>>>>>
>>>>>> This change only enables this information via USDT probe, so it won't be
>>>>>> possible to access this information any other way (see:
>>>>>> Documentation/topics/usdt-probes.rst).
>>>>>>
>>>>>> Also included is a script (utilities/usdt-scripts/flow_reval_monitor.py)
>>>>>> which serves as a demonstration of how the new USDT probe might be used
>>>>>> going forward.
>>>>>>
>>>>>> Signed-off-by: Kevin Sprague <[email protected]>
>>>>>> Co-authored-by: Aaron Conole <[email protected]>
>>>>>> Signed-off-by: Aaron Conole <[email protected]>
>>>>>
>>>>> Thanks for following this up Aaron! See comments on this patch below. I
>>>>> have no additional comments on patch 2.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Eelco
>>>>>
>>>>>
>>>>>> ---
>>>>>> Documentation/topics/usdt-probes.rst | 1 +
>>>>>> ofproto/ofproto-dpif-upcall.c | 42 +-
>>>>>> utilities/automake.mk | 3 +
>>>>>> utilities/usdt-scripts/flow_reval_monitor.py | 653 +++++++++++++++++++
>>>>>> 4 files changed, 693 insertions(+), 6 deletions(-)
>>>>>> create mode 100755 utilities/usdt-scripts/flow_reval_monitor.py
>>>>>>
>>>>>> diff --git a/Documentation/topics/usdt-probes.rst
>>>>>> b/Documentation/topics/usdt-probes.rst
>>>>>> index e527f43bab..a8da9bb1f7 100644
>>>>>> --- a/Documentation/topics/usdt-probes.rst
>>>>>> +++ b/Documentation/topics/usdt-probes.rst
>>>>>> @@ -214,6 +214,7 @@ Available probes in ``ovs_vswitchd``:
>>>>>> - dpif_recv:recv_upcall
>>>>>> - main:poll_block
>>>>>> - main:run_start
>>>>>> +- revalidate:flow_result
>>>>>> - revalidate_ukey\_\_:entry
>>>>>> - revalidate_ukey\_\_:exit
>>>>>> - udpif_revalidator:start_dump
>>>>>
>>>>> You are missing the specific flow_result result section. This is from the
>>>>> previous patch:
>>>>
>>>> D'oh! Thanks for catching it. I'll re-add it.
>>>>
>>>>> @@ -358,6 +360,27 @@ See also the ``main:run_start`` probe above.
>>>>> - ``utilities/usdt-scripts/bridge_loop.bt``
>>>>>
>>>>>
>>>>> +probe revalidate:flow_result
>>>>> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>>>> +
>>>>> +**Description**:
>>>>> +This probe is triggered when the revalidator decides whether or not to
>>>>> +revalidate a flow. ``reason`` is an enum that denotes that either the
>>>>> flow
>>>>> +is being kept, or the reason why the flow is being deleted. The
>>>>> +``flow_reval_monitor.py`` script uses this probe to notify users when
>>>>> flows
>>>>> +matching user-provided criteria are deleted.
>>>>> +
>>>>> +**Arguments**:
>>>>> +
>>>>> +- *arg0*: ``(enum flow_del_reason) reason``
>>>>> +- *arg1*: ``(struct udpif *) udpif``
>>>>> +- *arg2*: ``(struct udpif_key *) ukey``
>>>>> +
>>>>> +**Script references**:
>>>>> +
>>>>> +- ``utilities/usdt-scripts/flow_reval_monitor.py``
>>>>> +
>>>>> +
>>>>> Adding your own probes
>>>>> ----------------------
>>>>>
>>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c
>>>>>> b/ofproto/ofproto-dpif-upcall.c
>>>>>> index b5cbeed878..97d75833f7 100644
>>>>>> --- a/ofproto/ofproto-dpif-upcall.c
>>>>>> +++ b/ofproto/ofproto-dpif-upcall.c
>>>>>> @@ -269,6 +269,18 @@ enum ukey_state {
>>>>>> };
>>>>>> #define N_UKEY_STATES (UKEY_DELETED + 1)
>>>>>>
>>>>>> +enum flow_del_reason {
>>>>>> + FDR_REVALIDATE = 0, /* The flow was revalidated. */
>>>>>
>>>>> It was called FDR_FLOW_LIVE before, which might make more sense. As the
>>>>> flow is just NOT deleted. It might or might not have been revalidated.
>>>>> Thoughts?
>>>>
>>>> I think it had to have been revalidated if we emit the reason, because
>>>> we only emit the reason code after revalidation. IE: there are many
>>>> places where we skip revalidation but the flow stays live - and we don't
>>>> emit reasons in those cases.
>>>>
>>>> So at least for this patch, it MUST have been revalidated. But maybe in
>>>> the future, we would want to catch cases where the flow hasn't been. In
>>>> that case, it makes sense to add the FDR_FLOW_LIVE at that time - I
>>>> think.
>>>>
>>>> Maybe you disagree?
>>>
>>> Well, it depends on how you define revalidation, it might only have updated
>>> the counters. i.e. it all depends on ‘bool need_revalidate =
>>> ukey->reval_seq != reval_seq;’ in revalidate_ukey(). That was why I opted
>>> for a more general name.
>>
>> Yes, but FDR_FLOW_LIVE seems too generic. At least, FDR_REVALIDATE
>> means it was flagged for revalidation, and in this case we did that, but
>> we just didn't need to do any key comparisons.
>>
>> I guess it doesn't matter too much, but the comment is that the flow
>> went through revalidation. If there becomes some other case where it
>> never calls revalidate_ukey() maybe that makes sense?
>
> Looking at the probe again, we should have the result as well, as FDR_ stands
> for flow delete reason, so if we do not delete the reason makes no sense. I
> gues FDR_REVALIDATE, shoud he FDR_NONE.
>
> What about the following change making it clearer, and the probes to be more
> precise (also change probe arg order to be more aligned with other probes):
>
>
> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
> index 97d75833f..61fba1766 100644
> --- a/ofproto/ofproto-dpif-upcall.c
> +++ b/ofproto/ofproto-dpif-upcall.c
> @@ -270,15 +270,17 @@ enum ukey_state {
> #define N_UKEY_STATES (UKEY_DELETED + 1)
>
> enum flow_del_reason {
> - FDR_REVALIDATE = 0, /* The flow was revalidated. */
> + FDR_NONE = 0, /* The flow was deleted. */
> + FDR_AVOID_CACHING, /* Flow deleted to avoid caching. */
> + FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow fit. */
> FDR_FLOW_IDLE, /* The flow went unused and was deleted. */
> - FDR_TOO_EXPENSIVE, /* The flow was too expensive to revalidate. */
> + FDR_FLOW_LIMIT, /* Flow limit reached, removing all flows. */
> FDR_FLOW_WILDCARDED, /* The flow needed a narrower wildcard mask. */
> - FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow fit. */
> + FDR_PURGE, /* User requested flow purge. */
> FDR_NO_OFPROTO, /* The flow didn't have an associated ofproto. */
> + FDR_TOO_EXPENSIVE, /* The flow was too expensive to revalidate. */
> + FDR_UPDATE_FAIL, /* Updating the flow in the dp failed. */
> FDR_XLATION_ERROR, /* There was an error translating the flow. */
> - FDR_AVOID_CACHING, /* Flow deleted to avoid caching. */
> - FDR_FLOW_LIMIT, /* All flows being killed. */
> };
>
> /* 'udpif_key's are responsible for tracking the little bit of state udpif
> @@ -2285,7 +2287,7 @@ static enum reval_result
> revalidate_ukey__(struct udpif *udpif, const struct udpif_key *ukey,
> uint16_t tcp_flags, struct ofpbuf *odp_actions,
> struct recirc_refs *recircs, struct xlate_cache *xcache,
> - enum flow_del_reason *reason)
> + enum flow_del_reason *del_reason)
> {
> struct xlate_out *xoutp;
> struct netflow *netflow;
> @@ -2306,13 +2308,13 @@ revalidate_ukey__(struct udpif *udpif, const struct
> udpif_key *ukey,
> netflow = NULL;
>
> if (xlate_ukey(udpif, ukey, tcp_flags, &ctx)) {
> - *reason = FDR_XLATION_ERROR;
> + *del_reason = FDR_XLATION_ERROR;
> goto exit;
> }
> xoutp = &ctx.xout;
>
> if (xoutp->avoid_caching) {
> - *reason = FDR_AVOID_CACHING;
> + *del_reason = FDR_AVOID_CACHING;
> goto exit;
> }
>
> @@ -2326,7 +2328,7 @@ revalidate_ukey__(struct udpif *udpif, const struct
> udpif_key *ukey,
> ofpbuf_clear(odp_actions);
>
> if (!ofproto) {
> - *reason = FDR_NO_OFPROTO;
> + *del_reason = FDR_NO_OFPROTO;
> goto exit;
> }
>
> @@ -2338,7 +2340,7 @@ revalidate_ukey__(struct udpif *udpif, const struct
> udpif_key *ukey,
> if (odp_flow_key_to_mask(ukey->mask, ukey->mask_len, &dp_mask, &ctx.flow,
> NULL)
> == ODP_FIT_ERROR) {
> - *reason = FDR_BAD_ODP_FIT;
> + *del_reason = FDR_BAD_ODP_FIT;
> goto exit;
> }
>
> @@ -2348,7 +2350,7 @@ revalidate_ukey__(struct udpif *udpif, const struct
> udpif_key *ukey,
> * down. Note that we do not know if the datapath has ignored any of the
> * wildcarded bits, so we may be overly conservative here. */
> if (flow_wildcards_has_extra(&dp_mask, ctx.wc)) {
> - *reason = FDR_FLOW_WILDCARDED;
> + *del_reason = FDR_FLOW_WILDCARDED;
> goto exit;
> }
>
> @@ -2418,13 +2420,14 @@ static enum reval_result
> revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
> const struct dpif_flow_stats *stats,
> struct ofpbuf *odp_actions, uint64_t reval_seq,
> - struct recirc_refs *recircs, enum flow_del_reason *reason)
> + struct recirc_refs *recircs, enum flow_del_reason
> *del_reason)
> OVS_REQUIRES(ukey->mutex)
> {
> bool need_revalidate = ukey->reval_seq != reval_seq;
> enum reval_result result = UKEY_DELETE;
> struct dpif_flow_stats push;
>
> + *del_reason = FDR_NONE;
> ofpbuf_clear(odp_actions);
>
> push.used = stats->used;
> @@ -2449,10 +2452,10 @@ revalidate_ukey(struct udpif *udpif, struct udpif_key
> *ukey,
> }
> result = revalidate_ukey__(udpif, ukey, push.tcp_flags,
> odp_actions, recircs, ukey->xcache,
> - reason);
> + del_reason);
> } else {
> /* delete; too expensive to revalidate */
> - *reason = FDR_TOO_EXPENSIVE;
> + *del_reason = FDR_TOO_EXPENSIVE;
> }
> } else if (!push.n_packets || ukey->xcache
> || !populate_xcache(udpif, ukey, push.tcp_flags)) {
> @@ -2853,8 +2856,8 @@ revalidate(struct revalidator *revalidator)
> for (f = flows; f < &flows[n_dumped]; f++) {
> long long int used = f->stats.used;
> struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER;
> - enum flow_del_reason reason = FDR_REVALIDATE;
> struct dpif_flow_stats stats = f->stats;
> + enum flow_del_reason del_reason;
> enum reval_result result;
> struct udpif_key *ukey;
> bool already_dumped;
> @@ -2929,13 +2932,13 @@ revalidate(struct revalidator *revalidator)
> if (kill_them_all || (used && used < now - max_idle)) {
> result = UKEY_DELETE;
> if (kill_them_all) {
> - reason = FDR_FLOW_LIMIT;
> + del_reason = FDR_FLOW_LIMIT;
> } else {
> - reason = FDR_FLOW_IDLE;
> + del_reason = FDR_FLOW_IDLE;
> }
> } else {
> result = revalidate_ukey(udpif, ukey, &stats, &odp_actions,
> - reval_seq, &recircs, &reason);
> + reval_seq, &recircs, &del_reason);
> }
> ukey->dump_seq = dump_seq;
>
> @@ -2944,7 +2947,9 @@ revalidate(struct revalidator *revalidator)
> udpif_update_flow_pps(udpif, ukey, f);
> }
>
> - OVS_USDT_PROBE(revalidate, flow_result, reason, udpif, ukey);
> + OVS_USDT_PROBE(revalidate, flow_result, udpif, ukey, result,
> + del_reason);
> +
> if (result != UKEY_KEEP) {
> /* Takes ownership of 'recircs'. */
> reval_op_init(&ops[n_ops++], result, udpif, ukey, &recircs,
> @@ -2991,7 +2996,6 @@ revalidator_sweep__(struct revalidator *revalidator,
> bool purge)
> uint64_t odp_actions_stub[1024 / 8];
> struct ofpbuf odp_actions =
> OFPBUF_STUB_INITIALIZER(odp_actions_stub);
>
> - enum flow_del_reason reason = FDR_REVALIDATE;
> struct ukey_op ops[REVALIDATE_MAX_BATCH];
> struct udpif_key *ukey;
> struct umap *umap = &udpif->ukeys[i];
> @@ -3012,10 +3016,12 @@ revalidator_sweep__(struct revalidator *revalidator,
> bool purge)
> struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER;
> bool seq_mismatch = (ukey->dump_seq != dump_seq
> && ukey->reval_seq != reval_seq);
> + enum flow_del_reason del_reason = FDR_NONE;
> enum reval_result result;
>
> if (purge || ukey_state == UKEY_INCONSISTENT) {
> result = UKEY_DELETE;
> + del_reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL;
> } else if (!seq_mismatch) {
> result = UKEY_KEEP;
> } else {
> @@ -3023,8 +3029,12 @@ revalidator_sweep__(struct revalidator *revalidator,
> bool purge)
> COVERAGE_INC(revalidate_missed_dp_flow);
> memcpy(&stats, &ukey->stats, sizeof stats);
> result = revalidate_ukey(udpif, ukey, &stats,
> &odp_actions,
> - reval_seq, &recircs, &reason);
> + reval_seq, &recircs,
> &del_reason);
> }
> +
> + OVS_USDT_PROBE(revalidator_sweep__, flow_result, udpif, ukey,
> + result, del_reason);
> +
> if (result != UKEY_KEEP) {
> /* Clears 'recircs' if filled by revalidate_ukey(). */
> reval_op_init(&ops[n_ops++], result, udpif, ukey,
> &recircs,
>
Okay - the above makes sense to me, with reasons included, so I'll fold
it in.
>>>>>> + FDR_FLOW_IDLE, /* The flow went unused and was deleted. */
>>>>>> + FDR_TOO_EXPENSIVE, /* The flow was too expensive to
>>>>>> revalidate. */
>>>>>> + FDR_FLOW_WILDCARDED, /* The flow needed a narrower wildcard
>>>>>> mask. */
>>>>>> + FDR_BAD_ODP_FIT, /* The flow had a bad ODP flow fit. */
>>>>>> + FDR_NO_OFPROTO, /* The flow didn't have an associated
>>>>>> ofproto. */
>>>>>> + FDR_XLATION_ERROR, /* There was an error translating the flow.
>>>>>> */
>>>>>> + FDR_AVOID_CACHING, /* Flow deleted to avoid caching. */
>>>>>> + FDR_FLOW_LIMIT, /* All flows being killed. */
>>>>>
>>>>> Looking at the comment from Han on FDR_PURGE, and this patch needing
>>>>> another spin, we should probably add it.
>>>>
>>>> I can do that, sure. In that case, we will need to have a new flow op
>>>> added to revalidator_sweep__ so that we can catch it. But in that case,
>>>> it will be a different usdt probe, so I still don't know if we need
>>>> FDR_PURGE right? WDYT?
>>>
>>> In revalidator_sweep__() you have sort of the following:
>>>
>>> if (purge || ukey_state == UKEY_INCONSISTENT) {
>>> result = UKEY_DELETE;
>>> } else if (!seq_mismatch) {
>>>
>>> And I’m afraid that if we use this tool to debug we miss the ukey_state ==
>>> UKEY_INCONSISTENT when debugging and spent a long time figuring this out.
>>> Maybe add something general like this (did not give it a lot of thought),
>>> and only take the FDR_PURGE : FDR_UPDATE_FAIL results in the script?
>>
>> I see - I guess that only happens when the error case errors out, but I
>> guess we can have it.
>>
>>> /* 'udpif_key's are responsible for tracking the little bit of state udpif
>>> @@ -2991,13 +2993,13 @@ revalidator_sweep__(struct revalidator
>>> *revalidator, bool purge)
>>> uint64_t odp_actions_stub[1024 / 8];
>>> struct ofpbuf odp_actions =
>>> OFPBUF_STUB_INITIALIZER(odp_actions_stub);
>>>
>>> - enum flow_del_reason reason = FDR_REVALIDATE;
>>> struct ukey_op ops[REVALIDATE_MAX_BATCH];
>>> struct udpif_key *ukey;
>>> struct umap *umap = &udpif->ukeys[i];
>>> size_t n_ops = 0;
>>>
>>> CMAP_FOR_EACH(ukey, cmap_node, &umap->cmap) {
>>> + enum flow_del_reason reason = FDR_REVALIDATE;
>>> enum ukey_state ukey_state;
>>>
>>> /* Handler threads could be holding a ukey lock while it
>>> installs a
>>> @@ -3016,8 +3018,10 @@ revalidator_sweep__(struct revalidator *revalidator,
>>> bool purge)
>>>
>>> if (purge || ukey_state == UKEY_INCONSISTENT) {
>>> result = UKEY_DELETE;
>>> + reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL;
>>> } else if (!seq_mismatch) {
>>> result = UKEY_KEEP;
>>> + reason = FDR_REVALIDATE; //_KEEP
>>> } else {
>>> struct dpif_flow_stats stats;
>>> COVERAGE_INC(revalidate_missed_dp_flow);
>>> @@ -3030,6 +3034,8 @@ revalidator_sweep__(struct revalidator *revalidator,
>>> bool purge)
>>> reval_op_init(&ops[n_ops++], result, udpif, ukey,
>>> &recircs,
>>> &odp_actions);
>>> }
>>> + OVS_USDT_PROBE(revalidator_sweep__, flow_result, result,
>>> + reason, udpif, ukey);
>>> }
>>> ovs_mutex_unlock(&ukey->mutex);
>>>
>>>
>>>
>>>
>>> In addition in revalidator_sweep__() should the “enum flow_del_reason
>>> reason = FDR_REVALIDATE;” not be moved to the CMAP_FOR_EACH() loop?
>>
>> Okay - I can add this probe as well (and document it, etc). I think it
>> probably will be pretty noisy in the cases it happens. We can change
>> the scope as well.
>
> It will only be noisy on the flush, but this is fine. Now we will also catch
> users doing stuff (you never know what kind of script/debugging they do in
> production).
>
> See code suggestion above.
ACK - thanks.
>>>
>>>>>> +};
>>>>>> +
>>>>>> /* 'udpif_key's are responsible for tracking the little bit of state
>>>>>> udpif
>>>>>> * needs to do flow expiration which can't be pulled directly from the
>>>>>> * datapath. They may be created by any handler or revalidator thread
>>>>>> at any
>>>>>> @@ -2272,7 +2284,8 @@ populate_xcache(struct udpif *udpif, struct
>>>>>> udpif_key *ukey,
>>>>>> static enum reval_result
>>>>>> revalidate_ukey__(struct udpif *udpif, const struct udpif_key *ukey,
>>>>>> uint16_t tcp_flags, struct ofpbuf *odp_actions,
>>>>>> - struct recirc_refs *recircs, struct xlate_cache
>>>>>> *xcache)
>>>>>> + struct recirc_refs *recircs, struct xlate_cache
>>>>>> *xcache,
>>>>>> + enum flow_del_reason *reason)
>>>>>> {
>>>>>> struct xlate_out *xoutp;
>>>>>> struct netflow *netflow;
>>>>>> @@ -2293,11 +2306,13 @@ revalidate_ukey__(struct udpif *udpif, const
>>>>>> struct udpif_key *ukey,
>>>>>> netflow = NULL;
>>>>>>
>>>>>> if (xlate_ukey(udpif, ukey, tcp_flags, &ctx)) {
>>>>>> + *reason = FDR_XLATION_ERROR;
>>>>>> goto exit;
>>>>>> }
>>>>>> xoutp = &ctx.xout;
>>>>>>
>>>>>> if (xoutp->avoid_caching) {
>>>>>> + *reason = FDR_AVOID_CACHING;
>>>>>> goto exit;
>>>>>> }
>>>>>>
>>>>>> @@ -2311,6 +2326,7 @@ revalidate_ukey__(struct udpif *udpif, const
>>>>>> struct udpif_key *ukey,
>>>>>> ofpbuf_clear(odp_actions);
>>>>>>
>>>>>> if (!ofproto) {
>>>>>> + *reason = FDR_NO_OFPROTO;
>>>>>> goto exit;
>>>>>> }
>>>>>>
>>>>>> @@ -2322,6 +2338,7 @@ revalidate_ukey__(struct udpif *udpif, const
>>>>>> struct udpif_key *ukey,
>>>>>> if (odp_flow_key_to_mask(ukey->mask, ukey->mask_len, &dp_mask,
>>>>>> &ctx.flow,
>>>>>> NULL)
>>>>>> == ODP_FIT_ERROR) {
>>>>>> + *reason = FDR_BAD_ODP_FIT;
>>>>>> goto exit;
>>>>>> }
>>>>>>
>>>>>> @@ -2331,6 +2348,7 @@ revalidate_ukey__(struct udpif *udpif, const
>>>>>> struct udpif_key *ukey,
>>>>>> * down. Note that we do not know if the datapath has ignored any
>>>>>> of the
>>>>>> * wildcarded bits, so we may be overly conservative here. */
>>>>>> if (flow_wildcards_has_extra(&dp_mask, ctx.wc)) {
>>>>>> + *reason = FDR_FLOW_WILDCARDED;
>>>>>> goto exit;
>>>>>> }
>>>>>>
>>>>>> @@ -2400,7 +2418,7 @@ static enum reval_result
>>>>>> revalidate_ukey(struct udpif *udpif, struct udpif_key *ukey,
>>>>>> const struct dpif_flow_stats *stats,
>>>>>> struct ofpbuf *odp_actions, uint64_t reval_seq,
>>>>>> - struct recirc_refs *recircs)
>>>>>> + struct recirc_refs *recircs, enum flow_del_reason
>>>>>> *reason)
>>>>>> OVS_REQUIRES(ukey->mutex)
>>>>>> {
>>>>>> bool need_revalidate = ukey->reval_seq != reval_seq;
>>>>>> @@ -2430,8 +2448,12 @@ revalidate_ukey(struct udpif *udpif, struct
>>>>>> udpif_key *ukey,
>>>>>> xlate_cache_clear(ukey->xcache);
>>>>>> }
>>>>>> result = revalidate_ukey__(udpif, ukey, push.tcp_flags,
>>>>>> - odp_actions, recircs,
>>>>>> ukey->xcache);
>>>>>> - } /* else delete; too expensive to revalidate */
>>>>>> + odp_actions, recircs,
>>>>>> ukey->xcache,
>>>>>> + reason);
>>>>>> + } else {
>>>>>> + /* delete; too expensive to revalidate */
>>>>>> + *reason = FDR_TOO_EXPENSIVE;
>>>>>> + }
>>>>>> } else if (!push.n_packets || ukey->xcache
>>>>>> || !populate_xcache(udpif, ukey, push.tcp_flags)) {
>>>>>> result = UKEY_KEEP;
>>>>>> @@ -2831,6 +2853,7 @@ revalidate(struct revalidator *revalidator)
>>>>>> for (f = flows; f < &flows[n_dumped]; f++) {
>>>>>> long long int used = f->stats.used;
>>>>>> struct recirc_refs recircs = RECIRC_REFS_EMPTY_INITIALIZER;
>>>>>> + enum flow_del_reason reason = FDR_REVALIDATE;
>>>>>> struct dpif_flow_stats stats = f->stats;
>>>>>> enum reval_result result;
>>>>>> struct udpif_key *ukey;
>>>>>> @@ -2905,9 +2928,14 @@ revalidate(struct revalidator *revalidator)
>>>>>> }
>>>>>> if (kill_them_all || (used && used < now - max_idle)) {
>>>>>> result = UKEY_DELETE;
>>>>>> + if (kill_them_all) {
>>>>>> + reason = FDR_FLOW_LIMIT;
>>>>>> + } else {
>>>>>> + reason = FDR_FLOW_IDLE;
>>>>>> + }
>>>>>> } else {
>>>>>> result = revalidate_ukey(udpif, ukey, &stats,
>>>>>> &odp_actions,
>>>>>> - reval_seq, &recircs);
>>>>>> + reval_seq, &recircs, &reason);
>>>>>> }
>>>>>> ukey->dump_seq = dump_seq;
>>>>>>
>>>>>> @@ -2916,6 +2944,7 @@ revalidate(struct revalidator *revalidator)
>>>>>> udpif_update_flow_pps(udpif, ukey, f);
>>>>>> }
>>>>>>
>>>>>> + OVS_USDT_PROBE(revalidate, flow_result, reason, udpif,
>>>>>> ukey);
>>>>>> if (result != UKEY_KEEP) {
>>>>>> /* Takes ownership of 'recircs'. */
>>>>>> reval_op_init(&ops[n_ops++], result, udpif, ukey,
>>>>>> &recircs,
>>>>>> @@ -2962,6 +2991,7 @@ revalidator_sweep__(struct revalidator
>>>>>> *revalidator, bool purge)
>>>>>> uint64_t odp_actions_stub[1024 / 8];
>>>>>> struct ofpbuf odp_actions =
>>>>>> OFPBUF_STUB_INITIALIZER(odp_actions_stub);
>>>>>>
>>>>>> + enum flow_del_reason reason = FDR_REVALIDATE;
>>>>>> struct ukey_op ops[REVALIDATE_MAX_BATCH];
>>>>>> struct udpif_key *ukey;
>>>>>> struct umap *umap = &udpif->ukeys[i];
>>>>>> @@ -2993,7 +3023,7 @@ revalidator_sweep__(struct revalidator
>>>>>> *revalidator, bool purge)
>>>>>> COVERAGE_INC(revalidate_missed_dp_flow);
>>>>>> memcpy(&stats, &ukey->stats, sizeof stats);
>>>>>> result = revalidate_ukey(udpif, ukey, &stats,
>>>>>> &odp_actions,
>>>>>> - reval_seq, &recircs);
>>>>>> + reval_seq, &recircs,
>>>>>> &reason);
>>>>>> }
>>>>>> if (result != UKEY_KEEP) {
>>>>>> /* Clears 'recircs' if filled by revalidate_ukey().
>>>>>> */
>>>>>> diff --git a/utilities/automake.mk b/utilities/automake.mk
>>>>>> index 9a2114df40..146b8c37fb 100644
>>>>>> --- a/utilities/automake.mk
>>>>>> +++ b/utilities/automake.mk
>>>>>> @@ -23,6 +23,7 @@ scripts_DATA += utilities/ovs-lib
>>>>>> usdt_SCRIPTS += \
>>>>>> utilities/usdt-scripts/bridge_loop.bt \
>>>>>> utilities/usdt-scripts/dpif_nl_exec_monitor.py \
>>>>>> + utilities/usdt-scripts/flow_reval_monitor.py \
>>>>>> utilities/usdt-scripts/kernel_delay.py \
>>>>>> utilities/usdt-scripts/kernel_delay.rst \
>>>>>> utilities/usdt-scripts/reval_monitor.py \
>>>>>> @@ -72,6 +73,7 @@ EXTRA_DIST += \
>>>>>> utilities/docker/debian/build-kernel-modules.sh \
>>>>>> utilities/usdt-scripts/bridge_loop.bt \
>>>>>> utilities/usdt-scripts/dpif_nl_exec_monitor.py \
>>>>>> + utilities/usdt-scripts/flow_reval_monitor.py \
>>>>>> utilities/usdt-scripts/kernel_delay.py \
>>>>>> utilities/usdt-scripts/kernel_delay.rst \
>>>>>> utilities/usdt-scripts/reval_monitor.py \
>>>>>> @@ -146,6 +148,7 @@ FLAKE8_PYFILES += utilities/ovs-pcap.in \
>>>>>> utilities/ovs-tcpdump.in \
>>>>>> utilities/ovs-pipegen.py \
>>>>>> utilities/usdt-scripts/dpif_nl_exec_monitor.py \
>>>>>> + utilities/usdt-scripts/flow_reval_monitor.py \
>>>>>> utilities/usdt-scripts/upcall_monitor.py \
>>>>>> utilities/usdt-scripts/upcall_cost.py
>>>>>>
>>>>>> diff --git a/utilities/usdt-scripts/flow_reval_monitor.py
>>>>>> b/utilities/usdt-scripts/flow_reval_monitor.py
>>>>>> new file mode 100755
>>>>>> index 0000000000..e808020bb5
>>>>>> --- /dev/null
>>>>>> +++ b/utilities/usdt-scripts/flow_reval_monitor.py
>>>>>> @@ -0,0 +1,653 @@
>>>>>> +#!/usr/bin/env python3
>>>>>> +#
>>>>>> +# Copyright (c) 2022 Redhat, Inc.
>>>>>> +#
>>>>>> +# Licensed under the Apache License, Version 2.0 (the "License");
>>>>>> +# you may not use this file except in compliance with the License.
>>>>>> +# You may obtain a copy of the License at:
>>>>>> +#
>>>>>> +# http://www.apache.org/licenses/LICENSE-2.0
>>>>>> +#
>>>>>> +# Unless required by applicable law or agreed to in writing, software
>>>>>> +# distributed under the License is distributed on an "AS IS" BASIS,
>>>>>> +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
>>>>>> implied.
>>>>>> +# See the License for the specific language governing permissions and
>>>>>> +# limitations under the License.
>>>>>> +#
>>>>>> +# Script information:
>>>>>> +# -------------------
>>>>>> +# flow_reval_monitor.py uses the dpif_netlink_operate:flow_put and
>>>>>> +# revalidator:flow_result USDT probes to monitor flow lifetimes and
>>>>>> +# expiration events. By default, this will show all flow_put and flow
>>>>>> +# expiration events, along with their reasons. This will look like so:
>>>>>> +#
>>>>>> +# TIME UFID
>>>>>> EVENT/REASON
>>>>>> +# 101536.226986736 ufid:f76fc899-376d-466b-bc74-0000b933eb97
>>>>>> flow_put
>>>>>> +# 101536.227196214 ufid:d08472b6-110e-46cb-a9e4-00008f46838e
>>>>>> flow_put
>>>>>> +# 101541.516610178 ufid:fc5cc4a2-39e7-4a2d-bbce-000019665b32
>>>>>> flow_put
>>>>>> +# 101541.516967303 ufid:fddd6510-26dc-4c87-8f7a-0000fc0c2c3a
>>>>>> flow_put
>>>>>> +# 101551.688050747 ufid:fddd6510-26dc-4c87-8f7a-0000fc0c2c3a flow
>>>>>> timed out
>>>>>> +# 101551.688077175 ufid:fc5cc4a2-39e7-4a2d-bbce-000019665b32 flow
>>>>>> timed out
>>>>>> +# 101557.695391371 ufid:f76fc899-376d-466b-bc74-0000b933eb97 flow
>>>>>> timed out
>>>>>> +# 101557.695408909 ufid:d08472b6-110e-46cb-a9e4-00008f46838e flow
>>>>>> timed out
>>>>>> +#
>>>>>> +# flow key data can be printed using the --flow-keys option. This will
>>>>>> +# print the equivalent datapath flow string.
>>>>>> +#
>>>>>> +# When filtering flows, the syntax is the same as used by
>>>>>> +# `ovs-appctl dpctl/add-flow`.
>>>>>> +#
>>>>>> +# The following options are available:
>>>>>> +#
>>>>>> +# usage: flow_reval_monitor.py [-h] [--buffer-page-count NUMBER]
>>>>>> +# [-k [FLOW_KEYS]] [-p VSWITCHD_PID]
>>>>>> +# [-D [DEBUG]] [-f [FLOW STRING ...]]
>>>>>
>>>>> # usage: flow_reval_monitor.py [-h] [--buffer-page-count NUMBER]
>>>>> [-f [64-2048]] [-k] [-l [FLOW_STRING
>>>>> ...]]
>>>>> [-p VSWITCHD_PID] [-D [DEBUG]]
>>>>
>>>> Oops, I'll fix it up.
>>>>
>>>>>> +#
>>>>>> +# optional arguments:
>>>>>> +# -h, --help show this help message and exit
>>>>>> +# --buffer-page-count NUMBER
>>>>>> +# Number of BPF ring buffer pages, default 1024
>>>>>> +# -f <64..2048>, --flow-key-size=<64..2048>
>>>>>> +# Set the size of the flow key, default 64
>>>>>> +# -k, --flow-keys Print flow keys as flow strings
>>>>>> +# -l [FLOW_STRING ...], --filter-flows [FLOW_STRING ...]
>>>>>> +# Filter flows that match the specified
>>>>>> ODP-like flow
>>>>>
>>>>> We do not filter on the flow itself but on the packet content/keys
>>>>> creating the flow.
>>>>> We might want to clarify this as the actual DP flow might not include.,
>>>>> for example, the IP fields.
>>>>
>>>> I guess it's ambiguous. I'll try and clean up the language. Because we
>>>> are filtering on the ODP flow key, and not an openflow string.
>>>>
>>>>>> +# -p VSWITCHD_PID, --pid VSWITCHD_PID
>>>>>> +# ovs-vswitchd's PID
>>>>>> +# -D [DEBUG], --debug [DEBUG]
>>>>>> +# Enable eBPF debugging
>>>>>> +#
>>>>>> +# Examples:
>>>>>> +#
>>>>>> +# To use the script on a running ovs-vswitchd to see flow keys and
>>>>>> expiration
>>>>>> +# events for flows with an ipv4 source of 192.168.10.10:
>>>>>> +# $ ./flow_reval_monitor.py --flow-keys --filter-flows \
>>>>>> +# "ipv4(src=192.168.10.10)"
>>>>>
>>>>> Can we add some details on what kind of filters/format is supported? For
>>>>> example no mask support.
>>>>
>>>> Sure, I can add it.
>>>>
>>>>>> +# TIME UFID
>>>>>> EVENT/REASON
>>>>>> +# 105082.457322742 ufid:f76fc899-376d-466b-bc74-0000b933eb97
>>>>>> flow_put
>>>>>> +# ufid:f76fc899-376d-466b-bc74-0000b933eb97 has the following flow
>>>>>> information:
>>>>>> +# in_port(2),
>>>>>> +# eth(src=0e:04:47:fc:74:51, dst=da:dc:c5:69:05:d7), \
>>>>>> +# eth_type(0x800), \
>>>>>> +# ipv4(src=192.168.10.10, dst=192.168.10.30, proto=1, tos=0,
>>>>>> ttl=64,[...]),
>>>>>> +# icmp(type=8, code=0)
>>>>>> +# 105092.635450202 ufid:f76fc899-376d-466b-bc74-0000b933eb97 Flow
>>>>>> timed out
>>>>>> +#
>>>>>> +# Notes:
>>>>>> +# 1) No options are needed to attach when there is a single running
>>>>>> instance
>>>>>> +# of ovs-vswitchd.
>>>>>> +# 2) If you're using the flow filtering option, it will only track
>>>>>> flows that
>>>>>> +# have been upcalled since the script began running.
>>>>>> +# 3) When using the flow filtering option, the key size will likely
>>>>>> need to
>>>>>> +# be expanded to match on all the fields in the message. The
>>>>>> default is
>>>>>> +# kept small to keep the buffer copy sizes down when displaying
>>>>>> +# flows (-k), but is hardcoded to 2048 when an actual filter (-l)
>>>>>> is
>>>>>> +# applied
>>>>>
>>>>> We should add a note that the flow_put part is not included when HW
>>>>> offload (TC) is used for the kernel datapath, or if DPDK is used.
>>>>
>>>> That makes sense. But we will still have a revalidator output in
>>>> f.e. the DPDK case, IIRC.
>>>
>>> True, I just want to make sure we are clear that we will not see the
>>> flow_put messages hence the filtering will not work.
>>
>> Well, filtering could still work. Just needs a tweak of the bpf code to
>> include the key_ptr and key_len data. Maybe we can still include it.
>> For now, we don't pass on the udpif's key data, even though we have it
>> available.
>
> The only problem here is performance, now we will send each deleted down,
> whereas, in his original design, we only sent it down if it was in the ukey
> table.
>
> Maybe we should keep it as is for now, and think about the TC and DPDK use
> case in a follow up patch?
That makes sense to me.
>>>>>>
>>>>>> +try:
>>>>>> + from bcc import BPF
>>>>>> + from bcc import USDT
>>>>>> + from bcc import USDTException
>>>>>> +except ModuleNotFoundError:
>>>>>> + print("ERROR: Can't find the BPF Compiler Collection Tools.")
>>>>>> + print("Please install them before running this script.")
>>>>>> + exit(1)
>>>>>> +
>>>>>> +import argparse
>>>>>> +from ipaddress import IPv4Address, IPv6Address
>>>>>> +import psutil
>>>>>> +import struct
>>>>>> +import sys
>>>>>> +import time
>>>>>> +
>>>>>> +#
>>>>>> +# eBPF source code
>>>>>> +#
>>>>>> +bpf_src = """
>>>>>> +#include <linux/sched.h>
>>>>>> +#include <uapi/linux/ptrace.h>
>>>>>> +
>>>>>> +#define MAX_KEY <MAX_KEY_VAL>
>>>>>> +#define FLOW_FILTER <FILTER_BOOL>
>>>>>> +
>>>>>> +enum probe { OP_FLOW_PUT, FLOW_RESULT };
>>>>>> +
>>>>>> +typedef union ovs_u128 {
>>>>>> + unsigned int ufid32[4];
>>>>>> + unsigned long long ufid64[2];
>>>>>> +} ovs_u128;
>>>>>> +
>>>>>> +struct dpif_flow_put {
>>>>>> + int flags;
>>>>>> + void *key_ptr;
>>>>>> + size_t key_len;
>>>>>> + void *mask_ptr;
>>>>>> + size_t mask_len;
>>>>>> + u64 action_ptr;
>>>>>> + size_t action_len;
>>>>>> + void *ufid_ptr;
>>>>>> +};
>>>>>> +
>>>>>> +struct udpif_key {
>>>>>> + void *cmap_node;
>>>>>> + void *key_ptr;
>>>>>> + size_t key_len;
>>>>>> + void *mask_ptr;
>>>>>> + size_t mask_len;
>>>>>> + ovs_u128 ufid;
>>>>>> +};
>>>>>> +
>>>>>> +struct event_t {
>>>>>> + u64 ts;
>>>>>> + u32 reason;
>>>>>> + u32 ufid[4]; /* Can't seem to make the ovs_u128 pass to python
>>>>>> side. */
>>>>>
>>>>> Is this still true?
>>>>
>>>> I didn't try it. Actually, I think these data structures can all be
>>>> extracted with pahole or something which converts the ovs_u128.
>>>>
>>>> Actually I think there's some converter under the hood and it doesn't
>>>> have a mapping of what 'ovs_u128' means. So we need to basically teach
>>>> it to make it work if we want that.
>>>
>>> Ok, not a blocking thing, just wondered if there was a quick fix or not. I
>>> thought it might be related to the BCC issue.
>>
>> No, not directly. There is a side effect of bcc relying on CTypes to do
>> the type mapping - and in that case, it won't know what ovs_u128 looks
>> like. It doesn't matter too much, because something like pahole will
>> spit out a struct that will be basic data types (so bcc will understand
>> it just fine).
>>
>> I don't think it makes sense to try and fix it up here.
>
> ACK
>
>>>>>> + u64 key_size;
>>>>>> + u8 key[MAX_KEY];
>>>>>> + enum probe probe;
>>>>>> +};
>>>>>> +
>>>>>> +BPF_HASH(watchlist, ovs_u128);
>>>>>> +BPF_RINGBUF_OUTPUT(events, <BUFFER_PAGE_COUNT>);
>>>>>> +
>>>>>> +int usdt__flow_result(struct pt_regs *ctx) {
>>>>>> + u64 *ufid_present = NULL;
>>>>>> + struct udpif_key ukey;
>>>>>> +
>>>>>> + bpf_usdt_readarg_p(3, ctx, &ukey, sizeof ukey);
>>>>>> + ovs_u128 ufid = ukey.ufid;
>>>>>> + ufid_present = watchlist.lookup(&ufid);
>>>>>> + if(FLOW_FILTER && !ufid_present) {
>>>>>> + return 0;
>>>>>> + }
>>>>>> +
>>>>>> + struct event_t *event = events.ringbuf_reserve(sizeof(struct
>>>>>> event_t));
>>>>>> + if(!event) {
>>>>>> + /* If we can't reserve the space in the ring buffer, return 1.
>>>>>> */
>>>>>
>>>>> See comments at the end regarding __sync_fetch_and_add().
>>>>>
>>>>>> + return 1;
>>>>>> + }
>>>>>> +
>>>>>> + event->probe = FLOW_RESULT;
>>>>>> + event->ts = bpf_ktime_get_ns();
>>>>>> + bpf_probe_read(&event->ufid, sizeof ufid, &ufid);
>>>>>> + bpf_usdt_readarg(1, ctx, &event->reason);
>>>>>> + events.ringbuf_submit(event, 0);
>>>>>> +
>>>>>> + return 0;
>>>>>> +};
>>>>>> +
>>>>>> +
>>>>>> +int usdt__op_flow_put(struct pt_regs *ctx) {
>>>>>> + struct dpif_flow_put put;
>>>>>> + ovs_u128 ufid;
>>>>>> +
>>>>>> + struct event_t *event = events.ringbuf_reserve(sizeof(struct
>>>>>> event_t));
>>>>>> + if(!event) {
>>>>>> + /* If we can't reserve the space in the ring buffer, return 1.
>>>>>> */
>>>>>
>>>>> See comments at the end regarding __sync_fetch_and_add().
>>>>>
>>>>>> + return 1;
>>>>>> + }
>>>>>> +
>>>>>> + event->probe = OP_FLOW_PUT;
>>>>>> + event->ts = bpf_ktime_get_ns();
>>>>>> + bpf_usdt_readarg_p(2, ctx, &put, sizeof put);
>>>>>> + bpf_probe_read(&event->ufid, sizeof event->ufid, put.ufid_ptr);
>>>>>> + bpf_probe_read(&ufid, sizeof ufid, &event->ufid);
>>>>>> + if (put.key_len > MAX_KEY) {
>>>>>> + put.key_len = MAX_KEY;
>>>>>> + }
>>>>>> + event->key_size = put.key_len;
>>>>>> + bpf_probe_read(&event->key, put.key_len, put.key_ptr);
>>>>>> + event->reason = 0;
>>>>>> + events.ringbuf_submit(event, 0);
>>>>>> +
>>>>>> + watchlist.increment(ufid);
>>>>>> + return 0;
>>>>>> +};
>>>>>> +"""
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# buffer_size_type()
>>>>>> +#
>>>>>> +def buffer_size_type(astr, min=64, max=2048):
>>>>>> + value = int(astr)
>>>>>> + if min <= value <= max:
>>>>>> + return value
>>>>>> + else:
>>>>>> + raise argparse.ArgumentTypeError(
>>>>>> + 'value not in range {}-{}'.format(min, max))
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# format_ufid()
>>>>>> +#
>>>>>> +def format_ufid(ufid):
>>>>>> + if ufid is None:
>>>>>> + return "ufid:none"
>>>>>> +
>>>>>> + return "ufid:{:08x}-{:04x}-{:04x}-{:04x}-{:04x}{:08x}".format(
>>>>>> + ufid[0], ufid[1] >> 16, ufid[1] & 0xffff,
>>>>>> + ufid[2] >> 16, ufid[2] & 0, ufid[3])
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# find_and_delete_from_watchlist()
>>>>>> +#
>>>>>> +def find_and_delete_from_watchlist(event):
>>>>>> + for k, _ in b["watchlist"].items():
>>>>>> + key_ufid = struct.unpack("=IIII", k)
>>>>>> + if key_ufid == tuple(event.ufid):
>>>>>> + key = (b["watchlist"].Key * 1)(k)
>>>>>> + b["watchlist"].items_delete_batch(key)
>>>>>> + break
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# handle_flow_put()
>>>>>> +#
>>>>>> +def handle_flow_put(event):
>>>>>> + if args.flow_keys or args.filter_flows is not None:
>>>>>> + key = decode_key(bytes(event.key)[:event.key_size])
>>>>>> + flow_dict, flow_str = parse_flow_dict(key)
>>>>>> + # For each attribute that we're watching.
>>>>>> + if args.filter_flows is not None:
>>>>>> + if not compare_flow_to_target(args.filter_flows, flow_dict):
>>>>>> + find_and_delete_from_watchlist(event)
>>>>>> + return
>>>>>> +
>>>>>> + print("{:<18.9f} {:<45} {:<13}".format(event.ts / 1000000000,
>>>>>> + format_ufid(event.ufid), "Insert (put) flow to kernel."))
>>>>>
>>>>> Maybe change this to “Insert (put) flow to kernel module.” to valid
>>>>> missing tc flow put?
>>>>
>>>> Ack.
>>>>
>>>>>> +
>>>>>> + if args.flow_keys:
>>>>>> + if len(flow_str) > 80:<
>>>>>> + flow_str = " " + "),\n ".join(flow_str.split("), “))<<
>>>>>> + else:
>>>>>> + flow_str = " " + flow_str
>>>>>> + print(" - It holds the following flow information:")
>>>>>
>>>>> This is confusing as, it’s not the flow information, i.e. flow installed,
>>>>> but the keys from the packet.
>>>>
>>>> Agreed.
>>>>
>>>>>> + print(flow_str)
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# compare_flow_to_target()
>>>>>> +#
>>>>>> +def compare_flow_to_target(target, flow):
>>>>>> + for key in target:
>>>>>> + if key not in flow:
>>>>>> + return False
>>>>>> + elif target[key] is True:
>>>>>> + continue
>>>>>> + elif target[key] == flow[key]:
>>>>>> + continue
>>>>>> + elif isinstance(target[key], dict) and isinstance(flow[key],
>>>>>> dict):
>>>>>> + return compare_flow_to_target(target[key], flow[key])
>>>>>> + else:
>>>>>> + return False
>>>>>> + return True
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# parse_flow_str()
>>>>>> +#
>>>>>> +def parse_flow_str(flow_str):
>>>>>> + f_list = [i.strip(", ") for i in flow_str.split(")")]
>>>>>> + if f_list[-1] == "":
>>>>>> + f_list = f_list[:-1]
>>>>>> + flow_dict = {}
>>>>>> + for e in f_list:
>>>>>> + split_list = e.split("(")
>>>>>> + k = split_list[0]
>>>>>> + if len(split_list) == 1:
>>>>>> + flow_dict[k] = True
>>>>>> + elif split_list[1].count("=") == 0:
>>>>>> + flow_dict[k] = split_list[1]
>>>>>> + else:
>>>>>> + sub_dict = {}
>>>>>> + sublist = [i.strip() for i in split_list[1].split(",")]
>>>>>> + for subkey in sublist:
>>>>>> + brk = subkey.find("=")
>>>>>> + sub_dict[subkey[:brk]] = subkey[brk + 1:]
>>>>>> + flow_dict[k] = sub_dict
>>>>>> + return flow_dict
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# print_expiration()
>>>>>> +#
>>>>>> +def print_expiration(event):
>>>>>> + reasons = ["Unknown flow expiration reason!", "Flow timed out",
>>>>>> + "Flow revalidation too expensive",
>>>>>> + "Flow needs narrower wildcard mask",
>>>>>> + "Bad ODP flow fit", "Flow with associated ofproto",
>>>>>> + "Flow translation error", "Flow cache avoidance",
>>>>>> + "Kill them all signal"]
>>>>>
>>>>> Should we maybe define this with something like this:
>>>>>
>>>>> Event = IntEnum("flow_del_reason", ["FDR_FLOW_LIVE",
>>>>> "FDR_FLOW_TIME_OUT",
>>>>> ...], start=0)
>>>>>
>>>>> If we do this, we can also use flow_del_reason.FDR_FLOW_LIVE below.
>>>>
>>>> I wrote a bit below, but I was wondering if there's really a better way
>>>> to do this like extracting the details from the code itself. But for
>>>> now, I can hard code something in there like is done in the other
>>>> revalidator script.
>>>
>>> Dont think we had scripts relying on OVS enums before. Not sure if pahole
>>> can extract this also.
>>
>> I guess I didn't read the other scripts well enough. Yes, seems there's
>> no good way to get this details.
>>
>>>>>> + ufid_str = format_ufid(event.ufid)
>>>>>> + reason = event.reason
>>>>>> +
>>>>>> + if reason not in range(0, len(reasons) - 1):
>>>>>> + reason = 0
>>>>>> + print("{:<18.9f} {:<45} {:<17}".
>>>>>> + format(event.ts / 1000000000, ufid_str, reasons[reason]))
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# decode_key()
>>>>>> +#
>>>>>> +def decode_key(msg):
>>>>>> + bytes_left = len(msg)
>>>>>> + result = {}
>>>>>> + while bytes_left:
>>>>>> + if bytes_left < 4:
>>>>>> + break
>>>>>> + nla_len, nla_type = struct.unpack("=HH", msg[:4])
>>>>>> + if nla_len < 4:
>>>>>> + break
>>>>>> + nla_data = msg[4:nla_len]
>>>>>> + trunc = False
>>>>>> + if nla_len > bytes_left:
>>>>>> + trunc = True
>>>>>> + nla_data = nla_data[:(bytes_left - 4)]
>>>>>
>>>>> Can we just not break out of this right away without doing
>>>>> the two above lines?
>>>>
>>>> I'll double check - I think I can rewrite this section a bit.
>>>>
>>>>>> + else:
>>>>>> + result[get_ovs_key_attr_str(nla_type)] = nla_data
>>>>>> + if trunc:
>>>>>> + break
>>>>>> + next_offset = (nla_len + 3) & (~3)
>>>>>> + msg = msg[next_offset:]
>>>>>> + bytes_left -= next_offset
>>>>>
>>>>> if bytes_left:
>>>>> “Can we report that our buffer was truncated?!”
>>>>>
>>>>> Not sure how to do this, but with 64 bytes being the default the -k
>>>>> option only showed in_port() which took me a while to figure out. Maybe
>>>>> 128 would be better when -k is configured?
>>>>
>>>> Good idea. Actually, I don't know if 64 bytes would ever really make
>>>> sense anyway because it doesn't allow much to include.
>>>
>>> Agreed, I think 128 sounds like a good middle ground, however, it will not
>>> decode ARP messages (have not tried ipv6). Maybe 64 is good enough if -k/-f
>>> is not supplied (I guess we can even set it to 0 without -k or -f).
>>
>> Exactly. It's an option to trim on how large the event size is, and
>> that means trimming on buffer space for the key data.
>>
>>>>>> + return result
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# get_ovs_key_attr_str()
>>>>>> +#
>>>>>> +def get_ovs_key_attr_str(attr):
>>>>>> + ovs_key_attr = ["OVS_KEY_ATTR_UNSPEC",
>>>>>> + "encap",
>>>>>> + "skb_priority",
>>>>>> + "in_port",
>>>>>> + "eth",
>>>>>> + "vlan",
>>>>>> + "eth_type",
>>>>>> + "ipv4",
>>>>>> + "ipv6",
>>>>>> + "tcp",
>>>>>> + "udp",
>>>>>> + "icmp",
>>>>>> + "icmpv6",
>>>>>> + "arp",
>>>>>> + "nd",
>>>>>> + "skb_mark",
>>>>>> + "tunnel",
>>>>>> + "sctp",
>>>>>> + "tcp_flags",
>>>>>> + "dp_hash",
>>>>>> + "recirc_id",
>>>>>> + "mpls",
>>>>>> + "ct_state",
>>>>>> + "ct_zone",
>>>>>> + "ct_mark",
>>>>>> + "ct_label",
>>>>>> + "ct_tuple4",
>>>>>> + "ct_tuple6",
>>>>>> + "nsh"]
>>>>>> +
>>>>>> + if attr < 0 or attr > len(ovs_key_attr):
>>>>>> + return "<UNKNOWN>: {}".format(attr)
>>>>>> + return ovs_key_attr[attr]
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# is_nonzero()
>>>>>> +#
>>>>>> +def is_nonzero(val):
>>>>>> + if isinstance(val, int):
>>>>>> + return (val != 0)
>>>>>> +
>>>>>> + if isinstance(val, str):
>>>>>> + val = bytes(val, "utf-8")
>>>>>> +
>>>>>> + # If it's not a string or an int, it's bytes.
>>>>>> + return (val.count(0) < len(val))
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# parse_flow_dict()
>>>>>> +#
>>>>>> +def parse_flow_dict(key_dict, decode=True):
>>>>>> + ret_str = ""
>>>>>> + parseable = {}
>>>>>> + skip = ["nsh", "tunnel", "mpls", "vlan"]
>>>>>> + need_byte_swap = ["ct_label"]
>>>>>> + ipv4addrs = ["ct_tuple4", "tunnel", "ipv4", "arp"]
>>>>>> + ipv6addrs = ["ipv6", "nd", "ct_tuple6"]
>>>>>> + macs = {"eth": [0, 1], "arp": [3, 4], "nd": [1, 2]}
>>>>>> + fields = [("OVS_KEY_ATTR_UNSPEC"),
>>>>>> + ("encap", ),
>>>>>> + ("skb_priority", "<I"),
>>>>>> + ("in_port", "<I"),
>>>>>> + ("eth", "!6s6s", "src", "dst"),
>>>>>> + ("vlan", ),
>>>>>> + ("eth_type", "!H"),
>>>>>> + ("ipv4", "!4s4s4B", "src", "dst", "proto", "tos", "ttl",
>>>>>> "frag"),
>>>>>> + ("ipv6", "!16s16s4s4B", "src", "dst",
>>>>>> + "label", "proto", "tclass", "hlimit", "frag"),
>>>>>> + ("tcp", "!2H", "src", "dst"),
>>>>>> + ("udp", "!2H", "src", "dst"),
>>>>>> + ("icmp", "!2B", "type", "code"),
>>>>>> + ("icmpv6", "!2B", "type", "code"),
>>>>>> + ("arp", "!4s4sH6s6s", "sip", "tip", "op", "sha", "tha"),
>>>>>> + ("nd", "!16s6s6s", "target", "sll", "tll"),
>>>>>> + ("skb_mark", "<I"),
>>>>>> + ("tunnel", ),
>>>>>> + ("sctp", "!2H", "src", "dst"),
>>>>>> + ("tcp_flags", "!H"),
>>>>>> + ("dp_hash", "<I"),
>>>>>> + ("recirc_id", "<I"),
>>>>>> + ("mpls", ),
>>>>>> + ("ct_state", "<I"),
>>>>>> + ("ct_zone", "<H"),
>>>>>> + ("ct_mark", "<I"),
>>>>>> + ("ct_label", "!16s"),
>>>>>> + ("ct_tuple4",
>>>>>> + "!4s4s2HB", "src", "dst", "tp_src", "tp_dst", "proto"),
>>>>>> + ("ct_tuple6",
>>>>>> + "!16s16sB2H", "src", "dst", "proto", "tp_src", "tp_dst"),
>>>>>> + ("nsh", )]
>>>>>> + for k, v in key_dict.items():
>>>>>> + s = ""
>>>>>> + if k in skip:
>>>>>> + continue
>>>>>> + if decode and int.from_bytes(v, "big") == 0:
>>>>>> + parseable[k] = "0"
>>>>>> + continue
>>>>>> + if decode and k in need_byte_swap:
>>>>>> + v = int.from_bytes(v, "little").to_bytes(len(v), "big")
>>>>>> + attr = -1
>>>>>> + found = False
>>>>>> + for f in fields:
>>>>>> + if k == f[0]:
>>>>>> + attr = fields.index(f)
>>>>>> + found = True
>>>>>> + break
>>>>>> + if not found:
>>>>>> + raise KeyError("Invalid flow field '%s'" % k)
>>>>>> + if decode and len(fields[attr]) > 1:
>>>>>> + data = list(struct.unpack(fields[attr][1],
>>>>>> + v[:struct.calcsize(fields[attr][1])]))
>>>>>> + if k in ipv4addrs:
>>>>>> + if data[0].count(0) < 4:
>>>>>> + data[0] = str(IPv4Address(data[0]))
>>>>>> + else:
>>>>>> + data[0] = b"\x00"
>>>>>> + if data[1].count(0) < 4:
>>>>>> + data[1] = str(IPv4Address(data[1]))
>>>>>> + else:
>>>>>> + data[1] = b"\x00"
>>>>>> + if k in ipv6addrs:
>>>>>> + if data[0].count(0) < 16:
>>>>>> + data[0] = str(IPv6Address(data[0]))
>>>>>> + else:
>>>>>> + data[0] = b"\x00"
>>>>>> + if data[1].count(0) < len(data[1]):
>>>>>> + data[1] = str(IPv6Address(data[1]))
>>>>>> + else:
>>>>>> + data[1] = b"\x00"
>>>>>> + if k in macs.keys():
>>>>>> + for e in macs[k]:
>>>>>> + if data[e].count(0) == 6:
>>>>>> + mac_str = b"\x00"
>>>>>> + else:
>>>>>> + mac_str = ":".join(["%02x" % i for i in
>>>>>> data[e]])
>>>>>> + data[e] = mac_str
>>>>>> + if decode and len(fields[attr]) > 2:
>>>>>> + field_dict = {field: d for field, d in
>>>>>> zip(fields[attr][2:], data)}
>>>>>> + s = ", ".join(k + "=" + str(v) for k, v in
>>>>>> field_dict.items())
>>>>>> + elif decode and k != "eth_type":
>>>>>> + s = str(data[0])
>>>>>> + field_dict = s
>>>>>> + else:
>>>>>> + if decode:
>>>>>> + s = hex(data[0])
>>>>>> + field_dict = s
>>>>>> + ret_str += k + "(" + s + "), "
>>>>>> + parseable[k] = field_dict
>>>>>> + ret_str = ret_str[:-2]
>>>>>> + return (parseable, ret_str)
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# handle_event()
>>>>>> +#
>>>>>> +def handle_event(ctx, data, size):
>>>>>> + # Once we grab the event, we have three cases.
>>>>>> + # 1. It's a revalidator probe and the reason is nonzero: A flow is
>>>>>> expiring
>>>>>> + # 2. It's a revalidator probe and the reason is zero: flow
>>>>>> revalidated
>>>>>> + # 3. It's a flow_put probe.
>>>>>> + #
>>>>>> + # We will ignore case 2, and report all others.
>>>>>> + #
>>>>>> + event = b["events"].event(data)
>>>>>> + if event.probe == 0: # OP_FLOW_PUT
>>>>>
>>>>> Here we should also define an enum for the probe events, see ‘Event =
>>>>> IntEnum("Event”...’ and ‘<EVENT_ENUM>’ in reval_monitor.py
>>>>>
>>>>>> + handle_flow_put(event)<
>>>>>> + elif event.probe == 1 and event.reason > 0: # FLOW_RESULT
>>>>>
>>>>> Here we could do “event.reason > flow_del_reason.FDR_FLOW_LIVE”, see
>>>>> comment above.
>>>>
>>>> I can do the above, but I also am wondering if it's possible to have
>>>> something we can use to fill up the enum dynamically without needing to
>>>> duplicate things on the python side.
>>>
>>> That would be nice, maybe pahole already supports this.
>>>
>>>>>> + print_expiration(event)
>>>>>> +
>>>>>> +
>>>>>> +def main():
>>>>>> + #
>>>>>> + # Don't like these globals, but ctx passing does not work with the
>>>>>> existing
>>>>>> + # open_ring_buffer() API :(
>>>>>> + #
>>>>>> + global b
>>>>>> + global args
>>>>>> +
>>>>>> + #
>>>>>> + # Argument parsing
>>>>>> + #
>>>>>> + parser = argparse.ArgumentParser()
>>>>>> + parser.add_argument("--buffer-page-count",
>>>>>> + help="Number of BPF ring buffer pages, default
>>>>>> 1024",
>>>>>> + type=int, default=1024, metavar="NUMBER")
>>>>>> + parser.add_argument("-f", "--flow-key-size",
>>>>>> + help="Set maximum flow key size to capture, "
>>>>>> + "default 64 - see notes", type=buffer_size_type,
>>>>>> + default=64, metavar="[64-2048]")
>>>>>> + parser.add_argument("-k", "--flow-keys",
>>>>>> + help="Print flow keys as flow strings",
>>>>>> + action="store_true")
>>>>>> + parser.add_argument("-l", "--filter-flows", metavar="FLOW_STRING",
>>>>>> + help="Filter flows that match the specified "
>>>>>> + "ODP-like flow",
>>>>>> + type=str, default=None, nargs="*")
>>>>>> + parser.add_argument("-p", "--pid", metavar="VSWITCHD_PID",
>>>>>> + help="ovs-vswitchd's PID", type=int,
>>>>>> default=None)
>>>>>> + parser.add_argument("-D", "--debug", help="Enable eBPF debugging",
>>>>>> + type=int, const=0x3f, default=0, nargs="?")
>>>>>> + args = parser.parse_args()
>>>>>> +
>>>>>> + #
>>>>>> + # Find the PID of the ovs-vswitchd daemon if not specified.
>>>>>> + #
>>>>>> + if args.pid is None:
>>>>>> + for proc in psutil.process_iter():
>>>>>> + if "ovs-vswitchd" in proc.name():
>>>>>> + if args.pid is not None:
>>>>>> + print("Error: Multiple ovs-vswitchd daemons
>>>>>> running, "
>>>>>> + "use the -p option!")
>>>>>> + sys.exit(-1)
>>>>>> +
>>>>>> + args.pid = proc.pid
>>>>>> + #
>>>>>> + # Error checking on input parameters
>>>>>> + #
>>>>>> + if args.pid is None:
>>>>>> + print("ERROR: Failed to find ovs-vswitchd's PID!")
>>>>>> + sys.exit(-1)
>>>>>> +
>>>>>> + #
>>>>>> + # Attach the USDT probes
>>>>>> + #
>>>>>> + u = USDT(pid=int(args.pid))
>>>>>> + try:
>>>>>> + u.enable_probe(probe="op_flow_put", fn_name="usdt__op_flow_put")
>>>>>> + except USDTException as e:
>>>>>> + print("Error attaching the dpif_netlink_operate__:op_flow_put
>>>>>> probe.")
>>>>>> + print(str(e))
>>>>>> + sys.exit(-1)
>>>>>> +
>>>>>> + try:
>>>>>> + u.enable_probe(probe="flow_result", fn_name="usdt__flow_result")
>>>>>> + except USDTException as e:
>>>>>> + print("Error attaching the revalidate:flow_result probe.")
>>>>>> + print(str(e))
>>>>>> + sys.exit(-1)
>>>>>> +
>>>>>> + #
>>>>>> + # Attach the probes to the running process
>>>>>> + #
>>>>>> + source = bpf_src.replace("<BUFFER_PAGE_COUNT>",
>>>>>> + str(args.buffer_page_count))
>>>>>> +
>>>>>> + if args.filter_flows is None:
>>>>>> + filter_bool = 0
>>>>>> +
>>>>>> + # Set the key size based on what the user wanted
>>>>>> + source = source.replace("<MAX_KEY_VAL>",
>>>>>> str(args.flow_key_size))
>>>>>> + else:
>>>>>> + filter_bool = 1
>>>>>> + args.filter_flows = parse_flow_str(args.filter_flows[0])
>>>>>> +
>>>>>> + # Run through the parser to make sure we only filter on fields
>>>>>> we
>>>>>> + # understand
>>>>>> + parse_flow_dict(args.filter_flows, False)
>>>>>> +
>>>>>> + # This is hardcoded here because it doesn't make sense to
>>>>>> shrink the
>>>>>> + # size, since the flow key might be missing fields that are
>>>>>> matched in
>>>>>> + # the flow filter.
>>>>>> + source = source.replace("<MAX_KEY_VAL>", "2048")
>>>>>> +
>>>>>> + source = source.replace("<FILTER_BOOL>", str(filter_bool))
>>>>>> +
>>>>>> + b = BPF(text=source, usdt_contexts=[u], debug=args.debug)
>>>>>> +
>>>>>> + #
>>>>>> + # Print header
>>>>>> + #
>>>>>> + print("{:<18} {:<45} {:<17}".format("TIME", "UFID", "EVENT/REASON"))
>>>>>> +
>>>>>> + #
>>>>>> + # Dump out all events.
>>>>>> + #
>>>>>> + b["events"].open_ring_buffer(handle_event)
>>>>>> + while 1:
>>>>>> + try:
>>>>>> + b.ring_buffer_poll()
>>>>>> + time.sleep(0.5)
>>>>>
>>>>> I think we can remove this sleep.
>>>>
>>>> I'll try without it. IIRC, the ring buffer polling was very aggressive
>>>> on the CPU, but that is just a memory from mid-2022.
>>>
>>> I got the ‘remove’ comment from Adrian also a while back and did some tests
>>> and I did not see any load increase on the Python application. But it might
>>> be worth it for you to do the same, you never know where I screwed up ;)
>>
>> Okay, I'll retry it and drop if it causes an issue.
>>
>>>>>> + except KeyboardInterrupt:
>>>>>> + break
>>>>>> +
>>>>>> +
>>>>>> +#
>>>>>> +# Start main() as the default entry point
>>>>>> +#<
>>>>>> +if __name__ == "__main__":
>>>>>> + main()
>>>>>> --
>>>>>> 2.41.0
>>>>>
>>>>> Missing my previous comment on adding a check to make sure we do not lose
>>>>> events:
>>>>>
>>>>> “
>>>>> Forgot to mention that you probably also want to add some checking to
>>>>> make sure you do not lose events.
>>>>>
>>>>> See __sync_fetch_and_add() below:
>>>>>
>>>>> +BPF_TABLE("percpu_array", uint32_t, uint64_t, dropcnt, 1);
>>>>> +
>>>>> +static struct event_t *get_event(uint32_t id) {
>>>>> + struct event_t *event = events.ringbuf_reserve(sizeof(struct
>>>>> event_t));
>>>>> +
>>>>> + if (!event) {
>>>>> + uint32_t type = 0;
>>>>> + uint64_t *value = dropcnt.lookup(&type);
>>>>> + if (value)
>>>>> + __sync_fetch_and_add(value, 1);
>>>>> +
>>>>> + return NULL;
>>>>> + }
>>>>> +
>>>>> + event->id = id;
>>>>> + event->ts = bpf_ktime_get_ns();
>>>>> + event->pid = bpf_get_current_pid_tgid();
>>>>> +
>>>>> + return event;
>>>>> +}
>>>>> “
>>>>>
>>>>> The other missing part is to include the PID/TID in the output so we can
>>>>> relate to which revalidator thread did this (or add the comm with the
>>>>> name).
>>>>
>>>> Okay.
>>>>
>>>>> And finally, the part that got this patch delayed, not adding static OVS
>>>>> structure definitions. Which is still the case in this version. For now,
>>>>> you should probably copy the get_ovs_definitions() implementation from
>>>>> reval_monitor.py.
>>>>
>>>> Will do.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev