On 03/06/2024 16:29, Eelco Chaudron wrote:
> 
> 
> On 3 Jun 2024, at 10:07, Roi Dayan wrote:
> 
>> On 03/06/2024 10:18, Roi Dayan wrote:
>>>
>>>
>>> On 30/05/2024 18:48, Eelco Chaudron wrote:
>>>>
>>>>
>>>> On 23 May 2024, at 12:46, Roi Dayan via dev wrote:
>>>>
>>>>> It is observed in some environments that there are much more ukeys than
>>>>> actual DP flows. For example:
>>>>>
>>>>> $ ovs-appctl upcall/show
>>>>> system@ovs-system:
>>>>> flows : (current 7) (avg 6) (max 117) (limit 2125)
>>>>> offloaded flows : 525
>>>>> dump duration : 1063ms
>>>>> ufid enabled : true
>>>>>
>>>>> 23: (keys 3612)
>>>>> 24: (keys 3625)
>>>>> 25: (keys 3485)
>>>>>
>>>>> The revalidator threads are busy revalidating the stale ukeys leading to
>>>>> high CPU and long dump duration.
>>>>>
>>>>> There are some possible situations that may result in stale ukeys that
>>>>> have no corresponding DP flows.
>>>>>
>>>>> In revalidator, push_dp_ops() doesn't check error if the op type is not
>>>>> DEL. It is possible that a PUT(MODIFY) fails, especially for tc offload
>>>>> case, where the old flow is deleted first and then the new one is
>>>>> created. If the creation fails, the ukey will be stale (no corresponding
>>>>> DP flow). This patch adds a warning in such case.
>>>>
>>>> Not sure I understand, this behavior should be captured by the 
>>>> UKEY_INCONSISTENT state.
>>>
>>> Hi Eelco,
>>>
>>> thanks for reviewing.
>>> We started the patch on older branch as we didn't rebase for some time
>>> and got a little later on sending it.
>>> I see the addition now for UKEY_INCOSISTENT in the following patch:
>>>
>>> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors.
>>>
>>> Looks like it handles the same situation we tried to handle in this patch.
>>>
>>>>
>>>>> Another possible scenario is in handle_upcalls() if a PUT operation did
>>>>> not succeed and op->error attribute was not set correctly it can lead to
>>>>> stale ukey in operational state.
>>>>
>>>>
>>>> Guess we need to figure out these cases, as these are the root cause of 
>>>> your problem.
>>>
>>> right. maybe. but this could help keep the system alive/clean while logging 
>>> the
>>> bad situation instead of having increase in those stale/inconsistent ukeys.
>>> I understand if it's not nice to handle it like this.
>>>
>>
>> Hi Eelco,
>>
>> I remember now one of the reproduction scenarios we did is do some traffic
>> to get rules added using TC and then delete those from tc command line
>> and it got to stale ukeys.
>> The revalidator dump thread not seeing the rules so not updating anything
>> and sweep over the ukeys skipping them as well.
>> This is why we checked against the timing stats of the ukey.
>>
>> I remember I tested on the upstream code and not our development branch
>> and it reproduces. I didn't notice if the commit adding UKEY_INCONSISTENT
>> existed but it handles errors from adding flows so I dont think it matters.
>>
>> I'll try to check and verify again but I think it's still here.
>> So while cases getting dop.error handled with UKEY_INCONSISTENT,
>> this case I think is not.
> 
> I think you are right this case is not covered by the UKEY_INCONSISTENT test 
> below. See my suggestion on fixing this in revalidate_ukey().
> 
> Maybe you could also add a test case for this in the offload test suite.
> 
> //Eelco

Hi Eelco,

Thanks for the review. I didn't have time to check this but wanted to
reply that it's in my queue to verify your suggestion and add a test.

Thanks,
Roi

> 
>>>>> This patch adds checks in the sweep phase for such ukeys and move them
>>>>> to DELETE so that they can be cleared eventually.
>>>>>
>>>>> Co-authored-by: Han Zhou <[email protected]>
>>>>> Signed-off-by: Han Zhou <[email protected]>
>>>>> Signed-off-by: Roi Dayan <[email protected]>
>>>>> ---
>>>>>  ofproto/ofproto-dpif-upcall.c | 15 +++++++++++++++
>>>>>  1 file changed, 15 insertions(+)
>>>>>
>>>>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
>>>>> index 83609ec62b63..e9520ebdf910 100644
>>>>> --- a/ofproto/ofproto-dpif-upcall.c
>>>>> +++ b/ofproto/ofproto-dpif-upcall.c
>>>>> @@ -57,6 +57,7 @@ COVERAGE_DEFINE(dumped_inconsistent_flow);
>>>>>  COVERAGE_DEFINE(dumped_new_flow);
>>>>>  COVERAGE_DEFINE(handler_duplicate_upcall);
>>>>>  COVERAGE_DEFINE(revalidate_missed_dp_flow);
>>>>> +COVERAGE_DEFINE(revalidate_missed_dp_flow_del);
>>>>>  COVERAGE_DEFINE(ukey_dp_change);
>>>>>  COVERAGE_DEFINE(ukey_invalid_stat_reset);
>>>>>  COVERAGE_DEFINE(ukey_replace_contention);
>>>>> @@ -278,6 +279,7 @@ enum flow_del_reason {
>>>>>      FDR_BAD_ODP_FIT,        /* Bad ODP flow fit. */
>>>>>      FDR_FLOW_IDLE,          /* Flow idle timeout. */
>>>>>      FDR_FLOW_LIMIT,         /* Kill all flows condition reached. */
>>>>> +    FDR_FLOW_STALE,         /* Flow stale detected. */
>>>>
>>>> Please also update the associated script, see above comment.
>>>>
>>>>>      FDR_FLOW_WILDCARDED,    /* Flow needs a narrower wildcard mask. */
>>>>>      FDR_NO_OFPROTO,         /* Bridge not found. */
>>>>>      FDR_PURGE,              /* User requested flow deletion. */
>>>>> @@ -2557,6 +2559,10 @@ push_dp_ops(struct udpif *udpif, struct ukey_op 
>>>>> *ops, size_t n_ops)
>>>>
>>>>
>>>> The (op->dop.error) condition below will never be reached, as it’s 
>>>> explicitly checked above. So the error message will never happen. The 
>>>> condition you explain is already covered by UKEY_INCONSISTENT. Not sure if 
>>>> we need a log message for this either at warn level, maybe debug, 
>>>> especially as you say it can happen often.
>>>>
>>>
>>> right. as replied above. I didn't notice this when I did the cherry-pick as 
>>> git didn't
>>> show me a conflict in this area at all.
>>> This commit was done before the change so we did hit this spot. we got 
>>> little off
>>> checking the commit and sending it on time for review. sorry about that.
>>> cf11766cbcf1 ofproto-dpif-upcall: Fix push_dp_ops to handle all errors.
>>>
>>> I think we should redo our checks with the commit adding the 
>>> UKEY_INCONSISTENT
>>> state and see our result again.
>>>
>>> thanks for the review.
>>>
>>>
>>>>>          if (op->dop.type != DPIF_OP_FLOW_DEL) {
>>>>>              /* Only deleted flows need their stats pushed. */
>>>>> +            if (op->dop.error) {
>>>>> +                VLOG_WARN_RL(&rl, "push_dp_ops: error %d in op type %d, 
>>>>> ukey "
>>>>> +                             "%p", op->dop.error, op->dop.type, 
>>>>> op->ukey);
>>>>> +            }
>>>>>              continue;
>>>>>          }
>>>>>
>>>>> @@ -3027,6 +3033,15 @@ revalidator_sweep__(struct revalidator 
>>>>> *revalidator, bool purge)
>>>>>                      del_reason = purge ? FDR_PURGE : FDR_UPDATE_FAIL;
>>>>>                  } else if (!seq_mismatch) {
>>>>>                      result = UKEY_KEEP;
>>>>> +                } else if (!ukey->stats.used &&
>>>>> +                           udpif_flow_time_delta(udpif, ukey) * 1000 >
>>>>> +                           ofproto_max_idle) {
>>>>
>>>> In theory, this is a missed dp flow, i.e. the same as the else {} case 
>>>> below.
>>>> Maybe we should change revalidate_ukey() to have a test for this?
>>>>
>>>>
>>>> Maybe something like this (not tested):
>>>>
>>>> diff --git a/ofproto/ofproto-dpif-upcall.c b/ofproto/ofproto-dpif-upcall.c
>>>> index e9520ebdf..c34978fc9 100644
>>>> --- a/ofproto/ofproto-dpif-upcall.c
>>>> +++ b/ofproto/ofproto-dpif-upcall.c
>>>> @@ -279,7 +279,7 @@ enum flow_del_reason {
>>>>      FDR_BAD_ODP_FIT,        /* Bad ODP flow fit. */
>>>>      FDR_FLOW_IDLE,          /* Flow idle timeout. */
>>>>      FDR_FLOW_LIMIT,         /* Kill all flows condition reached. */
>>>> -    FDR_FLOW_STALE,         /* Flow stale detected. */
>>>> +    FDR_FLOW_NO_STATS_IDLE, /* Flow idled out without receiving 
>>>> statistics. */
>>>>      FDR_FLOW_WILDCARDED,    /* Flow needs a narrower wildcard mask. */
>>>>      FDR_NO_OFPROTO,         /* Bridge not found. */
>>>>      FDR_PURGE,              /* User requested flow deletion. */
>>>> @@ -2452,7 +2452,14 @@ revalidate_ukey(struct udpif *udpif, struct 
>>>> udpif_key *ukey,
>>>>          log_unexpected_stats_jump(ukey, stats);
>>>>      }
>>>>
>>>> -    if (need_revalidate) {
>>>> +    if (!ukey->stats.used
>>>> +        && ukey->created < udpif->dpif->current_ms - ofproto_max_idle) {
>>>> +        /* If the flow has a 'used' value of 0, meaning no stats were 
>>>> received
>>>> +         * for this flow, and the configured idle time has elapsed, it 
>>>> might
>>>> +         * indicates a stale flow (i.e., a flow without an installed 
>>>> datapath
>>>> +         * rule). In this case, it is safe to mark this ukey for 
>>>> deletion. */
>>>> +        *del_reason = FDR_FLOW_NO_STATS_IDLE;
>>>> +    } else if (need_revalidate) {
>>>>          if (should_revalidate(udpif, ukey, push.n_packets)) {
>>>>              if (!ukey->xcache) {
>>>>                  ukey->xcache = xlate_cache_new();
>>>>
>>>>
>>>> Others, thoughts?
>>>>
>>>>> +                    COVERAGE_INC(revalidate_missed_dp_flow_del);
>>>>> +                    VLOG_WARN_RL(&rl, "revalidator_sweep__: Remove stale 
>>>>> ukey "
>>>>> +                                 "%p delta %llus", ukey,
>>>>> +                                 udpif_flow_time_delta(udpif, ukey));
>>>>> +                    result = UKEY_DELETE;
>>>>> +                    del_reason = FDR_FLOW_STALE;
>>>>>                  } else {
>>>>>                      struct dpif_flow_stats stats;
>>>>>                      COVERAGE_INC(revalidate_missed_dp_flow);
>>>>> -- 
>>>>> 2.21.0
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> [email protected]
>>>>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>>>
>>>
> 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Reply via email to