On 6/9/26 9:53 AM, Eelco Chaudron wrote:
> 
> 
> On 8 Jun 2026, at 20:42, Ilya Maximets wrote:
> 
>> The BFD decay test disables BFD on one of the ports, causing both sides
>> to go Down.  Then it re-enables BFD and expects them to be Up within
>> 1.5 seconds.  This seems reasonable given the 300-500 ms configured
>> timings.  However, while not in the Up state, the minimal transmission
>> time is increased to be at least 1,000,000 microseconds, according to
>> RFC 5880 Section 6.8.3:
>>
>>    When bfd.SessionState is not Up, the system MUST set
>>    bfd.DesiredMinTxInterval to a value of not less than one second
>>    (1,000,000 microseconds).  This is intended to ensure that the
>>    bandwidth consumed by BFD sessions that are not Up is negligible,
>>    particularly in the case where a neighbor may not be running BFD.
>>
>> This is correctly implemented in the bfd_min_tx() function.
>>
>> Since both sides are not Up, it takes at least two round trips for the
>> states to converge.  There is a 25% randomness baked into the messages,
>> so it is at least 750 ms per message, i.e., at least 1500 ms total, if
>> we're very lucky.
>>
>> There is extra overhead in the test due to execution of the unixctl
>> commands, actual packet processing, and the time it takes to execute
>> the next checks.  That seems to push the timing a little and make the
>> overall wait of just 1500 ms enough for the test to pass.  However,
>> if the randomness is not in our favor, it may not be enough.  Ideally,
>> we need at least 2000 ms, or better 2500 ms, to be sure that all
>> exchanges are complete and the states are properly set.  To be safe,
>> it might be better to use even 3500 ms.
>>
>> 3500 ms should not be enough to trigger decay, as state changes reset
>> the decay timer.  So, increasing the wait times this way should not
>> affect the later checks.
>>
>> Without this change, the BFD decay test fails on my laptop in ~3% of
>> the cases.  With this change, I was not able to reproduce the failure
>> after 1500 iterations.
>>
>> We see occasional failures of this test in our CI, but they are mostly
>> covered by the automatic re-check.  It's rare to see the test fail
>> twice in a row to trigger the full CI failure, but it definitely does
>> happen from time to time.  The failures tend to be more frequent on
>> other architectures like arm or s390.  This test has been flaky for as
>> long as I can remember working on OVS.
>>
>> I'm not sure if this change covers all the failures of this particular
>> test, but it definitely covers a lot of them.
>>
>> Fixes: c1c4e8c76912 ("bfd: Implement BFD decay.")
>> Signed-off-by: Ilya Maximets <[email protected]>
> 
> Thanks for fixing this, Ilya. The change looks good to me.
> 
> Acked-by: Eelco Chaudron <[email protected]>
> 
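
For reference, the timing argument in the commit message can be sketched
as a small calculation.  This is only an illustration, not code from the
OVS tree: the constant and function names are hypothetical, and the
values are the ones quoted above (1 second minimum while not Up, up to
25% jitter, two exchanges to converge).  Test overhead is ignored.

```python
# Hypothetical names; values are taken from the commit message and
# RFC 5880 Section 6.8.3 as quoted there.
NOT_UP_MIN_TX_MS = 1000    # minimum TX interval while the session is not Up
MAX_JITTER = 0.25          # interval may be shortened by up to 25%
EXCHANGES_TO_CONVERGE = 2  # "at least two round trips" per the commit message


def convergence_bounds_ms(min_tx_ms=NOT_UP_MIN_TX_MS,
                          jitter=MAX_JITTER,
                          exchanges=EXCHANGES_TO_CONVERGE):
    """Return (luckiest, no_jitter) convergence time in milliseconds."""
    luckiest = exchanges * min_tx_ms * (1 - jitter)  # every interval cut by 25%
    no_jitter = exchanges * min_tx_ms                # no interval shortened
    return luckiest, no_jitter


luckiest, no_jitter = convergence_bounds_ms()
print(f"luckiest case: {luckiest:.0f} ms")
print(f"no jitter:     {no_jitter:.0f} ms")
```

Under these assumptions the old 1500 ms wait only matches the luckiest
case, which is consistent with the occasional failures described above.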

Thanks!  Applied to all supported branches.

Best regards, Ilya Maximets.
_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
