On 6/9/26 9:53 AM, Eelco Chaudron wrote:
>
>
> On 8 Jun 2026, at 20:42, Ilya Maximets wrote:
>
>> The BFD decay test disables BFD on one of the ports, making both sides
>> go Down. Then it re-enables BFD and expects them to be Up within
>> 1.5 seconds. This seems reasonable given the 300-500 ms configured
>> timings. However, while not in the Up state, the minimum transmit
>> interval is raised to at least 1,000,000 microseconds, according to
>> RFC 5880 Section 6.8.3:
>>
>> When bfd.SessionState is not Up, the system MUST set
>> bfd.DesiredMinTxInterval to a value of not less than one second
>> (1,000,000 microseconds). This is intended to ensure that the
>> bandwidth consumed by BFD sessions that are not Up is negligible,
>> particularly in the case where a neighbor may not be running BFD.
>>
>> This is correctly implemented in the bfd_min_tx() function.
>>
>> Since both sides are not Up, it takes at least two round trips for the
>> states to converge. Transmission intervals are jittered by up to 25%,
>> so each message takes at least 750 ms, i.e., at least 1500 ms in total,
>> if we're very lucky.
>>
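A minimal sketch of the timing arithmetic above, not code from OVS. It
assumes the jitter only ever shortens the interval by 0 to 25% (as RFC
5880 Section 6.8.7 describes) and models convergence as two back-to-back
transmit intervals, which is a simplification of the exchange described
in the commit message. The function and constant names are hypothetical.

```python
# Illustrative only: names below are hypothetical, not OVS identifiers.
MIN_TX_NOT_UP_MS = 1000    # RFC 5880 6.8.3 floor while the session is not Up
MAX_JITTER = 0.25          # interval is reduced by a random 0..25%
MESSAGES_TO_CONVERGE = 2   # assumption taken from the commit message


def convergence_bounds_ms(min_tx_ms=MIN_TX_NOT_UP_MS,
                          max_jitter=MAX_JITTER,
                          messages=MESSAGES_TO_CONVERGE):
    """Return (fastest, slowest) time in ms to send `messages` packets."""
    fastest = messages * min_tx_ms * (1 - max_jitter)  # maximum jitter each time
    slowest = messages * min_tx_ms                     # no jitter at all
    return fastest, slowest


if __name__ == "__main__":
    fastest, slowest = convergence_bounds_ms()
    old_wait_ms = 1500
    print(f"fastest={fastest} ms, slowest={slowest} ms")
    print(f"old wait covers the slowest case: {old_wait_ms >= slowest}")
```

Under these assumptions, a 1500 ms wait only covers the luckiest draw,
while the transmit intervals alone can take up to 2000 ms, before any
processing overhead is counted.
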
>> There is extra overhead in the test due to execution of the unixctl
>> commands, actual packet processing, and the time it takes to execute
>> the next checks. That seems to push the timing a little and make the
>> overall wait of just 1500 ms enough for the test to pass. However,
>> if the randomness is not in our favor, it may not be enough. Ideally,
>> we need at least 2000 ms, or better 2500 ms, to be sure that all
>> exchanges are complete and the states are properly set. To be safe,
>> it might be better to use even 3500 ms.
>>
>> 3500 ms should not be enough to trigger decay, as state changes reset
>> the decay timer. So, increasing the wait times this way should not
>> affect the later checks.
>>
>> Without this change, the BFD decay test fails on my laptop in ~3% of
>> the cases. With this change, I was not able to reproduce the failure
>> after 1500 iterations.
>>
>> We see occasional failures of this test in our CI, but they are mostly
>> covered by the automatic re-check. It's rare to see the test fail
>> twice in a row to trigger the full CI failure, but it definitely does
>> happen from time to time. The failures tend to be more frequent on
>> other architectures like arm or s390. This test has been flaky for as
>> long as I can remember working on OVS.
>>
>> I'm not sure if this change covers all the failures of this particular
>> test, but it definitely covers a lot of them.
>>
>> Fixes: c1c4e8c76912 ("bfd: Implement BFD decay.")
>> Signed-off-by: Ilya Maximets <[email protected]>
>
> Thanks for fixing this Ilya. The change looks good to me.
>
> Acked-by: Eelco Chaudron <[email protected]>
>
Thanks! Applied to all supported branches.
Best regards, Ilya Maximets.