On 8 Jun 2026, at 20:42, Ilya Maximets wrote:
> The BFD decay test disables BFD on one of the ports, making both sides
> go Down. Then it re-enables BFD and expects them to be Up within
> 1.5 seconds. This seems reasonable given the 300-500 ms configured
> timings. However, while not in the Up state, the minimal transmission
> time is increased to be at least 1,000,000 microseconds, according to
> RFC 5880 Section 6.8.3:
>
>     When bfd.SessionState is not Up, the system MUST set
>     bfd.DesiredMinTxInterval to a value of not less than one second
>     (1,000,000 microseconds). This is intended to ensure that the
>     bandwidth consumed by BFD sessions that are not Up is negligible,
>     particularly in the case where a neighbor may not be running BFD.
>
> And this is correctly implemented in the bfd_min_tx() function.
>
> Since both sides are not Up, it takes at least two round trips for the
> states to converge. There is a 25% randomness baked into the messages,
> so it is at least 750 ms per message, i.e., at least 1500 ms total, if
> we're very lucky.
>
> There is extra overhead in the test due to execution of the unixctl
> commands, actual packet processing, and the time it takes to execute
> the next checks. That seems to push the timing a little and make the
> overall wait of just 1500 ms enough for the test to pass. However,
> if the randomness is not in our favor, it may not be enough. Ideally,
> we need at least 2000 ms, or better 2500 ms, to be sure that all
> exchanges are complete and the states are properly set. To be safe,
> it might be better to use even 3500 ms.
>
> 3500 ms should not be enough to trigger decay, as state changes reset
> the decay timer. So, increasing the wait times this way should not
> affect the later checks.
>
> Without this change, the BFD decay test fails on my laptop in ~3% of
> the cases. With this change, I was not able to reproduce the failure
> after 1500 iterations.
>
> We see occasional failures of this test in our CI, but they are mostly
> covered by the automatic re-check. It's rare to see the test fail
> twice in a row to trigger the full CI failure, but it definitely does
> happen from time to time. The failures tend to be more frequent on
> other architectures, such as arm or s390. This test has been flaky for
> as long as I can remember working on OVS.
>
> I'm not sure if this change covers all the failures of this particular
> test, but it definitely covers a lot of them.
>
> Fixes: c1c4e8c76912 ("bfd: Implement BFD decay.")
> Signed-off-by: Ilya Maximets <[email protected]>
Thanks for fixing this, Ilya. The change looks good to me.
Acked-by: Eelco Chaudron <[email protected]>
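
For anyone following along, the timing argument in the commit message can
be checked with a rough sketch like the one below. It is not OVS code: the
function name and parameters are hypothetical, and it assumes convergence
costs two transmit intervals, each shortened by up to 25% jitter, with
test and packet-processing overhead ignored.

```python
# Rough sketch of the commit message's arithmetic, not OVS code.
# All names here are hypothetical and chosen for illustration only.

MIN_TX_NOT_UP_MS = 1000   # RFC 5880 6.8.3: at least 1 s while not Up
MAX_JITTER_PCT = 25       # up to 25% reduction per transmitted message
INTERVALS_TO_CONVERGE = 2 # assumption: two intervals until both sides are Up


def convergence_bounds_ms(min_tx_ms=MIN_TX_NOT_UP_MS,
                          jitter_pct=MAX_JITTER_PCT,
                          intervals=INTERVALS_TO_CONVERGE):
    """Return (best_case_ms, worst_case_ms), ignoring processing overhead."""
    # Best case: every interval is shortened by the maximum jitter.
    best = intervals * min_tx_ms * (100 - jitter_pct) // 100
    # Worst case: no interval is shortened at all.
    worst = intervals * min_tx_ms
    return best, worst


best_ms, worst_ms = convergence_bounds_ms()
print(best_ms, worst_ms)
```

Under these assumptions the old 1500 ms wait only matches the best case,
while the new 3500 ms wait leaves headroom above the no-jitter case.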