On 6/17/26 6:03 PM, Xavier Simonart wrote:
> Hi Ilya,
> 
> Thanks for the review.
> 
> On Tue, Jun 16, 2026 at 6:30 PM Ilya Maximets <[email protected]> wrote:
> 
>     On 6/16/26 5:46 PM, Xavier Simonart via dev wrote:
>     > As in [0], multiple load balancing system tests fail randomly from
>     > time to time because they check that, after 10 or 20 requests sent to
>     > the load balancer, every backend has been reached at least once.
>     > Statistically, this check is expected to fail occasionally.
>     > [1] fixed such issues, but there are new occurrences.
>     > If we do not get the expected distribution after 10 requests, we
>     > send 10 more requests. We do that up to 30 times.
> 
>     Hi, Xavier.  Are you sure this is what is happening here?
>     The chance that all 20 requests are sent to the same backend
>     is supposed to be about 1 in 2^20, which is very small, so
>     it should not really happen in practice.  Maybe there is
>     a different reason here after all?  How frequently do you see
>     the test failures?
> 
> It did not happen when we send 20 requests, but it did occur in a test
> where we "only" send 10 requests (see the fix around line 17733), and
> we can see what is happening in the tcpdumps.
> I then changed all occurrences of that same pattern.
> However, I agree that with 20 requests the probability becomes really low.
> With 10 requests it happens more often than we might think: we have
> roughly two patches a day, ovs-robot runs the system tests 4x (gcc, clang,
> userspace, dpdk), and we have roughly 40 occurrences of this pattern in
> the system tests. So we run through this ~300 times per day...

Yeah, I agree that 10 is indeed too low.
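
The numbers discussed above can be sanity-checked with a short calculation. This is only a sketch under stated assumptions: two backends, each request choosing a backend uniformly and independently, and the ~300 checks per day quoted above. The function name and the constant are illustrative and do not come from the test suite.

```python
from math import comb


def p_backend_missed(requests: int, backends: int) -> float:
    """Probability that at least one backend receives no request.

    Assumes each request picks a backend uniformly and independently
    (an assumption; real hashing may not behave exactly like this).
    Uses inclusion-exclusion over the set of backends that are missed.
    """
    return sum(
        (-1) ** (j + 1) * comb(backends, j) * ((backends - j) / backends) ** requests
        for j in range(1, backends + 1)
    )


# Figure quoted in the thread: roughly 300 executions of the pattern per day.
RUNS_PER_DAY = 300

for n in (10, 20, 25, 30):
    p = p_backend_missed(n, backends=2)  # two backends is an assumption
    per_year = p * RUNS_PER_DAY * 365
    print(f"{n:2d} requests: p(miss) = {p:.3g}, expected failures/year = {per_year:.3g}")
```

Under these assumptions a check with 10 requests misses a backend with probability 1/512, so at ~300 checks per day a spurious failure is expected more often than every other day. With 20 requests the probability drops to 1/524288, which is well below one expected failure per year.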

> 
> 
>     >
>     > [0] https://github.com/ovsrobot/ovn/actions/runs/27547031217/job/81423590350
>     > [1] c906da4f1dea: tests: Fixed load balancing system-tests
>     >
>     > Fixes: 40a686e8e70f ("Add IPv6 support for lb health-check")
>     > Fixes: 33cfa4655fd7 ("tests: Move SCTP test from kernel only to general OVN system tests.")
>     > Fixes: da5529438342 ("northd: Do not drop ip traffic with destination vip expressed via template vars.")
>     > Signed-off-by: Xavier Simonart <[email protected]>
>     > ---
>     >  tests/system-ovn.at | 84 +++++++++++++++++++++------------------------
>     >  1 file changed, 39 insertions(+), 45 deletions(-)
>     >
>     > diff --git a/tests/system-ovn.at b/tests/system-ovn.at
>     > index 35df0ec2f..2cadbc6a7 100644
>     > --- a/tests/system-ovn.at
>     > +++ b/tests/system-ovn.at
>     > @@ -5143,15 +5143,15 @@ OVS_WAIT_UNTIL(
>     >  )
>     > 
>     >  # From sw0-p2 send traffic to vip - 2001::a
>     > -for i in `seq 1 20`; do
>     > -    echo Request $i
>     > -    ovn-sbctl list service_monitor
>     > -    NS_CHECK_EXEC([sw0-p2], [wget http://[[2001::a]] -t 5 -T 1 --retry-connrefused -v -o wget$i.log])
>     > -done
>     > +OVS_WAIT_FOR_OUTPUT([
>     > +    for i in `seq 1 20`; do
>     > +        ovn-sbctl list service_monitor >> service_monitor.log
>     > +        NS_EXEC([sw0-p2], [wget http://[[2001::a]] -t 5 -T 1 --retry-connrefused -v -o wget$i.log])
> 
>     I don't think it is a good idea to replace NS_CHECK_EXEC
>     with a plain NS_EXEC.  As explained in commit:
>       b087f2556514 ("tests: system-ovn: Fix force SNAT IP in load-balancer template test.")
>     it will take forever for this test to fail if there is an actual
>     issue in the pipeline and the packets are not delivered / conntrack
>     entries are not created.  It will take about 2.5 hours for the test
>     to actually fail, IIUC.  We should not have that.
> 
> I do not think that we can run NS_CHECK_EXEC within OVS_WAIT_FOR_OUTPUT.
> So, instead, I could simply ensure that we send 20 requests (i.e. only change
> the test which sends 10 for now). This should be enough to reduce the number
> of failures to less than one per year, and we can keep NS_CHECK_EXEC.
> I'll send v2.

With NS_CHECK_EXEC we could even do 30, I guess, or 25, if that is not
too slow on the happy path.  But we do need the CHECK.
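
For reference, a back-of-the-envelope sketch of how an unchecked retry loop could reach the 2.5 hours mentioned above. It is one possible reading, not a measurement. It assumes that every wget attempt times out, that wget waits 1 s, 2 s, 3 s, ... between retries (its documented linear backoff, capped by --waitretry, 10 s by default), and that the loop runs 30 rounds of 20 requests as described in the commit message. The function names are illustrative.

```python
def wget_worst_case_seconds(tries: int, timeout_s: float, waitretry_cap: int = 10) -> float:
    """Upper bound for one wget call when every attempt times out.

    Assumption: linear backoff between attempts (1 s, 2 s, ...), capped
    at waitretry_cap seconds, plus timeout_s spent on each attempt.
    """
    backoff = sum(min(i, waitretry_cap) for i in range(1, tries))
    return tries * timeout_s + backoff


def loop_worst_case_seconds(rounds: int, requests_per_round: int,
                            tries: int, timeout_s: float) -> float:
    """Total time if no request ever succeeds and nothing aborts the loop early."""
    return rounds * requests_per_round * wget_worst_case_seconds(tries, timeout_s)


# wget -t 5 -T 1, 20 requests per round, up to 30 rounds (values from the patch).
total = loop_worst_case_seconds(rounds=30, requests_per_round=20, tries=5, timeout_s=1)
print(f"worst case: {total:.0f} s = {total / 3600:.1f} h")
```

With NS_CHECK_EXEC the first wget that fails aborts the test, so under the same assumptions a broken pipeline would cost about 15 s instead of the whole loop.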

> 
> 
>     Best regards, Ilya Maximets.
> 
> Thanks
> Xavier 

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev
