On Tue, Mar 28, 2023 at 01:45:22PM +0200, Eelco Chaudron wrote:
> 
> 
> On 10 Mar 2023, at 17:20, Simon Horman wrote:
> 
> > On Fri, Mar 10, 2023 at 10:15:44AM +0100, Simon Horman wrote:
> >> On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
> >>>
> >>>
> >>> On 9 Mar 2023, at 15:42, Simon Horman wrote:
> >>>
> >>>> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
> >>>>> Run "make check-offloads" as part of the GitHub actions tests.
> >>>>>
> >>>>> This test was run 25 times using GitHub actions, and the
> >>>>> test cases that failed on rerun were excluded. There are
> >>>>> quite a few first-run failures, but unfortunately there is
> >>>>> no more stable kernel available as a GitHub-hosted runner.
> >>>>>
> >>>>> I did not yet include sanitizers in the run, as they cause
> >>>>> the tests to run too long (>30 min) and there seem to be
> >>>>> (timing) issues with some of the tests.
> >>>>>
> >>>>> Signed-off-by: Eelco Chaudron <[email protected]>
> >>>>
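
(For anyone reading along: "make check-offloads" runs the system
datapath tests with tc offload enabled and needs root.  A rough sketch
of what such a job boils down to, with illustrative steps and flags
that are not taken from the patch itself:

  $ ./boot.sh && ./configure
  $ make -j4
  $ sudo make check-offloads RECHECK=yes

RECHECK=yes asks the testsuite to re-run any failed tests once, which
is the first-run/rerun distinction mentioned above.)
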
> >>>> Hi Eelco,
> >>>>
> >>>> I like this patch a lot.
> >>>> But I am observing reliability problems when executing the new job.
> >>>>
> >>>> I did 5 runs, and on each occasion some tests failed the first
> >>>> time.  On 3 of those runs at least one test also failed on the
> >>>> recheck, so the job failed.
> >>>
> >>> Damn :)
> >>
> >> Yes, it pained me to report this.
> >>
> >>> I did 25 runs (I did not check for re-runs), and they were fine. I also 
> >>> cleaned up my jobs recently, so I no longer have them.
> >>>
> >>> I can do this again and figure out which tests are failing, then
> >>> analyze the failures to see whether we need to exclude them or can
> >>> fine-tune them.
> >>
> >> I will see if I can spend some cycles on reproducing this (outside of GHA).
> >> I'll likely start with the tests that show up in the summary below.
> >
> > I started off by looking at this check-offloads test:
> >
> > 50. system-traffic.at:1524: testing datapath - basic truncate action ...
> >
> > I haven't dug into the code to debug the problem yet.
> > But I have collected some results that might be interesting.
> >
> >
> > My testing was on a low-end VM with Ubuntu 18.04, with no HW offload:
> > $ uname -psv
> > Linux #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64
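
(For anyone who wants to chase this further: a single case like the
one above can be re-run in isolation with something along these lines;
the test number and log path are illustrative, and the number can
shift between revisions:

  $ sudo make check-offloads TESTSUITEFLAGS='50'
  $ cat tests/system-offloads-testsuite.dir/*/testsuite.log

The per-test logs under tests/system-offloads-testsuite.dir/ are
usually the quickest way to see which step of the test actually
failed.)
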
> 
> 
> So I took the same approach: I have a local Vagrant VM with Ubuntu 22.04
> (like on GitHub) and ran the tests there.
> 
> I thought I had fixed it with this old commit:
> 
>   https://github.com/openvswitch/ovs/commit/22968988b820aa17a9b050c901208b7d4bed9dac
> 
> However, as you can see, even after excluding the remaining failures
> that I could not figure out, it still fails randomly:
> 
>   https://github.com/chaudron/ovs/actions
> 
> Note that in the 25 runs I did before, none of the above tests failed…
> 
> I was not able to make any of these tests fail on my local Ubuntu VM,
> and analysing the results did not point to a specific thing to fix.
> 
> As this is working fine on my Fedora (VM) setup for multiple runs
> without any problems, I’ll abandon this patch for now :( I’ll try to get
> buy-in from the Robot to run the datapath tests as part of its sanity
> checks for now…

Thanks Eelco,

it's a shame this proved to be elusive.

Perhaps things will improve with a newer, as yet unreleased, Ubuntu
version; perhaps it is a kernel issue.

Or perhaps we have some deeper problem related to running in
resource-constrained environments that we may uncover some day.

In any case, thanks for looking at this.
I agree that it makes sense to abandon this patchset for now,
and that using the Robot may be a promising alternative.
