On Tue, Mar 28, 2023 at 01:45:22PM +0200, Eelco Chaudron wrote:
>
> On 10 Mar 2023, at 17:20, Simon Horman wrote:
>
> > On Fri, Mar 10, 2023 at 10:15:44AM +0100, Simon Horman wrote:
> >> On Thu, Mar 09, 2023 at 05:22:43PM +0100, Eelco Chaudron wrote:
> >>>
> >>> On 9 Mar 2023, at 15:42, Simon Horman wrote:
> >>>
> >>>> On Wed, Mar 08, 2023 at 04:18:47PM +0100, Eelco Chaudron wrote:
> >>>>> Run "make check-offloads" as part of the GitHub Actions tests.
> >>>>>
> >>>>> This test was run 25 times using GitHub Actions, and the test
> >>>>> cases that failed their rerun were excluded. There are quite a
> >>>>> few first-run failures, but unfortunately there is no other,
> >>>>> more stable kernel available as a GitHub-hosted runner.
> >>>>>
> >>>>> Sanitizers are not yet included in the run, as they cause the
> >>>>> tests to run too long (>30 min), and there seem to be (timing)
> >>>>> issues with some of the tests.
> >>>>>
> >>>>> Signed-off-by: Eelco Chaudron <[email protected]>
> >>>>
> >>>> Hi Eelco,
> >>>>
> >>>> I like this patch a lot.
> >>>> But I am observing reliability problems when executing the new job.
> >>>>
> >>>> For 5 runs, on each occasion some tests failed the first time.
> >>>> And on 3 of those runs at least one test failed on the recheck,
> >>>> so the job failed.
> >>>
> >>> Damn :)
> >>
> >> Yes, it pained me to report this.
> >>
> >>> I did 25 runs (I did not check for re-runs), and they were fine. I
> >>> also cleaned up my jobs recently, so I no longer have them.
> >>>
> >>> I can do this again and figure out which tests are failing, then
> >>> analyze the failures to see if we need to exclude them or can
> >>> fine-tune them.
> >>
> >> I will see if I can spend some cycles on reproducing this (outside
> >> of GHA). I'll likely start with the tests that show up in the
> >> summary below.
> >
> > I started off by looking at this check-offloads test:
> >
> >   50. system-traffic.at:1524: testing datapath - basic truncate action ...
> >
> > I haven't dug into the code to debug the problem yet,
> > but I have collected some results that might be interesting.
> >
> > My testing was on a low-end VM running Ubuntu 18.04, with no HW offload:
> >
> >   $ uname -psv
> >   Linux #56-Ubuntu SMP Tue Sep 20 13:23:26 UTC 2022 x86_64
>
> So I took the same approach: I have a local Vagrant VM with Ubuntu 22.04
> (like on GitHub) and ran the tests.
>
> I thought I had fixed it with this old commit:
>
> https://github.com/openvswitch/ovs/commit/22968988b820aa17a9b050c901208b7d4bed9dac
>
> However, as you can see, even after excluding the remaining failures I
> could not figure out, it still fails randomly:
>
> https://github.com/chaudron/ovs/actions
>
> Note that none of the above tests failed in the 25 runs I did earlier…
>
> I was not able to make any of these tests fail on my local Ubuntu, and
> analysing the results did not lead to a specific thing to fix.
>
> As this is working fine on my Fedora (VM) setup for multiple runs
> without any problem, I'll abandon this patch for now :( I'll try to get
> buy-in from the Robot to run the datapath tests as part of its sanity
> check…
Thanks Eelco, it's a shame this proved to be elusive.

Perhaps things will improve with a newer, as yet unreleased, Ubuntu
version - perhaps it is a kernel issue. Or perhaps we have some deep
problem related to running in resource-constrained environments that we
may uncover some day.

In any case, thanks for looking at this. I agree that it makes sense to
abandon this patchset for now, and that using the Robot may be a
promising alternative.
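
For anyone picking this up later, a rough sketch of how one of these
flakes can be chased locally. This is a hedged example, not taken from
the patch under discussion: test number 50 comes from the summary quoted
above and shifts as tests are added or removed, root is required because
the system testsuites set up kernel datapaths, and the log directory name
is an assumption based on the usual autotest layout.

    # Build first, then run only the suspect test, verbosely, either by
    # number or by keyword ("truncate" matches words from the test title).
    $ ./boot.sh && ./configure && make -j$(nproc)
    $ sudo -E make check-offloads TESTSUITEFLAGS='-v 50'
    $ sudo -E make check-offloads TESTSUITEFLAGS='-k truncate'

    # Loop it to estimate the failure rate; per-test logs are kept under
    # tests/system-offloads-testsuite.dir/ (assumed path).
    $ for i in $(seq 25); do sudo -E make check-offloads TESTSUITEFLAGS='50' || echo "run $i failed"; done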
