<snip>

> Subject: Re: [dpdk-dev] Random failure in service_autotest
>
> Lukasz Wojciechowski <l.wojciec...@partner.samsung.com> writes:
>
> > On 17.07.2020 at 17:19, David Marchand wrote:
> >> On Fri, Jul 17, 2020 at 10:56 AM David Marchand
> >> <david.march...@redhat.com> wrote:
> >>> On Wed, Jul 15, 2020 at 12:41 PM Ferruh Yigit <ferruh.yi...@intel.com> wrote:
> >>>> On 7/15/2020 11:14 AM, David Marchand wrote:
> >>>>> Hello Harry and guys who touched the service code recently :-)
> >>>>>
> >>>>> I spotted a failure for the service UT in Travis:
> >>>>> https://travis-ci.com/github/ovsrobot/dpdk/jobs/361097992#L18697
> >>>>>
> >>>>> I found only a single instance of this failure and tried to
> >>>>> reproduce it with my usual "brute" active loop with no success so far.
> >>>> +1, I wasn't able to reproduce it in my environment but observed it
> >>>> in the Travis CI.
> >>>>
> >>>>> Any chance it could be due to recent changes?
> >>>>> https://protect2.fireeye.com/url?k=70a801b3-2d7b5aa7-70a98afc-0cc47a31ce4e-231dc7b8ee6eb8a9&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcommit%2F%3Fid%3Df3c256b621262e581d3edcca383df83875ab7ebe
> >>>>> https://protect2.fireeye.com/url?k=21dbcfd3-7c0894c7-21da449c-0cc47a31ce4e-d8c6abfb03bf67f1&q=1&u=https%3A%2F%2Fgit.dpdk.org%2Fdpdk%2Fcommit%2F%3Fid%3D048db4b6dcccaee9277ce5b4fbb2fe684b212e22
> >>> I can see more occurrences of the issue in the CI.
> >>> I just applied the patch changing the log level for test assert, in
> >>> the hope it will help.
> >> And... we just got one with logs:
> >> https://travis-ci.com/github/ovsrobot/dpdk/jobs/362109882#L18948
> >>
> >> EAL: Test assert service_lcore_attr_get line 396 failed:
> >> lcore_attr_get() didn't get correct loop count (zero)
> >>
> >> It looks like a race between the service core still running and the
> >> core resetting the loops attr.
> >>
> > Yes, it seems to be just a lack of patience in the test. It should wait
> > a bit for the lcore to stop before resetting the attrs.
> > Something like this should help:
> > @@ -384,6 +384,9 @@ service_lcore_attr_get(void)
> >
> >  	rte_service_lcore_stop(slcore_id);
> >
> > +	/* wait for the service lcore to stop */
> > +	rte_delay_ms(200);
> > +
> >  	TEST_ASSERT_EQUAL(0,
> >  			  rte_service_lcore_attr_reset_all(slcore_id),
> >  			  "Valid lcore_attr_reset_all() didn't return success");
>
> Would an rte_eal_wait_lcore() call make sense? Overall, I really dislike
> sleeps because they can hide racy synchronization points.

I think something like the patch below might be better.

diff --git a/app/test/test_service_cores.c b/app/test/test_service_cores.c
index ef1d8fcb9..f0bedbe5e 100644
--- a/app/test/test_service_cores.c
+++ b/app/test/test_service_cores.c
@@ -384,6 +384,16 @@ service_lcore_attr_get(void)
 
 	rte_service_lcore_stop(slcore_id);
 
+	/* give the service 200ms to stop running */
+	for (i = 0; i < 200; i++) {
+		if (!rte_service_may_be_active(sid))
+			break;
+		rte_delay_ms(SERVICE_DELAY);
+	}
+
+	TEST_ASSERT_EQUAL(0, rte_service_may_be_active(sid),
+			  "Error: Service not stopped after 200ms");
+
 	TEST_ASSERT_EQUAL(0,
 			  rte_service_lcore_attr_reset_all(slcore_id),
 			  "Valid lcore_attr_reset_all() didn't return success");
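
For what it's worth, the bounded-poll idea is independent of DPDK, so here is a minimal standalone sketch of it. Everything in it is an assumption for illustration only: wait_until_inactive(), struct fake_service, fake_is_active() and fake_delay() are hypothetical names, not part of the service API. The point is just that the wait ends as soon as the "still active" check goes false, and the caller gets an explicit failure if it never does, rather than a fixed sleep hiding the race.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback types, for illustration only. */
typedef bool (*active_fn)(void *ctx);
typedef void (*delay_fn)(unsigned int ms);

/*
 * Poll is_active() up to max_polls times, calling delay(delay_ms)
 * between polls. Returns true if the target was seen inactive,
 * false if it was still active after the last poll.
 */
static bool
wait_until_inactive(active_fn is_active, void *ctx,
		    unsigned int max_polls, unsigned int delay_ms,
		    delay_fn delay)
{
	unsigned int i;

	assert(is_active != NULL && delay != NULL);

	for (i = 0; i < max_polls; i++) {
		if (!is_active(ctx))
			return true;
		delay(delay_ms);
	}

	/* one final check after the last delay */
	return !is_active(ctx);
}

/* Stand-in for a service that reports active for a few more polls. */
struct fake_service {
	int polls_until_idle;
};

static bool
fake_is_active(void *ctx)
{
	struct fake_service *s = ctx;

	if (s->polls_until_idle > 0) {
		s->polls_until_idle--;
		return true;
	}
	return false;
}

/* Stand-in for a real delay: only records how long we "waited". */
static unsigned int fake_delay_total_ms;

static void
fake_delay(unsigned int ms)
{
	fake_delay_total_ms += ms;
}
```

In the test itself the predicate would be a thin wrapper around rte_service_may_be_active(sid) and the delay would be rte_delay_ms, which is what the loop in the patch does inline.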