Hi Harman, Thank you for testing this.
> -----Original Message-----
> From: Harman Kalra <hka...@marvell.com>
> Sent: Thursday, October 17, 2019 19:42
> To: Ruifeng Wang (Arm Technology China) <ruifeng.w...@arm.com>
> Cc: David Marchand <david.march...@redhat.com>; Aaron Conole
> <acon...@redhat.com>; David Hunt <david.h...@intel.com>; dev
> <dev@dpdk.org>; Gavin Hu (Arm Technology China) <gavin...@arm.com>;
> Honnappa Nagarahalli <honnappa.nagaraha...@arm.com>; nd
> <n...@arm.com>; dpdk stable <sta...@dpdk.org>
> Subject: Re: [EXT] RE: [dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix
> deadlock issue for aarch64
>
> Hi
>
> I tested this patch, following are my observations:
> 1. With this patch distributor_autotest getting suspended on arm64 platform
> is resolved. But continuous execution of this test results in test failure, as
> reported by Aaron.
> 2. While testing on x86 platform, still I can observe distributor_autotest
> getting suspended (stuck) on continuous execution of the test (it took almost
> 7-8 iterations to reproduce the suspension).

Yes, the v1 patch does not completely solve the issue.
I have posted v3:
http://patches.dpdk.org/project/dpdk/list/?series=6856
With the new patch set, I didn't observe any test failures in my testing.
Could you try it?
Thanks.
/Ruifeng

>
> Thanks
>
> On Wed, Oct 09, 2019 at 05:52:03AM +0000, Ruifeng Wang (Arm Technology
> China) wrote:
> > External Email
> >
> > ----------------------------------------------------------------------
> >
> > > -----Original Message-----
> > > From: David Marchand <david.march...@redhat.com>
> > > Sent: Wednesday, October 9, 2019 03:47
> > > To: Aaron Conole <acon...@redhat.com>
> > > Cc: Ruifeng Wang (Arm Technology China) <ruifeng.w...@arm.com>;
> > > David Hunt <david.h...@intel.com>; dev <dev@dpdk.org>;
> > > hka...@marvell.com; Gavin Hu (Arm Technology China)
> > > <gavin...@arm.com>; Honnappa Nagarahalli
> > > <honnappa.nagaraha...@arm.com>; nd <n...@arm.com>; dpdk stable
> > > <sta...@dpdk.org>
> > > Subject: Re: [dpdk-stable] [dpdk-dev] [PATCH] lib/distributor: fix
> > > deadlock issue for aarch64
> > >
> > > On Tue, Oct 8, 2019 at 7:06 PM Aaron Conole <acon...@redhat.com> wrote:
> > > >
> > > > Ruifeng Wang <ruifeng.w...@arm.com> writes:
> > > >
> > > > > Distributor and worker threads rely on data structs in cache
> > > > > line for synchronization. The shared data structs were not protected.
> > > > > This caused deadlock issue on weaker memory ordering platforms
> > > > > as aarch64.
> > > > > Fix this issue by adding memory barriers to ensure
> > > > > synchronization among cores.
> > > > >
> > > > > Bugzilla ID: 342
> > > > > Fixes: 775003ad2f96 ("distributor: add new burst-capable
> > > > > library")
> > > > > Cc: sta...@dpdk.org
> > > > >
> > > > > Signed-off-by: Ruifeng Wang <ruifeng.w...@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin...@arm.com>
> > > > > ---
> > > >
> > > > I see a failure in the distributor_autotest (on one of the builds):
> > > >
> > > > 64/82 DPDK:fast-tests / distributor_autotest FAIL 0.37 s (exit
> > > > status 255 or signal 127 SIGinvalid)
> > > >
> > > > --- command ---
> > > >
> > > > DPDK_TEST='distributor_autotest'
> > > > /home/travis/build/ovsrobot/dpdk/build/app/test/dpdk-test -l 0-1
> > > > --file-prefix=distributor_autotest
> > > >
> > > > --- stdout ---
> > > >
> > > > EAL: Probing VFIO support...
> > > > APP: HPET is not enabled, using TSC as default timer
> > > > RTE>>distributor_autotest
> > > > === Basic distributor sanity tests ===
> > > > Worker 0 handled 32 packets
> > > > Sanity test with all zero hashes done.
> > > > Worker 0 handled 32 packets
> > > > Sanity test with non-zero hashes done
> > > > === testing big burst (single) ===
> > > > Sanity test of returned packets done
> > > > === Sanity test with mbuf alloc/free (single) ===
> > > > Sanity test with mbuf alloc/free passed
> > > > Too few cores to run worker shutdown test
> > > > === Basic distributor sanity tests ===
> > > > Worker 0 handled 32 packets
> > > > Sanity test with all zero hashes done.
> > > > Worker 0 handled 32 packets
> > > > Sanity test with non-zero hashes done
> > > > === testing big burst (burst) ===
> > > > Sanity test of returned packets done
> > > > === Sanity test with mbuf alloc/free (burst) ===
> > > > Line 326: Packet count is incorrect, 1048568, expected 1048576
> > > > Test Failed
> > > > RTE>>
> > > >
> > > > --- stderr ---
> > > >
> > > > EAL: Detected 2 lcore(s)
> > > > EAL: Detected 1 NUMA nodes
> > > > EAL: Multi-process socket /var/run/dpdk/distributor_autotest/mp_socket
> > > > EAL: Selected IOVA mode 'PA'
> > > > EAL: No available hugepages reported in hugepages-1048576kB
> > > >
> > > > -------
> > > >
> > > > Not sure how to help debug further. I'll re-start the job to see
> > > > if it 'clears' up - but I guess there may be a delicate
> > > > synchronization somewhere that needs to be accounted.
> > >
> > > Idem, and with the same loop I used before, it can be caught quickly.
> > >
> > > # time (log=/tmp/$$.log; while true; do echo distributor_autotest
> > > |taskset -c 0-1 ./build-gcc-static/app/test/dpdk-test --log-level
> > > |*:8
> > > -l 0-1 >$log 2>&1; grep -q 'Test OK' $log || break; done; cat $log;
> > > rm -f $log)
> > >
> > Thanks Aaron and David for your report. I can reproduce this issue with the
> > script.
> > Will fix it in next version.
> >
> > > [snip]
> > >
> > > RTE>>distributor_autotest
> > > EAL: Trying to obtain current memory policy.
> > > EAL: Setting policy MPOL_PREFERRED for socket 0
> > > EAL: Restoring previous memory policy: 0
> > > EAL: request: mp_malloc_sync
> > > EAL: Heap on socket 0 was expanded by 2MB
> > > EAL: Trying to obtain current memory policy.
> > > EAL: Setting policy MPOL_PREFERRED for socket 0
> > > EAL: Restoring previous memory policy: 0
> > > EAL: alloc_pages_on_heap(): couldn't allocate physically contiguous
> > > space
> > > EAL: Trying to obtain current memory policy.
> > > EAL: Setting policy MPOL_PREFERRED for socket 0
> > > EAL: Restoring previous memory policy: 0
> > > EAL: request: mp_malloc_sync
> > > EAL: Heap on socket 0 was expanded by 8MB
> > > === Basic distributor sanity tests ===
> > > Worker 0 handled 32 packets
> > > Sanity test with all zero hashes done.
> > > Worker 0 handled 32 packets
> > > Sanity test with non-zero hashes done
> > > === testing big burst (single) ===
> > > Sanity test of returned packets done
> > >
> > > === Sanity test with mbuf alloc/free (single) ===
> > > Sanity test with mbuf alloc/free passed
> > >
> > > Too few cores to run worker shutdown test
> > > === Basic distributor sanity tests ===
> > > Worker 0 handled 32 packets
> > > Sanity test with all zero hashes done.
> > > Worker 0 handled 32 packets
> > > Sanity test with non-zero hashes done
> > > === testing big burst (burst) ===
> > > Sanity test of returned packets done
> > >
> > > === Sanity test with mbuf alloc/free (burst) ===
> > > Line 326: Packet count is incorrect, 1048568, expected 1048576
> > > Test Failed
> > > RTE>>
> > > real 0m36.668s
> > > user 1m7.293s
> > > sys 0m1.560s
> > >
> > > Could be worth running this loop on all tests? (not talking about
> > > the CI, it would be a manual effort to catch lurking issues).
> > >
> > >
> > > --
> > > David Marchand
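
For anyone who wants to run the quoted reproduction loop against other tests, as David suggests, here is a minimal sketch of the same idea as a reusable shell function. It is not part of the patch or of the DPDK tree. The function name repeat_until_fail, the run cap, and the wrapper in the trailing comment are all hypothetical; the only things taken from the thread are the 'Test OK' marker and the command line in the commented example.

```shell
#!/bin/sh
# Sketch only: re-run a command until its output no longer contains
# 'Test OK', or until a run cap is reached (the quoted loop has no cap;
# the cap is an assumption added here so the loop always terminates).
#
# Usage: repeat_until_fail MAX_RUNS LOGFILE COMMAND [ARGS...]
# Prints the number of runs performed.
# Returns 1 if a failing run was caught (LOGFILE holds its output),
# 0 if all MAX_RUNS runs passed.
repeat_until_fail() {
    max_runs=$1
    log=$2
    shift 2
    runs=0
    while [ "$runs" -lt "$max_runs" ]; do
        runs=$((runs + 1))
        # Capture stdout and stderr of this run, overwriting the previous log.
        "$@" >"$log" 2>&1
        if ! grep -q 'Test OK' "$log"; then
            echo "$runs"
            return 1
        fi
    done
    echo "$runs"
    return 0
}

# Example wiring (hypothetical wrapper; command taken from the quoted loop):
#   run_distributor_autotest() {
#       echo distributor_autotest | taskset -c 0-1 \
#           ./build-gcc-static/app/test/dpdk-test -l 0-1
#   }
#   repeat_until_fail 100 /tmp/distributor.log run_distributor_autotest \
#       || cat /tmp/distributor.log
```

Because the log is overwritten on every run, the file left behind after a non-zero return is the output of the failing iteration, which matches what the original one-liner prints with `cat $log`.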