Alexander Kozyrev <akozy...@mellanox.com> writes:

<snip>

> > > > > Subject: RE: [dpdk-dev] [PATCH v3] net/mlx5: relaxed ordering for
> > > > > multi-packet RQ buffer refcnt
> > > > >
> > > > > Hi Phil Yang, we noticed that this patch gives us 10% of
> > > > > performance degradation on ARM.
> > > > > x86 seems to be unaffected though. Do you know what may be the
> > > > > reason of this behavior?
> > > >
> > > > Hi Alexander,
> > > >
> > > > Thanks for your feedback.
> > > > This patch removed some expensive memory barriers on aarch64, it
> > > > should get better performance.
> > > > I am not sure the root cause of this degradation now, I will start
> > > > the investigation. We can profiling this issue together.
> > > > Could you share your test case(including your testbed configuration)
> > > > with us?
<...>
> > >
> > > I'm surprised too, Phil, but looks like it is actually making things
> > > worse. I used Connect-X 6DX on aarch64:
> > > Linux dragon71-bf 5.4.31-mlnx.15.ge938819 #1 SMP PREEMPT Thu Jul 2
> > > 17:01:15 IDT 2020 aarch64 aarch64 aarch64 GNU/Linux
> > > Traffic generator sends 60 bytes packets and DUT executes the
> > > following command:
> > > arm64-bluefield-linuxapp-gcc/build/app/test-pmd/testpmd -n 4
> > > -w 0000:03:00.1,mprq_en=1,rxqs_min_mprq=1
> > > -w 0000:03:00.0,mprq_en=1,rxqs_min_mprq=1 -c 0xe -- --burst=64
> > > --mbcache=512 -i --nb-cores=1 --rxq=1 --txq=1 --txd=256 --rxd=256
> > > --auto-start --rss-udp
> > > Without a patch I'm getting 3.2mpps, and only 2.9mpps when the
> > > patch is applied.
> >
> > You are running on A72 cores, is that correct?
>
> Correct, cat /proc/cpuinfo
> processor       : 0
> BogoMIPS        : 312.50
> Features        : fp asimd evtstrm crc32 cpuid
> CPU implementer : 0x41
> CPU architecture: 8
> CPU variant     : 0x0
> CPU part        : 0xd08
> CPU revision    : 3

Thanks a lot for your input, Alex.
Using your test command line, I remeasured this patch on two different
aarch64 machines, and both showed a performance improvement.

SOC#1: On ThunderX2 (with LSE support), I see a 7.6% improvement in
throughput.
NIC: ConnectX-6 / driver: mlx5_core / version: 5.0-1.0.0.0 /
firmware-version: 20.27.1016 (MT_0000000224)

SOC#2: On N1SDP (with LSE disabled, so that the generated instructions
are similar to those for A72), I also see a slight (about 1%~2%)
improvement in throughput.
NIC: ConnectX-5 / driver: mlx5_core / version: 5.0-2.1.8 /
firmware-version: 16.27.2008 (MT_0000000090)

Without LSE (i.e. the A72 and SOC#2 cases), the compiler uses the
'Exclusive' mechanism (a load-exclusive/store-exclusive retry loop) to
achieve atomicity. For example, it generates the following instructions
for __atomic_add_fetch:

__atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
   70118:       f94037e3        ldr     x3, [sp, #104]
   7011c:       91002060        add     x0, x3, #0x8
   70120:       485ffc02        ldaxrh  w2, [x0]
   70124:       11000442        add     w2, w2, #0x1
   70128:       48057c02        stxrh   w5, w2, [x0]
   7012c:       35ffffa5        cbnz    w5, 70120 <mlx5_rx_burst_mprq+0xb48>

In general, I do not think this patch should cause a sharp drop in
performance. Could you try it on other testbeds?

Thanks,
Phil
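
For anyone who wants to reproduce the instruction sequence above outside the driver, here is a minimal sketch. It is not the mlx5 code: the struct mprq_buf_sketch, its pad field (used only to put refcnt at offset 8, matching the "add x0, x3, #0x8" in the listing), and all helper names are hypothetical. It assumes GCC or Clang, since it relies on the __atomic builtins. Compiling it for aarch64 without LSE and disassembling should show the acquire variant using a load-acquire exclusive (ldaxrh) as in the listing, while the relaxed variant would be expected to use a plain load exclusive instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the MPRQ buffer header. Only the 16-bit
 * refcnt matters; pad merely places it at offset 8. */
struct mprq_buf_sketch {
	uint64_t pad;
	uint16_t refcnt;
};

/* Increment with acquire ordering, as in the listing above. */
static inline uint16_t
refcnt_inc_acquire(struct mprq_buf_sketch *buf)
{
	return __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_ACQUIRE);
}

/* Increment with relaxed ordering: still atomic, but it imposes no
 * ordering on the surrounding loads and stores. */
static inline uint16_t
refcnt_inc_relaxed(struct mprq_buf_sketch *buf)
{
	return __atomic_add_fetch(&buf->refcnt, 1, __ATOMIC_RELAXED);
}

/* Decrement with release ordering; returns nonzero when the caller
 * dropped the last reference. */
static inline int
refcnt_dec_release(struct mprq_buf_sketch *buf)
{
	return __atomic_sub_fetch(&buf->refcnt, 1, __ATOMIC_RELEASE) == 0;
}

/* Single-threaded sanity check of the helpers; returns 0 on success. */
static inline int
refcnt_sketch_selfcheck(void)
{
	struct mprq_buf_sketch buf = { .pad = 0, .refcnt = 1 };

	assert(refcnt_inc_acquire(&buf) == 2);
	assert(refcnt_inc_relaxed(&buf) == 3);
	assert(!refcnt_dec_release(&buf)); /* 3 -> 2 */
	assert(!refcnt_dec_release(&buf)); /* 2 -> 1 */
	assert(refcnt_dec_release(&buf));  /* 1 -> 0, last reference */
	return 0;
}
```

The memory-order argument is the only difference between the two increment helpers, so comparing their disassembly isolates the cost of the ordering from the cost of the exclusive retry loop itself.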