>-----Original Message-----
>From: Honnappa Nagarahalli [mailto:honnappa.nagaraha...@arm.com]
>Sent: Monday, November 5, 2018 10:08 PM
>To: Jerin Jacob <jerin.ja...@caviumnetworks.com>
>Cc: Richardson, Bruce <bruce.richard...@intel.com>; De Lara Guarch, Pablo
><pablo.de.lara.gua...@intel.com>; dev@dpdk.org; Wang,
>Yipeng1 <yipeng1.w...@intel.com>; Dharmik Thakkar <dharmik.thak...@arm.com>;
>Gavin Hu (Arm Technology China)
><gavin...@arm.com>; nd <n...@arm.com>; tho...@monjalon.net; Yigit, Ferruh
><ferruh.yi...@intel.com>;
>hemant.agra...@nxp.com; chao...@linux.vnet.ibm.com; nd <n...@arm.com>
>Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
>concurrency
>
>> >
>> > 9) Is anyone else facing this problem?
>Any data on x86?
>
[Wang, Yipeng] I tried Jerin's tests on x86. By default, l3fwd on x86 uses lookup_bulk and SIMD instructions, so there is no obvious throughput drop in either the hit or the miss case (for the hit case there is about a 2.5% drop, though).
I manually changed l3fwd to do single-packet lookup instead of bulk. For the hit case there is no throughput drop. For the miss case, there is a 10% throughput drop. I dug into it and, as expected, the atomic load indeed translates to a regular mov on x86. But because of the reordering of the instructions, the compiler (gcc 5.4) can no longer unroll the for loop into the switch-case-like assembly it produced before. So I believe the reason for the performance drop on x86 is that the compiler cannot optimize the code as well as previously. I guess this is a totally different reason from the performance drop on your non-TSO machine; there, the excessive number of atomic loads probably causes a lot of overhead.

A quick fix I found useful on x86 is to read all the indexes together. I am no expert on the use of atomic intrinsics, but I assume adding a fence should still maintain the correct ordering?

-	uint32_t key_idx;
+	uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
 	void *pdata;
 	struct rte_hash_key *k, *keys = h->key_store;

+	memcpy(key_idx, bkt->key_idx, 4 * RTE_HASH_BUCKET_ENTRIES);
+	__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		key_idx = __atomic_load_n(&bkt->key_idx[i],
-					__ATOMIC_ACQUIRE);
-		if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
+		if (bkt->sig_current[i] == sig && key_idx[i] != EMPTY_SLOT) {

Yipeng