On 10/1/2025 9:56 AM, Shaiq Wani wrote:
Some CPUs don't support AVX512. Enable AVX2 for them to
get better per-core performance.
In the single queue model, the same descriptor queue is used by SW
to post descriptors to the device and used by device to report completed
descriptors to SW, whereas the split queue model separates them into
different queues for parallel processing and improved performance.
Signed-off-by: Shaiq Wani <[email protected]>
---
Hi Shaiq,
<snip>
> +RTE_EXPORT_INTERNAL_SYMBOL(idpf_splitq_rearm_common)
> +void
> +idpf_splitq_rearm_common(struct idpf_rx_queue *rx_bufq)
> +{
> + struct rte_mbuf **rxp = &rx_bufq->sw_ring[rx_bufq->rxrearm_start];
> + volatile union virtchnl2_rx_buf_desc *rxdp = rx_bufq->rx_ring;
> + uint16_t rx_id;
> + int i;
> +
> + rxdp += rx_bufq->rxrearm_start;
> +
> + /* Pull 'n' more MBUFs into the software ring */
> + if (rte_mbuf_raw_alloc_bulk(rx_bufq->mp,
> + (void *)rxp,
> + IDPF_RXQ_REARM_THRESH) < 0) {
> + if (rx_bufq->rxrearm_nb + IDPF_RXQ_REARM_THRESH >=
> + rx_bufq->nb_rx_desc) {
> + __m128i dma_addr0;
> +
> + dma_addr0 = _mm_setzero_si128();
> + for (i = 0; i < IDPF_VPMD_DESCS_PER_LOOP; i++) {
> + rxp[i] = &rx_bufq->fake_mbuf;
> + _mm_store_si128(RTE_CAST_PTR(__m128i *, &rxdp[i]),
> + dma_addr0);
This is common code (it is built for non-x86 platforms too), so you can't
use x86-specific intrinsics here.
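
Something along these lines would stay portable. This is only a rough
sketch: struct demo_rx_buf_desc, DEMO_DESCS_PER_LOOP and demo_clear_descs()
are hypothetical stand-ins I made up for illustration, not the real
virtchnl2 descriptor layout or any existing helper. The point is just that
plain stores can zero the descriptors without any SSE intrinsics.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 16-byte Rx buffer descriptor; the real
 * union virtchnl2_rx_buf_desc layout is not reproduced here.
 */
struct demo_rx_buf_desc {
	uint64_t pkt_addr;
	uint64_t hdr_addr;
};

/* Assumed to play the role of IDPF_VPMD_DESCS_PER_LOOP. */
#define DEMO_DESCS_PER_LOOP 4

/* Zero a group of descriptors with plain stores, so the same code
 * builds on any architecture.
 */
static inline void
demo_clear_descs(volatile struct demo_rx_buf_desc *rxdp)
{
	for (int i = 0; i < DEMO_DESCS_PER_LOOP; i++) {
		rxdp[i].pkt_addr = 0;
		rxdp[i].hdr_addr = 0;
	}
}
```

If the intrinsic version is measurably faster on x86, it could live in the
x86-specific file instead, with the common code keeping the generic loop.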
+ for (uint16_t i = 0; i < nb_pkts;
+ i += IDPF_VPMD_DESCS_PER_LOOP,
+ rxdp += IDPF_VPMD_DESCS_PER_LOOP) {
+ /* Step 1: copy 4 mbuf pointers (64-bit each) into rx_pkts[] */
+#ifdef RTE_ARCH_X86_64
+ __m128i ptrs_lo = _mm_loadu_si128((const __m128i *)&sw_ring[i]);
+ __m128i ptrs_hi = _mm_loadu_si128((const __m128i *)&sw_ring[i + 2]);
+ _mm_storeu_si128((__m128i *)&rx_pkts[i], ptrs_lo);
+ _mm_storeu_si128((__m128i *)&rx_pkts[i + 2], ptrs_hi);
+#else
+ for (int j = 0; j < IDPF_VPMD_DESCS_PER_LOOP; ++j)
+ rx_pkts[i + j] = sw_ring[i + j];
+#endif
Why not just a single load/store? I guess the compiler will optimize this
anyway, but it looks odd. Also, I think the comment only applies to the
64-bit path, so it should probably be made more generic (e.g. "copy 4 mbuf
pointers into rx_pkts[]", without mentioning how long they are).
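
As an untested sketch of one way to drop the #ifdef entirely: a fixed-size
memcpy() of the pointer group is portable, and compilers typically turn it
into vector moves where that is profitable. struct demo_mbuf,
DEMO_PTRS_PER_LOOP and demo_copy_mbuf_ptrs() below are hypothetical names
for illustration only, and the sketch assumes sw_ring is a plain array of
mbuf pointers, as the loads in the patch suggest.

```c
#include <string.h>

/* Opaque, hypothetical stand-in for struct rte_mbuf. */
struct demo_mbuf;

/* Assumed to play the role of IDPF_VPMD_DESCS_PER_LOOP. */
#define DEMO_PTRS_PER_LOOP 4

/* Copy one group of mbuf pointers from the software ring to the
 * caller's array, independent of pointer width.
 */
static inline void
demo_copy_mbuf_ptrs(struct demo_mbuf **rx_pkts,
		struct demo_mbuf * const *sw_ring)
{
	/* copy 4 mbuf pointers into rx_pkts[] */
	memcpy(rx_pkts, sw_ring, DEMO_PTRS_PER_LOOP * sizeof(*sw_ring));
}
```

The size is a compile-time constant, so there is no per-call length check,
and the same line covers both 32-bit and 64-bit builds.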
--
Thanks,
Anatoly