I have an app which captures packets on a single core and then passes them to multiple workers on different lcores using rte_ring queues.
While I manage to capture packets at 10Gbps, there is substantial packet loss once they are handed off to the processing lcores. At first I figured it was the processing I do on the packets; optimizing that helped a little but did not alleviate the problem. I then profiled the program with Intel VTune Amplifier, and in every profiling run the majority of the time (about 70%) is spent in "__rte_ring_sc_do_dequeue". I was wondering if anyone can tell me how to optimize this, whether I'm using the queues incorrectly, or whether my profiling is off (I do find it odd that dequeuing would be this slow).

My program architecture is as follows (constants replaced with their actual values).

A queue of 1024*1024 entries is created for each processing lcore:

    rte_ring_create(qname, 1024*1024, NUMA_SOCKET, RING_F_SP_ENQ | RING_F_SC_DEQ);

The capture core enqueues packets one by one to each of the queues (the packet burst size is 256):

    rte_ring_sp_enqueue(lc[queue_index].queue, (void *const)pkts[i]);

These are then dequeued in bulk on the processing lcores:

    rte_ring_sc_dequeue_bulk(lc->queue, (void**) &mbufs, 128);

I'm using 16 1GB hugepages and running DPDK 2.0. If any further info about the program is required, let me know. Thank you.
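To make the structure clearer, here is a stripped-down sketch of the two loops. EAL, mempool, and port initialization and the rte_eal_remote_launch() calls are omitted, and NB_WORKERS, the burst constants, the function names, and the round-robin distribution are placeholders rather than my exact code:

    /*
     * Stripped-down sketch of the capture/worker structure described above.
     * EAL, mempool, port setup, and lcore launching are omitted.
     * NB_WORKERS, the burst sizes, and the round-robin distribution are
     * placeholders, not the exact production code.
     */
    #include <errno.h>
    #include <stdio.h>

    #include <rte_ring.h>
    #include <rte_mbuf.h>
    #include <rte_ethdev.h>

    #define NB_WORKERS    4            /* placeholder: number of processing lcores */
    #define RING_ENTRIES  (1024 * 1024)
    #define CAPTURE_BURST 256
    #define WORKER_BURST  128

    static struct rte_ring *worker_ring[NB_WORKERS];

    /* One single-producer/single-consumer ring per processing lcore. */
    static void
    create_rings(int socket)
    {
        char name[32];
        unsigned i;

        for (i = 0; i < NB_WORKERS; i++) {
            snprintf(name, sizeof(name), "worker_ring_%u", i);
            worker_ring[i] = rte_ring_create(name, RING_ENTRIES, socket,
                                             RING_F_SP_ENQ | RING_F_SC_DEQ);
            /* error handling omitted */
        }
    }

    /* Capture lcore: receive a burst, hand packets to the workers one by one. */
    static int
    capture_loop(void *arg)
    {
        struct rte_mbuf *pkts[CAPTURE_BURST];
        unsigned i, nb_rx, next = 0;

        (void)arg;
        for (;;) {
            nb_rx = rte_eth_rx_burst(0, 0, pkts, CAPTURE_BURST);
            for (i = 0; i < nb_rx; i++) {
                /* single-item enqueue, one call per packet */
                if (rte_ring_sp_enqueue(worker_ring[next], pkts[i]) == -ENOBUFS)
                    rte_pktmbuf_free(pkts[i]);   /* ring full: drop */
                next = (next + 1) % NB_WORKERS;
            }
        }
        return 0;
    }

    /* Worker lcore: pull packets back out in fixed bulks of 128 and process. */
    static int
    worker_loop(void *arg)
    {
        struct rte_ring *ring = arg;
        struct rte_mbuf *mbufs[WORKER_BURST];
        unsigned i;

        for (;;) {
            /* In DPDK 2.0 the bulk dequeue is all-or-nothing: it returns
             * -ENOENT and takes nothing unless 128 entries are available. */
            if (rte_ring_sc_dequeue_bulk(ring, (void **)mbufs, WORKER_BURST) != 0)
                continue;
            for (i = 0; i < WORKER_BURST; i++) {
                /* ... per-packet processing ... */
                rte_pktmbuf_free(mbufs[i]);
            }
        }
        return 0;
    }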