On Tue, Apr 05, 2016 at 11:29:05AM +0200, Jesper Dangaard Brouer wrote:
> > 
> > Of course, there are other pieces to accelerate:
> >  12.71%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_alloc_frags
> >   6.87%  ksoftirqd/1    [mlx4_en]         [k] mlx4_en_free_frag
> >   4.20%  ksoftirqd/1    [kernel.vmlinux]  [k] get_page_from_freelist
> >   4.09%  swapper        [mlx4_en]         [k] mlx4_en_process_rx_cq
> > and I think Jesper's work on batch allocation is going to help that a lot.
> 
> Actually, it looks like all of this "overhead" comes from the page
> alloc/free (+ dma unmap/map). We would need a page-pool recycle
> mechanism to solve/remove this overhead.  For the early drop case we
> might be able to hack up recycling of the page directly in the driver
> (and also avoid the dma_unmap/map cycle).

Exactly. A cache of allocated and DMA-mapped pages will help both the drop
and redirect use cases a lot. After tx completion we can recycle the
still-mapped page into the cache (we need to make sure to map pages
PCI_DMA_BIDIRECTIONAL) and rx can refill the ring from it. In the load
balancer steady state we won't have any calls into the page allocator or
dma map/unmap at all.
Being able to do a cheap percpu pool like this is a huge advantage that no
kernel bypass can have. I'm pretty sure it will be possible to avoid
local_cmpxchg as well.
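
To make the idea concrete, here is a minimal sketch of such a cache,
assuming one instance per rx ring owned by the driver. The struct and
function names (rx_page_cache, page_cache_put/get) are hypothetical, not an
existing kernel API: tx completion hands the still-mapped page back via
page_cache_put() instead of unmapping and freeing it, and rx refill calls
page_cache_get(), falling back to alloc_page() + dma_map_page() only when
the cache is empty.

/* Illustrative sketch only, not mlx4 code; all names here are made up. */
#include <linux/dma-mapping.h>
#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/types.h>

#define RX_PAGE_CACHE_SIZE 256

struct mapped_page {
	struct page *page;
	dma_addr_t   dma;
};

struct rx_page_cache {
	struct device     *dev;
	unsigned int       count;	/* pages currently cached */
	struct mapped_page entries[RX_PAGE_CACHE_SIZE];
};

/* tx completion (or early drop): keep the page mapped and stash it,
 * instead of dma_unmap_page() + put_page(). Returns false when the
 * cache is full and the caller has to unmap/free as before. */
static bool page_cache_put(struct rx_page_cache *c, struct page *page,
			   dma_addr_t dma)
{
	if (c->count >= RX_PAGE_CACHE_SIZE)
		return false;
	c->entries[c->count].page = page;
	c->entries[c->count].dma  = dma;
	c->count++;
	return true;
}

/* rx refill: prefer a recycled mapping; only fall back to alloc + map.
 * Pages are mapped DMA_BIDIRECTIONAL (PCI_DMA_BIDIRECTIONAL) so the same
 * mapping can be reused for rx and a later tx. */
static int page_cache_get(struct rx_page_cache *c, struct mapped_page *out)
{
	if (c->count) {
		*out = c->entries[--c->count];
		dma_sync_single_for_device(c->dev, out->dma, PAGE_SIZE,
					   DMA_BIDIRECTIONAL);
		return 0;
	}

	out->page = alloc_page(GFP_ATOMIC);
	if (!out->page)
		return -ENOMEM;
	out->dma = dma_map_page(c->dev, out->page, 0, PAGE_SIZE,
				DMA_BIDIRECTIONAL);
	if (dma_mapping_error(c->dev, out->dma)) {
		__free_page(out->page);
		return -ENOMEM;
	}
	return 0;
}

With one such cache per rx ring (already serialized by NAPI) or per cpu,
put/get are plain array accesses in the steady state, which is also how the
local_cmpxchg could be avoided.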
