On Tue, 13 Sep 2016 13:16:29 +0300
Tariq Toukan <tar...@mellanox.com> wrote:

> On 07/09/2016 9:45 PM, Jesper Dangaard Brouer wrote:
> > On Wed,  7 Sep 2016 15:42:24 +0300 Saeed Mahameed <sae...@mellanox.com> wrote:
> >  
> >> From: Tariq Toukan <tar...@mellanox.com>
> >>
> >> Instead of reallocating and mapping pages for the RX data-path,
> >> recycle already-used pages in a per-ring cache.
> >>
> >> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
> >>
> >> Single stride, 64 bytes:
> >> * 4,739,057 - baseline
> >> * 4,749,550 - order0 no cache
> >> * 4,786,899 - order0 with cache
> >> 1% gain
> >>
> >> Larger packets, no page cross, 1024 bytes:
> >> * 3,982,361 - baseline
> >> * 3,845,682 - order0 no cache
> >> * 4,127,852 - order0 with cache
> >> 3.7% gain
> >>
> >> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> >> * 3,731,189 - baseline
> >> * 3,579,414 - order0 no cache
> >> * 3,931,708 - order0 with cache
> >> 5.4% gain
> >>
> >> Signed-off-by: Tariq Toukan <tar...@mellanox.com>
> >> Signed-off-by: Saeed Mahameed <sae...@mellanox.com>
> >> ---
> >>   drivers/net/ethernet/mellanox/mlx5/core/en.h       | 16 ++++++
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 15 ++++++
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    | 57 ++++++++++++++++++++--
> >>   drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++++++
> >>   4 files changed, 99 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> index 075cdfc..afbdf70 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> @@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
> >>    u8                                      tired;
> >>   };
> >>   
> >> +/* a single cache unit is capable of serving one napi call (for non-striding rq)
> >> + * or a MPWQE (for striding rq).
> >> + */
> >> +#define MLX5E_CACHE_UNIT  (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
> >> +                           MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
> >> +#define MLX5E_CACHE_SIZE  (2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
> >> +struct mlx5e_page_cache {
> >> +  u32 head;
> >> +  u32 tail;
> >> +  struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
> >> +};
> >> +  
> > [...]  
> >>   
> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> index c1cb510..8e02af3 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> >> @@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> >>    mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> >>   }
> >>   
> >> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> >> +                                struct mlx5e_dma_info *dma_info)
> >> +{
> >> +  struct mlx5e_page_cache *cache = &rq->page_cache;
> >> +  u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
> >> +
> >> +  if (tail_next == cache->head) {
> >> +          rq->stats.cache_full++;
> >> +          return false;
> >> +  }
> >> +
> >> +  cache->page_cache[cache->tail] = *dma_info;
> >> +  cache->tail = tail_next;
> >> +  return true;
> >> +}
> >> +
> >> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> >> +                                struct mlx5e_dma_info *dma_info)
> >> +{
> >> +  struct mlx5e_page_cache *cache = &rq->page_cache;
> >> +
> >> +  if (unlikely(cache->head == cache->tail)) {
> >> +          rq->stats.cache_empty++;
> >> +          return false;
> >> +  }
> >> +
> >> +  if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> >> +          rq->stats.cache_busy++;
> >> +          return false;
> >> +  }  
> > Hmmm... doesn't this "block" the page_cache recycle facility until
> > the page at the head of the queue gets its (page) refcnt
> > decremented?  A real use-case could fairly easily trigger this,
> > e.g. an SKB sitting in a socket receive queue keeping the page
> > refcount elevated...
> Hi Jesper,
> 
> That's right, we are aware of this issue.
> We considered ways of solving it, but decided to keep the current 
> implementation for now.
> One option is to look deeper into the cache when the head is busy.
> Cons:
> - It costs time, and the chance of finding an available page is not 
> that high: since the cache is FIFO, if the page at the head of the 
> queue is busy, there's a good chance all the others are too. In 
> other words, you scan all the pages and still end up allocating a 
> new one (a higher penalty for the same decision).
> - It leaves holes in the array, complicating the accounting when 
> looking for an available page (this is easily fixed by swapping the 
> available page with the one at the head), as in the sketch below.
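> 
> A rough, untested sketch of that "scan deeper and swap" variant, 
> reusing the helpers from the patch above (swap() is the macro from 
> linux/kernel.h; the stats counters are the ones the patch adds):
> 
> static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> 				      struct mlx5e_dma_info *dma_info)
> {
> 	struct mlx5e_page_cache *cache = &rq->page_cache;
> 	u32 i;
> 
> 	if (unlikely(cache->head == cache->tail)) {
> 		rq->stats.cache_empty++;
> 		return false;
> 	}
> 
> 	/* scan from head towards tail; on a hit, swap the available
> 	 * entry to the head so the ring stays hole-free
> 	 */
> 	for (i = cache->head; i != cache->tail;
> 	     i = (i + 1) & (MLX5E_CACHE_SIZE - 1)) {
> 		if (page_ref_count(cache->page_cache[i].page) != 1)
> 			continue;
> 		swap(cache->page_cache[i], cache->page_cache[cache->head]);
> 		*dma_info = cache->page_cache[cache->head];
> 		cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
> 		return true;
> 	}
> 
> 	/* all cached pages are still held by someone else */
> 	rq->stats.cache_busy++;
> 	return false;
> }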
> 
> Another option is sharing pages between different RQs.
> - For now we're not doing this, for simplicity and to avoid 
> cross-RQ synchronization.
> 
> What do you think?
> 
> Anyway, we're looking forward to using your page-pool API, which 
> solves these issues.

Yes, as you mention yourself, the page-pool API solves this problem.
Thus, I'm not sure it's worth investing more time in optimizing this
driver-local page-cache mechanism.
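
For reference, the intended driver-facing usage of the page-pool is
along these lines (a sketch only; the API is still under discussion,
so names and signatures may change):

  #include <net/page_pool.h>

  /* setup: one pool per RQ; recycled pages return to this pool */
  struct page_pool_params pp_params = {
          .order     = 0,
          .pool_size = ring_size,       /* driver's RX ring size */
          .nid       = numa_node,       /* preferred NUMA node */
          .dev       = &pdev->dev,
          .dma_dir   = DMA_FROM_DEVICE,
  };
  struct page_pool *pool = page_pool_create(&pp_params);

  /* RX refill: returns a recycled page when one is free, else a
   * freshly allocated one -- no head-of-queue blocking
   */
  struct page *page = page_pool_dev_alloc_pages(pool);

  /* error/teardown path: hand the page back to the pool */
  page_pool_put_page(pool, page, false);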

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer