On Tue, 14 Feb 2017 11:06:25 -0800
Alexander Duyck <alexander.du...@gmail.com> wrote:

> On Tue, Feb 14, 2017 at 10:46 AM, Jesper Dangaard Brouer
> <bro...@redhat.com> wrote:
> > On Tue, 14 Feb 2017 09:29:54 -0800
> > Alexander Duyck <alexander.du...@gmail.com> wrote:
> >  
> >> On Tue, Feb 14, 2017 at 6:56 AM, Tariq Toukan <ttoukan.li...@gmail.com> 
> >> wrote:  
> >> >
> >> >
> >> > On 14/02/2017 3:45 PM, Eric Dumazet wrote:  
> >> >>
> >> >> On Tue, Feb 14, 2017 at 4:12 AM, Jesper Dangaard Brouer
> >> >> <bro...@redhat.com> wrote:
> >> >>  
> >> >>> It is important to understand that there are two cases for the cost of
> >> >>> an atomic op, which depend on the cache-coherency state of the
> >> >>> cacheline.
> >> >>>
> >> >>> Measured on Skylake CPU i7-6700K CPU @ 4.00GHz
> >> >>>
> >> >>> (1) Local CPU atomic op :  27 cycles(tsc)  6.776 ns
> >> >>> (2) Remote CPU atomic op: 260 cycles(tsc) 64.964 ns
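> >> >>>
> >> >>> For reference, a minimal userspace sketch of how numbers like
> >> >>> these can be measured (this is not the actual benchmark module
> >> >>> used above; to reproduce case (2), pin a second thread that keeps
> >> >>> dirtying the counter on a remote core):
> >> >>>
> >> >>> #include <stdatomic.h>
> >> >>> #include <stdint.h>
> >> >>> #include <stdio.h>
> >> >>> #include <x86intrin.h>
> >> >>>
> >> >>> #define LOOPS 10000000UL
> >> >>>
> >> >>> static _Atomic long counter;
> >> >>>
> >> >>> int main(void)
> >> >>> {
> >> >>>         uint64_t start = __rdtsc();
> >> >>>         unsigned long i;
> >> >>>
> >> >>>         for (i = 0; i < LOOPS; i++)
> >> >>>                 atomic_fetch_add(&counter, 1);  /* lock xadd */
> >> >>>
> >> >>>         printf("%.1f cycles(tsc) per atomic op\n",
> >> >>>                (double)(__rdtsc() - start) / LOOPS);
> >> >>>         return 0;
> >> >>> }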
> >> >>>  
> >> >> Okay, it seems you guys really want a patch that I said was not
> >> >> giving good results.
> >> >>
> >> >> Let me publish the numbers I get, with and without the last (and
> >> >> unofficial) patch.
> >> >>
> >> >> If I _force_ the user space process to run on the other node,
> >> >> then the results are not the ones Alex or you are expecting.
> >> >>
> >> >> With this patch I get about 2.7 Mpps on this silly single TCP
> >> >> flow, and 3.5 Mpps without it.
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10 | grep eth0 | grep Ave
> >> >> Average:         eth0 2699243.20  16663.70 1354783.36   1079.95      0.00      0.00      4.50
> >> >>
> >> >> Profile of the CPU on NUMA node 1 (netserver consuming data):
> >> >>
> >> >>      54.73%  [kernel]      [k] copy_user_enhanced_fast_string
> >> >>      31.07%  [kernel]      [k] skb_release_data
> >> >>       4.24%  [kernel]      [k] skb_copy_datagram_iter
> >> >>       1.35%  [kernel]      [k] copy_page_to_iter
> >> >>       0.98%  [kernel]      [k] _raw_spin_lock
> >> >>       0.90%  [kernel]      [k] skb_release_head_state
> >> >>       0.60%  [kernel]      [k] tcp_transmit_skb
> >> >>       0.51%  [kernel]      [k] mlx4_en_xmit
> >> >>       0.33%  [kernel]      [k] ___cache_free
> >> >>       0.28%  [kernel]      [k] tcp_rcv_established
> >> >>
> >> >> Profile of the CPU handling mlx4 softirqs (NUMA node 0):
> >> >>
> >> >>
> >> >>      48.00%  [kernel]          [k] mlx4_en_process_rx_cq
> >> >>      12.92%  [kernel]          [k] napi_gro_frags
> >> >>       7.28%  [kernel]          [k] inet_gro_receive
> >> >>       7.17%  [kernel]          [k] tcp_gro_receive
> >> >>       5.10%  [kernel]          [k] dev_gro_receive
> >> >>       4.87%  [kernel]          [k] skb_gro_receive
> >> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >> >>       2.04%  [kernel]          [k] __build_skb
> >> >>       1.02%  [kernel]          [k] napi_reuse_skb.isra.95
> >> >>       1.01%  [kernel]          [k] tcp4_gro_receive
> >> >>       0.65%  [kernel]          [k] kmem_cache_alloc
> >> >>       0.45%  [kernel]          [k] _raw_spin_lock
> >> >>
> >> >> Without the latest patch (i.e. the exact patch series v3 I
> >> >> submitted), and thus with the atomic_inc() in
> >> >> mlx4_en_process_rx_cq instead of only reads:
> >> >>
> >> >> lpaa24:~# sar -n DEV 1 10|grep eth0|grep Ave
> >> >> Average:         eth0 3566768.50  25638.60 1790345.69   1663.51      0.00      0.00      4.50
> >> >>
> >> >> Profiles of the two CPUs:
> >> >>
> >> >>      74.85%  [kernel]      [k] copy_user_enhanced_fast_string
> >> >>       6.42%  [kernel]      [k] skb_release_data
> >> >>       5.65%  [kernel]      [k] skb_copy_datagram_iter
> >> >>       1.83%  [kernel]      [k] copy_page_to_iter
> >> >>       1.59%  [kernel]      [k] _raw_spin_lock
> >> >>       1.48%  [kernel]      [k] skb_release_head_state
> >> >>       0.72%  [kernel]      [k] tcp_transmit_skb
> >> >>       0.68%  [kernel]      [k] mlx4_en_xmit
> >> >>       0.43%  [kernel]      [k] page_frag_free
> >> >>       0.38%  [kernel]      [k] ___cache_free
> >> >>       0.37%  [kernel]      [k] tcp_established_options
> >> >>       0.37%  [kernel]      [k] __ip_local_out
> >> >>
> >> >>
> >> >>     37.98%  [kernel]          [k] mlx4_en_process_rx_cq
> >> >>      26.47%  [kernel]          [k] napi_gro_frags
> >> >>       7.02%  [kernel]          [k] inet_gro_receive
> >> >>       5.89%  [kernel]          [k] tcp_gro_receive
> >> >>       5.17%  [kernel]          [k] dev_gro_receive
> >> >>       4.80%  [kernel]          [k] skb_gro_receive
> >> >>       2.61%  [kernel]          [k] __build_skb
> >> >>       2.45%  [kernel]          [k] mlx4_en_prepare_rx_desc
> >> >>       1.59%  [kernel]          [k] napi_reuse_skb.isra.95
> >> >>       0.95%  [kernel]          [k] tcp4_gro_receive
> >> >>       0.51%  [kernel]          [k] kmem_cache_alloc
> >> >>       0.42%  [kernel]          [k] __inet_lookup_established
> >> >>       0.34%  [kernel]          [k] swiotlb_sync_single_for_cpu
> >> >>
> >> >>
> >> >> So this will probably need further analysis, outside the scope of
> >> >> this patch series.
> >> >>
> >> >> Could we now please Ack this v3 and merge it?
> >> >>
> >> >> Thanks.  
> >> >
> >> > Thanks Eric.
> >> >
> >> > As the previous series caused hangs, we must run functional
> >> > regression tests over this series as well.
> >> > The run has already started, and results will be available
> >> > tomorrow morning.
> >> >
> >> > In general, I really like this series. The refactoring looks more
> >> > elegant and functionally more correct.
> >> >
> >> > However, performance-wise, we fear that the numbers will be
> >> > drastically lower with this transition to order-0 pages, because
> >> > the page allocator and DMA operations become critical bottlenecks,
> >> > especially on systems with costly DMA operations, such as ARM or
> >> > iommu=on.
> >>
> >> So to give you an idea: I originally came up with the approach used
> >> in the Intel drivers when I was dealing with a PowerPC system where
> >> the IOMMU was a requirement.  With this set up correctly, the
> >> map/unmap calls should be almost non-existent.  Basically, the only
> >> time we should have to allocate or free a page is if something is
> >> sitting on the pages for an excessively long time, or if the
> >> interrupt is bouncing between memory nodes, which forces us to free
> >> the memory since it is sitting on the wrong node (see the sketch
> >> below).
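> >>
> >> A minimal sketch of that reuse test (struct fields and function
> >> names here are illustrative, not lifted from any actual driver):
> >>
> >> struct rx_buffer {
> >>         struct page *page;
> >>         dma_addr_t dma;
> >>         unsigned int page_offset;
> >> };
> >>
> >> /* Keep the page mapped and recycle it, unless someone else still
> >>  * holds a reference or the page sits on the wrong NUMA node (e.g.
> >>  * the IRQ moved), in which case unmap and free it.
> >>  */
> >> static bool rx_buf_reuse(struct device *dev, struct rx_buffer *buf)
> >> {
> >>         struct page *page = buf->page;
> >>
> >>         if (page_ref_count(page) != 1 ||
> >>             page_to_nid(page) != numa_mem_id()) {
> >>                 dma_unmap_page(dev, buf->dma, PAGE_SIZE,
> >>                                DMA_FROM_DEVICE);
> >>                 put_page(page);
> >>                 return false;   /* caller maps a fresh page */
> >>         }
> >>
> >>         /* Fast path: no allocator or IOMMU work at all. */
> >>         dma_sync_single_for_device(dev, buf->dma, PAGE_SIZE,
> >>                                    DMA_FROM_DEVICE);
> >>         return true;
> >> }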
> >>  
> >> > We already have this exact issue in mlx5, where we moved to
> >> > order-0 allocations with a fixed-size cache, but that was not
> >> > enough.
> >> > Customers of mlx5 have already complained about the performance
> >> > degradation, and currently this is hurting our business.
> >> > We get a clear nack from our performance regression team regarding
> >> > doing the same in mlx4.
> >>
> >> What form of recycling were you doing?  If you were doing the
> >> offset-based setup, then that obviously only gets you so much.  The
> >> advantage of using the page count is that you get almost a Möbius
> >> strip effect for your buffers and descriptor rings, where you just
> >> keep flipping the page offset back and forth via an XOR, and the
> >> only cost is dma_sync_for_cpu, get_page, dma_sync_for_device instead
> >> of having to do the full allocation, mapping, and unmapping, as
> >> sketched below.
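> >>
> >> Roughly, reusing the rx_buffer sketch above (RX_BUF_HALF standing
> >> in for PAGE_SIZE / 2; again purely illustrative):
> >>
> >> static void rx_buf_flip(struct device *dev, struct rx_buffer *buf)
> >> {
> >>         /* Hand the just-filled half to the CPU/stack... */
> >>         dma_sync_single_range_for_cpu(dev, buf->dma,
> >>                                       buf->page_offset, RX_BUF_HALF,
> >>                                       DMA_FROM_DEVICE);
> >>
> >>         /* ...keep a reference so the stack can hold its half... */
> >>         get_page(buf->page);
> >>
> >>         /* ...and flip to the other half for the next descriptor. */
> >>         buf->page_offset ^= RX_BUF_HALF;
> >>         dma_sync_single_range_for_device(dev, buf->dma,
> >>                                          buf->page_offset,
> >>                                          RX_BUF_HALF,
> >>                                          DMA_FROM_DEVICE);
> >> }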
> >>  
> >> > So, the question is: can we live with this degradation until those
> >> > bottleneck challenges are addressed?
> >> > Following our perf experts' feedback, I cannot simply Ack.  We need
> >> > a clear plan to close the perf gap or reduce the impact.
> >>
> >> I think we need to define what degradation you are expecting to
> >> see.  Using the page-count-based approach should actually be better
> >> in some cases than the order-3 page approach of just walking frags,
> >> since the theoretical reuse is infinite instead of fixed.
> >>  
> >> > Internally, I already implemented "dynamic page-cache" and
> >> > "page-reuse" mechanisms in the driver, and together they totally
> >> > bridge the performance gap.
> >> > That's why I would like to hear from Jesper what the status of his
> >> > page_pool API is; it is promising and could totally solve these
> >> > issues.
> >> >
> >> > Regards,
> >> > Tariq  
> >>
> >> The page pool may provide gains, but we have to actually see them
> >> before we can guarantee anything.  If worse comes to worst, I might
> >> just resort to standardizing the logic used for the Intel drivers'
> >> page-count-based approach.  Then, if nothing else, we would have a
> >> standard path for all the drivers to use if we start going down this
> >> route.
> >
> > With this Intel driver page-count-based recycling approach, the
> > recycle size is tied to the size of the RX ring, as Eric and Tariq
> > discovered.  And for other performance reasons (the memory footprint
> > of walking the RX ring data structures), we don't want to increase
> > the RX ring sizes.  Thus, it creates two opposing performance needs.
> > That is why I think a more explicit approach with a pool is more
> > attractive.
> >
> > How is this approach going to work for XDP?
> > (XDP doesn't "share" the page, and in general we don't want the
> > extra atomic.)
> 
> The extra atomic is moot, since we can get rid of it by doing bulk
> page count updates, along the lines of the sketch below.
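>
> Something along these lines (hypothetical names; pagecnt_bias would
> be a driver-private field next to the page pointer):
>
> /* Prepay a large batch of references with a single atomic, then
>  * consume them via the non-atomic bias counter.  Call once right
>  * after page allocation (refcount == 1, bias == 1).
>  */
> static void rx_buf_prepay(struct rx_buffer *buf)
> {
>         page_ref_add(buf->page, USHRT_MAX - 1); /* 1 atomic, ~64K uses */
>         buf->pagecnt_bias += USHRT_MAX - 1;
> }
>
> /* Hand one reference to the stack: no atomic op on the hot path. */
> static void rx_buf_take_ref(struct rx_buffer *buf)
> {
>         if (unlikely(--buf->pagecnt_bias == 1))
>                 rx_buf_prepay(buf);
> }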
> 
> Why can't XDP share a page?  You have repeatedly brought up the user
> space aspect of things, but the AF_PACKET patches John did more or
> less demonstrated that that isn't the case.  What you end up with is
> that you have to provide either pure user space pages or pure kernel
> pages.  You can't have pages being swapped between the two without
> introducing security issues.  So even with your pool approach it
> doesn't matter which way things are run: your pool is either
> device-to-user-space or device-to-kernel-space, and it can't be both
> without creating sharing concerns.

(I think we are talking past each other here.)

 
> > We absolutely need recycling with XDP, when transmitting out another
> > device, and the other devices DMA-TX completion need some way of
> > returning this page.  
> 
> I fully agree here.  However, in my opinion, the easiest way of
> handling this is via the page count.
> 
> > What is basically needed is a standardized callback to allow the
> > remote driver to return the page to the originating driver.  As we
> > don't have an NDO for XDP-forward/transmit yet, could we pass this
> > callback as a parameter along with the packet-page to send?
> > Something like the sketch below, perhaps.
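> >
> > Hypothetically, the shape could be something like this (nothing of
> > the sort exists yet; all names below are made up):
> >
> > struct xdp_page_return {
> >         void (*recycle)(void *pool, struct page *page);
> >         void *pool;             /* originating driver's pool/ring */
> > };
> >
> > /* The remote driver's DMA-TX completion would then do: */
> > static void xdp_tx_complete(struct xdp_page_return *ret,
> >                             struct page *page)
> > {
> >         if (ret && ret->recycle)
> >                 ret->recycle(ret->pool, page); /* back to RX driver */
> >         else
> >                 put_page(page);                /* fall back */
> > }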
> 
> No.  This assumes way too much.  Most packets aren't going to be
> routed device to device.  We have seen this type of thing and rejected
> it multiple times.  Don't think driver-to-driver.  This is driver to
> network stack, socket, device, virtual machine, storage, etc.  The
> fact is there are many spots where a frame might get terminated. [...]

I fully agree that XDP needs to think further than driver-to-driver.

 
> This is why I said before that what we need is a page destructor to
> handle this sort of thing.  The idea is that you want this to work
> everywhere.  Having just drivers do this would make it a nice toy but
> completely useless, since not too many people are doing
> routing/bridging between interfaces.  Using the page destructor, it is
> easy to create a pool of "free" pages that you can pull your DMA pages
> out of, or that you can release back into the page allocator (roughly
> as sketched below).
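>
> Roughly (illustrative only; struct page has no such hook today, so
> the destructor and pool types below are made up):
>
> struct dma_page_pool {
>         spinlock_t lock;
>         struct list_head free_list;     /* pages kept DMA-mapped */
> };
>
> /* Invoked on the final put_page(): park the page on the free list
>  * instead of handing it back to the page allocator.
>  */
> static void rx_page_destructor(struct page *page,
>                                struct dma_page_pool *pool)
> {
>         spin_lock(&pool->lock);
>         list_add(&page->lru, &pool->free_list);
>         spin_unlock(&pool->lock);
> }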

How is this page destructor different from my page_pool RFC
implementation?  (It basically functions as a page destructor...)

Help me understand: what am I missing?

Pointers to the page_pool discussions (a rough usage sketch follows the
links):
* page_pool RFC patchset:
 - http://lkml.kernel.org/r/20161220132444.18788.50875.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132812.18788.20431.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132817.18788.64726.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132822.18788.19768.stgit@firesoul
 - http://lkml.kernel.org/r/20161220132827.18788.8658.stgit@firesoul
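
For concreteness, the driver-side usage I have in mind looks roughly
like this (names and fields are approximate; see the patches above for
the actual RFC API):

        struct page_pool_params params = {
                .order          = 0,            /* order-0 pages */
                .pool_size      = 1024,
                .nid            = NUMA_NO_NODE,
                .dev            = &pdev->dev,   /* device doing DMA */
                .dma_dir        = DMA_FROM_DEVICE,
        };
        struct page_pool *pool = page_pool_create(&params);

        /* RX refill: pages come back pre-mapped from the pool. */
        struct page *page = page_pool_dev_alloc_pages(pool);

        /* Return path (the "destructor"): recycle instead of free. */
        page_pool_put_page(pool, page);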

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
