On Mon, Mar 13, 2017 at 1:23 PM, Alexei Starovoitov
<alexei.starovoi...@gmail.com> wrote:
> On Mon, Mar 13, 2017 at 11:58:05AM -0700, Eric Dumazet wrote:
>> On Mon, Mar 13, 2017 at 11:31 AM, Alexei Starovoitov
>> <alexei.starovoi...@gmail.com> wrote:
>> > On Mon, Mar 13, 2017 at 10:50:28AM -0700, Eric Dumazet wrote:
>> >> On Mon, Mar 13, 2017 at 10:34 AM, Alexei Starovoitov
>> >> <alexei.starovoi...@gmail.com> wrote:
>> >> > On Sun, Mar 12, 2017 at 05:58:47PM -0700, Eric Dumazet wrote:
>> >> >> @@ -767,10 +814,30 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>> >> >>  		case XDP_PASS:
>> >> >>  			break;
>> >> >>  		case XDP_TX:
>> >> >> +			/* Make sure we have one page ready to replace this one */
>> >> >> +			npage = NULL;
>> >> >> +			if (!ring->page_cache.index) {
>> >> >> +				npage = mlx4_alloc_page(priv, ring,
>> >> >> +							&ndma, numa_mem_id(),
>> >> >> +							GFP_ATOMIC | __GFP_MEMALLOC);
>> >> >
>> >> > did you test this with xdp2 test ?
>> >> > under what conditions it allocates ?
>> >> > It looks dangerous from security point of view to do allocations here.
>> >> > Can it be exploited by an attacker?
>> >> > we use xdp for ddos and lb and this is fast path.
>> >> > If 1 out of 100s XDP_TX packets hit this allocation we will have serious
>> >> > perf regression.
>> >> > In general I dont think it's a good idea to penalize x86 in favor of powerpc.
>> >> > Can you #ifdef this new code somehow? so we won't have these concerns on x86?
>> >>
>> >> Normal paths would never hit this point really. I wanted to be extra
>> >> safe, because who knows, some guys could be tempted to set
>> >> ethtool -G ethX rx 512 tx 8192
>> >>
>> >> Before this patch, if you were able to push enough frames in TX ring,
>> >> you would also eventually be forced to allocate memory, or drop frames...
>> >
>> > hmm. not following.
>> > Into xdp tx queues packets don't come from stack. It can only be via xdp_tx.
>> > So this rx page belongs to driver, not shared with anyone and it only needs to
>> > be put onto tx ring, so I don't understand why driver needs to allocating
>> > anything here. To refill the rx ring? but why here?
>>
>> Because RX ring can be shared, by packets goind to the upper stacks (say TCP)
>>
>> So there is no guarantee that the pages in the quarantine pool have
>> their page count to 1.
>>
>> The normal TX_XDP path will recycle pages in ring->cache_page .
>>
>> This is exactly where I pick up a replacement.
>>
>> Pages in ring->cache_page have the guarantee to have no other users
>> than ourself (mlx4 driver)
>>
>> You might have not noticed that current mlx4 driver has a lazy refill
>> of RX ring buffers, that eventually removes all the pages from RX ring,
>> and we have to recover with this lazy mlx4_en_recover_from_oom() thing
>> that will attempt to restart the allocations.
>>
>> After my patch, we have the guarantee that the RX ring buffer is
>> always fully populated.
>>
>> When we receive a frame (XDP or not), we drop it if we can not
>> replenish the RX slot, in case the oldest page in quarantine is not
>> a recycling candidate and we can not allocate a new page.
>
> Got it. Could you please add above explanation to the commit log,
> since it's a huge difference vs other drivers.
> I don't think any other driver in the tree follows this strategy.
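To make the strategy above concrete, here is a rough, self-contained
sketch of the idea (the names are invented for illustration, this is
not the patch itself): the RX slot is replenished before the received
page is given away, first from the recycle pool and then from a fresh
allocation; if neither works, the frame is dropped and the old page
stays in the ring.

/* Illustrative model only, not the mlx4 code; all names are invented. */
#include <stddef.h>
#include <stdlib.h>

struct rx_slot {
	void *page;			/* page currently owned by the RX ring */
};

/* Try the recycle pool first, then a fresh allocation. */
void *get_replacement_page(void **recycle_pool, size_t *pool_len)
{
	if (*pool_len)
		return recycle_pool[--(*pool_len)];
	return malloc(4096);		/* stands in for a GFP_ATOMIC page allocation */
}

/* Returns the received page, or NULL meaning "drop the frame". */
void *rx_complete(struct rx_slot *slot, void **recycle_pool, size_t *pool_len)
{
	void *npage = get_replacement_page(recycle_pool, pool_len);
	void *rx_page = slot->page;

	if (!npage)
		return NULL;		/* cannot replenish: drop, old page stays in the ring */

	slot->page = npage;		/* the RX slot is repopulated right away */
	return rx_page;			/* now safe to hand this page to XDP_TX or the stack */
}

Under this model the RX ring stays fully populated at all times, which
is the invariant described above.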
sk_buff allocation can fail before considering adding a frag on the skb
anyway...

Look, all this is completely irrelevant for XDP_TX, since there is no
allocation at all once ~128 pages have been put into page_cache.

Do you want to prefill this cache when XDP is loaded ?

> I think that's the right call, but it shouldn't be hidden in details.

ring->page_cache contains the pages used by XDP_TX; they are recycled
through mlx4_en_rx_recycle(). There is a nice comment there. I did not
change this part.

These pages _used_ to be taken directly from mlx4_en_prepare_rx_desc(),
because there was no page recycling for the normal path.

All my patch does is add a generic implementation of page recycling,
leaving the XDP_TX pages in their own pool, because for those we do not
even have to check the page count, which would add a cache line miss.

So I do have 2 different pools, simply to keep the XDP_TX path ultra
fast, as before (see the sketch at the end of this mail). That does not
look like a detail to me.

> If that's the right approach (probably is) we should do it
> in the other drivers too.
> Though I don't see you dropping "to the stack" packet in this patch.
> The idea is to drop the packet (whether xdp or not) if rx
> buffer cannot be replinshed _regardless_ whether driver
> implements recycling, right?
>
> Theory vs practice. We don't see this issue running xdp.

Sure, so let's ignore the syzkaller guys and stop doing code reviews,
if all this is working for you.

> My understanding, since rx and tx napi are scheduled for the same cpu
> there is a guarantee that during normal invocation they won't race
> and the only problem is due to mlx4_en_recover_from_oom() that can
> run rx napi on some random cpu.

Exactly. So under pressure, the host will panic :/

The same happens if one application uses busy polling, or if the IRQ is
moved to another cpu (/proc/irq/*/smp_affinity).

Looks fragile, but whatever, maybe you control everything on your hosts.
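To illustrate the two pools mentioned above (an illustrative model only,
with invented names, not the driver code): pages coming back from XDP_TX
completions go into a small per-ring LIFO that nothing else can see, so
reusing them needs no page-count check at all, while pages that went to
the stack can only be reused once their count shows we are the last user.

/* Illustrative model only; field and function names are invented. */
#include <stdbool.h>
#include <stddef.h>

#define XDP_CACHE_SIZE 128

struct page_like {
	int refcount;
};

struct rx_ring_model {
	/* Pages that only ever went to XDP_TX: private to the driver. */
	struct page_like *xdp_cache[XDP_CACHE_SIZE];
	size_t xdp_cache_index;
};

/* XDP_TX completion: stash the page without touching its refcount. */
bool xdp_tx_recycle(struct rx_ring_model *ring, struct page_like *page)
{
	if (ring->xdp_cache_index >= XDP_CACHE_SIZE)
		return false;			/* cache full, caller releases the page */
	ring->xdp_cache[ring->xdp_cache_index++] = page;
	return true;
}

/* Normal path: an skb may still hold a reference, so check first. */
bool can_reuse_stack_page(const struct page_like *page)
{
	return page->refcount == 1;		/* we are the only remaining user */
}

Keeping the XDP_TX pool separate is what avoids the extra cache line
miss on the hot path.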