On Wed, Oct 30, 2024 at 04:40:10PM +0100, Lukáš Šišmiš wrote:
> 
> On 30. 10. 24 16:20, Stephen Hemminger wrote:
> > On Wed, 30 Oct 2024 14:58:40 +0100
> > Lukáš Šišmiš <sis...@cesnet.cz> wrote:
> > 
> > > On 29. 10. 24 15:37, Morten Brørup wrote:
> > > > > From: Lukas Sismis [mailto:sis...@cesnet.cz]
> > > > > Sent: Tuesday, 29 October 2024 13.49
> > > > > 
> > > > > Intel PMDs are capped by default to only 4096 RX/TX descriptors.
> > > > > This can be limiting for applications requiring bigger buffering
> > > > > capabilities. The cap prevented applications from configuring
> > > > > more descriptors. By buffering more packets in the RX/TX
> > > > > descriptor rings, applications can better handle processing
> > > > > peaks.
> > > > > 
> > > > > Signed-off-by: Lukas Sismis <sis...@cesnet.cz>
> > > > > ---
> > > > Seems like a good idea.
> > > > 
> > > > Has the max number of descriptors been checked against the datasheets
> > > > for all the affected NIC chips?
> > > I was hoping to get some feedback on this from the Intel folks.
> > > 
> > > But it seems like I can change it only for ixgbe (82599) to 32k
> > > (possibly to 64k - 8); the others - ice (E810) and i40e (X710) - are
> > > capped at 8k - 32.
> > > 
> > > I neither have experience with the other drivers nor have them
> > > available to test, so I will leave them as they are in the follow-up
> > > version of this patch.
> > > 
> > > Lukas
> > > 
> > Having a large number of descriptors, especially at lower speeds, will
> > increase buffer bloat. For real-life applications, you do not want to
> > increase latency by more than 1 ms.
> > 
> > 10 Gbps has 7.62 Gbps of effective bandwidth due to overhead.
> > The rate for a 1500-byte MTU is 7.62 Gbps / (1500 * 8) = 635 Kpps
> > (i.e. ~1.5 us per packet).
> > A ring of 4096 descriptors can therefore hold ~6 ms worth of full-size
> > packets.
> > 
> > Be careful: optimizing for 64-byte benchmarks can be a disaster in the
> > real world.
> > 
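As a side note, the arithmetic quoted above generalizes easily to other ring
sizes. The small sketch below only re-derives those numbers; it takes the
7.62 Gbps effective-rate figure from the quoted mail as given and assumes the
ring is completely full of 1500-byte packets (an illustration, not part of
the patch or the original thread):

#include <stdio.h>

int main(void)
{
	/* Effective 10G line rate after overhead, taken from the mail above. */
	const double eff_bps    = 7.62e9;
	const double pkt_bits   = 1500 * 8;            /* full-size packet */
	const double pps        = eff_bps / pkt_bits;  /* ~635 Kpps        */
	const double us_per_pkt = 1e6 / pps;           /* ~1.57 us         */

	for (unsigned int ring = 4096; ring <= 32768; ring *= 2)
		printf("ring=%5u -> worst-case buffering %.1f ms\n",
		       ring, ring * us_per_pkt / 1e3);
	return 0;
}

Run as-is, this prints roughly 6.5 ms of buffering for a 4096-entry ring and
~52 ms for a 32768-entry ring, which is the buffer-bloat concern in a
nutshell.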
> Thanks for the info Stephen, however I am not trying to optimize for
> 64-byte benchmarks. The work was initiated by an IO problem with Intel
> NICs. A Suricata IDS worker (1 core per queue) receives a burst of packets
> and then processes them sequentially, one by one. It seems like having 4k
> buffers is not enough. NVIDIA NICs allow e.g. 32k descriptors and that
> works fine. In the end it worked fine when the ixgbe descriptor limit was
> increased as well. I am not sure why AF_PACKET can handle this much better
> than DPDK; AF_PACKET doesn't have a crazy high number of descriptors
> configured (<= 4096), yet it works better. At the moment I assume there is
> internal buffering in the kernel which allows it to handle processing
> spikes.
> 
> To give more context here is the forum discussion - 
> https://forum.suricata.io/t/high-packet-drop-rate-with-dpdk-compared-to-af-packet-in-suricata-7-0-7/4896
> 

Thanks for the context, and it is an interesting discussion.

One small suggestion, which I sadly don't think will help with your
problem specifically: I suspect that you don't need both the Rx and Tx
queues to be that big. Given that the traffic going out is not going to be
greater than the traffic rate coming in, you shouldn't need much buffering
on the Tx side. Therefore, even if you increase the Rx ring to 32k, I'd
suggest using only 1k or 512 Tx ring slots and seeing how it goes. That
will give you better performance due to a reduced memory buffer footprint.
Any packet buffers transmitted will remain in the NIC ring until SW wraps
all the way around the ring, meaning a 4k Tx ring will likely always hold
4k-64 buffers in it, and similarly a 32k Tx ring will increase your active
buffer count (and hence app cache footprint) by 32k-64.
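In code terms, that asymmetric sizing would look roughly like the sketch
below. This is only an illustration, not part of the patch: the port id (0),
the single queue and the "mbuf_pool" mempool are assumptions, and
rte_eth_dev_adjust_nb_rx_tx_desc() clamps the requested counts to whatever
limits the PMD actually reports.

#include <string.h>

#include <rte_ethdev.h>
#include <rte_mempool.h>

static int
setup_port(uint16_t port, struct rte_mempool *mbuf_pool)
{
	struct rte_eth_conf conf;
	uint16_t nb_rxd = 32768; /* deep Rx ring to absorb processing spikes   */
	uint16_t nb_txd = 1024;  /* small Tx ring to limit in-flight mbuf count */
	int ret;

	memset(&conf, 0, sizeof(conf));

	ret = rte_eth_dev_configure(port, 1, 1, &conf);
	if (ret != 0)
		return ret;

	/* Clamp the requested ring sizes to the limits the PMD reports. */
	ret = rte_eth_dev_adjust_nb_rx_tx_desc(port, &nb_rxd, &nb_txd);
	if (ret != 0)
		return ret;

	ret = rte_eth_rx_queue_setup(port, 0, nb_rxd,
			rte_eth_dev_socket_id(port), NULL, mbuf_pool);
	if (ret < 0)
		return ret;

	ret = rte_eth_tx_queue_setup(port, 0, nb_txd,
			rte_eth_dev_socket_id(port), NULL);
	if (ret < 0)
		return ret;

	return rte_eth_dev_start(port);
}

The Rx side keeps the large ring for burst absorption while the Tx side
stays small, so the steady-state number of in-flight mbufs (and therefore
the mempool size and cache footprint) stays much lower.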

/Bruce
