Hi Alex,

I also think the existing network stack overhead is too much for a high-bandwidth network. However, I would have thought that, with the system resources we have, the Intel card should handle 1 x 10 Gbps of traffic out of the box, regardless of the traffic type. What I don't understand is that you keep saying one CPU is doing all the work; if that is the case, please tell me how to get multiple CPUs working on it. I thought Don had already verified that RSS does spread the traffic across 8 queues.
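To make sure we are looking at the same thing: is something along these lines the right way to confirm whether more than one CPU is actually servicing the receive traffic? (eth2 is just a placeholder for our real interface name.)

    # per-queue receive counters reported by the ixgbe driver
    ethtool -S eth2 | grep -E 'rx_queue_[0-9]+_(packets|bytes)'

    # which CPUs the RX/TX queue interrupt vectors are landing on
    grep eth2 /proc/interrupts

    # per-CPU softirq load while the traffic is running (sysstat package)
    mpstat -P ALL 1

I have also listed, at the bottom of this mail, the perf and ethtool steps I plan to try next based on your suggestions.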
Hank

On Thu, Sep 8, 2016 at 12:44 PM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> Hi Hank,
>
> So I don't think you quite understand how the protocol stack works.
> The driver operates in softirq context. Within that context we handle
> everything up to and including copying the incoming packet into the
> socket buffer. So when I say it is the socket layer having issues,
> what I mean is that the kernel networking stack is likely consuming a
> considerable amount of overhead, resulting in us not being able to
> process packets fast enough. If you want to try to determine the
> root cause, you may want to familiarize yourself with the "perf"
> utility. With that you can start to break down what is using the
> largest percentages of your CPU time, and you would likely be able
> to narrow things down more.
>
> At this point I don't think there is too much more we can do at the
> driver level. If your issue requires you to be able to receive
> multicast UDP at full line rate on a single CPU, then you might be
> better served by familiarizing yourself with the kernel stack itself
> and tuning it.
>
> - Alex
>
> On Thu, Sep 8, 2016 at 12:14 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> > Hi Alex,
> >
> > I appreciate your input. As you pointed out, we would expect a
> > single queue on a single Xeon CPU core to handle, say, 5-6 Gb/s.
> > Then, with RSS in place, we should easily handle close to 10 Gb/s.
> > However, with 8 or 16 queues I still get pause frames, and without
> > any basic problems such as a wrong PCI setup or a wrong interrupt
> > mechanism. I can see the overhead in the network stack with a lot
> > of socket connections. However, unless the buffer is held by an
> > upper-layer protocol, the MAC driver should not be tied up; at
> > most, packets would be dropped at the protocol layer. That is the
> > puzzle to solve.
> >
> > I mentioned the Rx ring size mostly to see if it would trigger some
> > thought in that direction. I know we will still run into the
> > problem; it is just a matter of time.
> >
> > About the traffic pattern, I have no control over what it could be.
> > Customers can set up whatever makes sense to them, so I have to
> > prepare for the worst case; just like cache misses, we need to
> > handle it anyway.
> >
> > Hank
> >
> > On Thu, Sep 8, 2016 at 10:16 AM, Alexander Duyck
> > <alexander.du...@gmail.com> wrote:
> >>
> >> So the lspci looks good. It looks like everything is optimal there.
> >> From what I can tell, the NICs are in slots associated with NUMA
> >> node 0.
> >>
> >> So my thought on all this is that what is likely limiting your
> >> throughput is the packet processing overhead associated with the
> >> fact that your frames are around 1300 bytes in length and the
> >> addresses are multicast. Normally I would expect a single
> >> queue/thread, optimally configured, to process somewhere around
> >> 8 Gb/s. The fact is the kernel itself normally can't handle much
> >> more than that without disabling features such as iptables.
> >>
> >> It seems like your workload isn't scaling when you add additional
> >> CPUs, as almost all of your traffic is being delivered on just a
> >> small set of queues. In order for RSS to be able to spread out
> >> traffic there need to be enough differences between the flows.
> >> From what I can tell, the variance in multicast addresses is not
> >> great enough to make a substantial difference in where the flows
> >> are sent, as only 8 out of the 16 available queues are being used.
> >> It might be useful if you could post a packet capture showing a
> >> slice of a few thousand frames; if I am not mistaken, that should
> >> be about 2 MB or 3 MB. The general idea is to get an idea of what
> >> the flow looks like: whether it is a serialized flow that bursts
> >> all of the traffic for one flow at a time, or whether the flows
> >> are well interleaved. It would also tell me why we aren't seeing a
> >> good distribution, as enabling UDP RSS via
> >> "ethtool -N <iface> rx-flow-hash udp4 sdfn" should have given us a
> >> good spread of the traffic, and from the sound of things we just
> >> aren't seeing that.
> >>
> >> Finally, adding additional buffering by increasing the ring size
> >> to 8192 wouldn't provide any additional throughput; if anything it
> >> would just slow things down more. For more information on the
> >> effect, you could search the internet for the term "buffer bloat".
> >> One thing you might try is reducing the Rx ring size to 256
> >> buffers instead of 512. Sometimes that can provide a small bit of
> >> improvement, as it reduces the descriptor ring size to one 4K page
> >> instead of the 8K it normally uses. When multiple rings are active
> >> simultaneously, this can reduce the cache footprint, which in turn
> >> can improve cache utilization since you are less likely to evict
> >> data out of the L3 cache that was placed there by DDIO.
> >>
> >> I hope you find some of this information useful.
> >>
> >> - Alex
> >>
> >> On Thu, Sep 8, 2016 at 8:58 AM, Hank Liu <hank.tz...@gmail.com> wrote:
> >> > Hi Alex,
> >> >
> >> > see attached. thanks!
> >> >
> >> > Hank
> >> >
> >> > On Wed, Sep 7, 2016 at 7:32 PM, Alexander Duyck
> >> > <alexander.du...@gmail.com> wrote:
> >> >>
> >> >> Can you send me an lspci -vvv dump for the card? The main piece
> >> >> I am interested in seeing is the link status register output. I
> >> >> just want to verify that you are linked at x8 gen2.
> >> >>
> >> >> - Alex
> >> >>
> >> >> On Wed, Sep 7, 2016 at 4:00 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> >> >> > Nope, no help. Still seeing pause frames or rx_no_dma_resource
> >> >> > when BW is up to 8 Gbps...
> >> >> >
> >> >> > On Wed, Sep 7, 2016 at 3:42 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> >> >> >>
> >> >> >> Hi Alexander,
> >> >> >>
> >> >> >> Thanks for your input. Will give it a try.
> >> >> >>
> >> >> >> Hank
> >> >> >>
> >> >> >> On Wed, Sep 7, 2016 at 3:23 PM, Alexander Duyck
> >> >> >> <alexander.du...@gmail.com> wrote:
> >> >> >>>
> >> >> >>> On Wed, Sep 7, 2016 at 2:19 PM, Rustad, Mark D
> >> >> >>> <mark.d.rus...@intel.com> wrote:
> >> >> >>> > Hank Liu <hank.tz...@gmail.com> wrote:
> >> >> >>> >
> >> >> >>> >>> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> >> >> >>> >>> *Sent:* Wednesday, September 07, 2016 10:20 AM
> >> >> >>> >>> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> >> >> >>> >>> *Cc:* e1000-devel@lists.sourceforge.net
> >> >> >>> >>> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM
> >> >> >>> >>> cards for 10G SFPs UDP performance issue
> >> >> >>> >>>
> >> >> >>> >>> Thanks for the quick response and help. I guess I didn't
> >> >> >>> >>> make it clear that the application (receiver, sender)
> >> >> >>> >>> opens 240 connections and each connection carries 34 Mbps
> >> >> >>> >>> of traffic.
> >> >> >>> >
> >> >> >>> > You say that there are 240 connections, but how many
> >> >> >>> > threads is your app using? One per connection? What does
> >> >> >>> > the CPU utilization look like on the receiving end?
> >> >> >>> >
> >> >> >>> > Also, the current ATR implementation does not support UDP,
> >> >> >>> > so you are probably better off not pinning the app threads
> >> >> >>> > at all and trusting that the scheduler will migrate them to
> >> >> >>> > the CPU that is getting their packets via RSS. You should
> >> >> >>> > still set the affinity of the interrupts in that case. The
> >> >> >>> > default number of queues should be fine.
> >> >> >>>
> >> >> >>> If you are running point to point with UDP traffic and are
> >> >> >>> not fragmenting packets, I would recommend enabling RSS for
> >> >> >>> UDP flows. You can do that via the following command:
> >> >> >>> ethtool -N <interface> rx-flow-hash udp4 sdfn
> >> >> >>>
> >> >> >>> That should allow the work to spread to more queues than just
> >> >> >>> the one that is currently being selected based on your source
> >> >> >>> and destination IP addresses.
> >> >> >>>
> >> >> >>> - Alex
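For reference, here is roughly the sequence I intend to try next based on the suggestions above (eth2 and the CPU number are placeholders; I will pick the busiest CPU from /proc/interrupts and mpstat first):

    # profile the busiest receive CPU for 10 seconds while traffic is
    # flowing, then look at where the cycles are going (perf tool)
    perf record -g -C 2 -- sleep 10
    perf report --stdio | head -50

    # hash UDP flows on ports as well as addresses so RSS can spread them
    ethtool -N eth2 rx-flow-hash udp4 sdfn

    # shrink the Rx rings from 512 to 256 descriptors to reduce the
    # cache footprint, as suggested
    ethtool -G eth2 rx 256

    # grab a couple thousand frames so the flow interleaving can be checked
    tcpdump -i eth2 -c 2000 -w flows.pcap

I will post the perf output and the capture once I have them.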