Hi Alex,

I also think the existing network stack overhead is too much for a high-bandwidth network. However, I would have thought that, with the system resources we have, the Intel card should handle 1 x 10 Gbps of traffic out of the box, regardless of the traffic type. What I don't understand is that you keep saying one CPU is doing all the work; if that is the case, please tell me how to get multiple CPUs working on it. I thought Don had already verified that RSS does spread the traffic across 8 queues.
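To make sure we are looking at the same thing: is something along these lines the right way to confirm whether more than one CPU is actually servicing the receive traffic? (eth2 is just a placeholder for our real interface name.)

    # per-queue receive counters reported by the ixgbe driver
    ethtool -S eth2 | grep -E 'rx_queue_[0-9]+_(packets|bytes)'

    # which CPUs the RX/TX queue interrupt vectors are landing on
    grep eth2 /proc/interrupts

    # per-CPU softirq load while the traffic is running (sysstat package)
    mpstat -P ALL 1

I have also listed, at the bottom of this mail, the perf and ethtool steps I plan to try next based on your suggestions.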
Hank

On Thu, Sep 8, 2016 at 12:44 PM, Alexander Duyck
<alexander.du...@gmail.com> wrote:
> Hi Hank,
>
> So I don't think you quite understand how the protocol stack works.
> The driver operates in softirq context. Within that context we handle
> everything up to and including copying the incoming packet into the
> socket buffer. So when I say it is the socket layer having issues,
> what I mean is that the kernel networking stack is likely consuming a
> considerable amount of overhead, resulting in us not being able to
> process packets fast enough. If you want to try to determine the
> root cause, you may want to familiarize yourself with the "perf"
> utility. With that you can start to break down what is using the
> largest percentages of your CPU time, and you would likely be able
> to narrow things down more.
>
> At this point I don't think there is too much more we can do at the
> driver level. If your issue requires you to be able to receive
> multicast UDP at full line rate on a single CPU, then you might be
> better served by familiarizing yourself with the kernel stack itself
> and tuning it.
>
> - Alex
>
> On Thu, Sep 8, 2016 at 12:14 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> > Hi Alex,
> >
> > I appreciate your input. As you pointed out, we would expect a
> > single queue on a single Xeon CPU core to handle, say, 5-6 Gb/s.
> > Then, with RSS in place, we should easily handle close to 10 Gb/s.
> > However, with 8 or 16 queues I still get pause frames, and without
> > any basic problems such as a wrong PCI setup or a wrong interrupt
> > mechanism. I can see the overhead in the network stack with a lot
> > of socket connections. However, unless the buffer is held by an
> > upper-layer protocol, the MAC driver should not be tied up; at
> > most, packets would be dropped at the protocol layer. That is the
> > puzzle to solve.
> >
> > I mentioned the Rx ring size mostly to see if it would trigger some
> > thought in that direction. I know we will still run into the
> > problem; it is just a matter of time.
> >
> > About the traffic pattern, I have no control over what it could be.
> > Customers can set up whatever makes sense to them, so I have to
> > prepare for the worst case; just like cache misses, we need to
> > handle it anyway.
> >
> > Hank
> >
> > On Thu, Sep 8, 2016 at 10:16 AM, Alexander Duyck
> > <alexander.du...@gmail.com> wrote:
> >>
> >> So the lspci looks good. It looks like everything is optimal there.
> >> From what I can tell, the NICs are in slots associated with NUMA
> >> node 0.
> >>
> >> So my thought on all this is that what is likely limiting your
> >> throughput is the packet processing overhead associated with the
> >> fact that your frames are around 1300 bytes in length and the
> >> addresses are multicast. Normally I would expect a single
> >> queue/thread, optimally configured, to process somewhere around
> >> 8 Gb/s. The fact is the kernel itself normally can't handle much
> >> more than that without disabling features such as iptables.
> >>
> >> It seems like your workload isn't scaling when you add additional
> >> CPUs, as almost all of your traffic is being delivered on just a
> >> small set of queues. In order for RSS to be able to spread out
> >> traffic there need to be enough differences between the flows.
> >> From what I can tell, the variance in multicast addresses is not
> >> great enough to make a substantial difference in where the flows
> >> are sent, as only 8 out of the 16 available queues are being used.
> >> It might be useful if you could post a packet capture showing a
> >> slice of a few thousand frames; if I am not mistaken, that should
> >> be about 2 MB or 3 MB. The general idea is to get an idea of what
> >> the flow looks like: whether it is a serialized flow that bursts
> >> all of the traffic for one flow at a time, or whether the flows
> >> are well interleaved. It would also tell me why we aren't seeing a
> >> good distribution, as enabling UDP RSS via
> >> "ethtool -N <iface> rx-flow-hash udp4 sdfn" should have given us a
> >> good spread of the traffic, and from the sound of things we just
> >> aren't seeing that.
> >>
> >> Finally, adding additional buffering by increasing the ring size
> >> to 8192 wouldn't provide any additional throughput; if anything it
> >> would just slow things down more. For more information on the
> >> effect, you could search the internet for the term "buffer bloat".
> >> One thing you might try is reducing the Rx ring size to 256
> >> buffers instead of 512. Sometimes that can provide a small bit of
> >> improvement, as it reduces the descriptor ring size to one 4K page
> >> instead of the 8K it normally uses. When multiple rings are active
> >> simultaneously, this can reduce the cache footprint, which in turn
> >> can improve cache utilization since you are less likely to evict
> >> data out of the L3 cache that was placed there by DDIO.
> >>
> >> I hope you find some of this information useful.
> >>
> >> - Alex
> >>
> >> On Thu, Sep 8, 2016 at 8:58 AM, Hank Liu <hank.tz...@gmail.com> wrote:
> >> > Hi Alex,
> >> >
> >> > see attached. thanks!
> >> >
> >> > Hank
> >> >
> >> > On Wed, Sep 7, 2016 at 7:32 PM, Alexander Duyck
> >> > <alexander.du...@gmail.com> wrote:
> >> >>
> >> >> Can you send me an lspci -vvv dump for the card? The main piece
> >> >> I am interested in seeing is the link status register output. I
> >> >> just want to verify that you are linked at x8 gen2.
> >> >>
> >> >> - Alex
> >> >>
> >> >> On Wed, Sep 7, 2016 at 4:00 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> >> >> > Nope, no help. Still seeing pause frames or rx_no_dma_resource
> >> >> > when BW is up to 8 Gbps...
> >> >> >
> >> >> > On Wed, Sep 7, 2016 at 3:42 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> >> >> >>
> >> >> >> Hi Alexander,
> >> >> >>
> >> >> >> Thanks for your input. Will give it a try.
> >> >> >>
> >> >> >> Hank
> >> >> >>
> >> >> >> On Wed, Sep 7, 2016 at 3:23 PM, Alexander Duyck
> >> >> >> <alexander.du...@gmail.com> wrote:
> >> >> >>>
> >> >> >>> On Wed, Sep 7, 2016 at 2:19 PM, Rustad, Mark D
> >> >> >>> <mark.d.rus...@intel.com> wrote:
> >> >> >>> > Hank Liu <hank.tz...@gmail.com> wrote:
> >> >> >>> >
> >> >> >>> >>> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> >> >> >>> >>> *Sent:* Wednesday, September 07, 2016 10:20 AM
> >> >> >>> >>> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> >> >> >>> >>> *Cc:* e1000-devel@lists.sourceforge.net
> >> >> >>> >>> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM
> >> >> >>> >>> cards for 10G SFPs UDP performance issue
> >> >> >>> >>>
> >> >> >>> >>> Thanks for the quick response and help. I guess I didn't
> >> >> >>> >>> make it clear that the application (receiver, sender)
> >> >> >>> >>> opens 240 connections and each connection carries 34 Mbps
> >> >> >>> >>> of traffic.
> >> >> >>> >
> >> >> >>> > You say that there are 240 connections, but how many
> >> >> >>> > threads is your app using? One per connection? What does
> >> >> >>> > the CPU utilization look like on the receiving end?
> >> >> >>> >
> >> >> >>> > Also, the current ATR implementation does not support UDP,
> >> >> >>> > so you are probably better off not pinning the app threads
> >> >> >>> > at all and trusting that the scheduler will migrate them to
> >> >> >>> > the CPU that is getting their packets via RSS. You should
> >> >> >>> > still set the affinity of the interrupts in that case. The
> >> >> >>> > default number of queues should be fine.
> >> >> >>>
> >> >> >>> If you are running point to point with UDP traffic and are
> >> >> >>> not fragmenting packets, I would recommend enabling RSS for
> >> >> >>> UDP flows. You can do that via the following command:
> >> >> >>> ethtool -N <interface> rx-flow-hash udp4 sdfn
> >> >> >>>
> >> >> >>> That should allow the work to spread to more queues than just
> >> >> >>> the one that is currently being selected based on your source
> >> >> >>> and destination IP addresses.
> >> >> >>>
> >> >> >>> - Alex
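For reference, here is roughly the sequence I intend to try next based on the suggestions above (eth2 and the CPU number are placeholders; I will pick the busiest CPU from /proc/interrupts and mpstat first):

    # profile the busiest receive CPU for 10 seconds while traffic is
    # flowing, then look at where the cycles are going (perf tool)
    perf record -g -C 2 -- sleep 10
    perf report --stdio | head -50

    # hash UDP flows on ports as well as addresses so RSS can spread them
    ethtool -N eth2 rx-flow-hash udp4 sdfn

    # shrink the Rx rings from 512 to 256 descriptors to reduce the
    # cache footprint, as suggested
    ethtool -G eth2 rx 256

    # grab a couple thousand frames so the flow interleaving can be checked
    tcpdump -i eth2 -c 2000 -w flows.pcap

I will post the perf output and the capture once I have them.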