Hi Alex,

I appreciate your input. As you pointed out, we would expect a single
queue on a single Xeon core to handle, say, 5-6 Gb/s, so with RSS in
place we should easily get close to 10 Gb/s. However, with 8 or 16
queues I still see pause frames, even though there are no basic
problems such as a wrong PCI setup or the wrong interrupt mechanism. I
can see the overhead in the network stack with a lot of socket
connections, but unless the buffers are being held by an upper layer
protocol the MAC driver should not be tied up; at worst packets would
be dropped at the protocol layer. That is the puzzle to solve.
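
In case it is useful, the evidence I am going by is roughly these
counters (eth2 here is just a placeholder for my interface, and the
exact counter names can vary between driver versions):

  ethtool -a eth2     # is rx/tx pause actually negotiated on the link?
  ethtool -S eth2 | grep -i -e flow_control -e no_dma -e missed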

I brought up the rx ring size mostly to see whether it would trigger
some thoughts in that direction. I know a bigger ring only delays the
problem; you still run into it, it is just a matter of time.

As for the traffic pattern, I have no control over what it could be.
Customers can set up whatever makes sense to them, so I have to prepare
for the worst case, much like a cache miss: we have to handle it anyway.

Hank

On Thu, Sep 8, 2016 at 10:16 AM, Alexander Duyck <alexander.du...@gmail.com>
wrote:

> So the lspci output looks good.  Everything appears optimal there.
> From what I can tell the NICs are in slots associated with NUMA
> node 0.
>
> So my thought on all this is that what is likely limiting your
> throughput is the packet processing overhead associated with the fact
> that your frames are around 1300 bytes in length, and the fact that
> the addresses are multicast.  Normally I would expect an optimally
> configured single queue/thread to process somewhere around 8GB/s.
> The fact is the kernel itself normally can't handle much more than
> that without disabling features such as iptables.
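>
> If you want to see where those per-packet cycles are going, something
> along these lines should show whether one core is saturating in
> softirq (the CPU number is just an example, and this assumes perf and
> sysstat are installed):
>
>   mpstat -P ALL 1      # look for a core pegged in %soft
>   perf top -C 2        # profile only the core servicing the busy queue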
>
> It seems like your workload isn't scaling when you add additional
> CPUs because almost all of your traffic is being delivered on just a
> small set of queues.  In order for RSS to be able to spread out
> traffic there need to be enough differences between the flows.  From
> what I can tell the variance in the multicast addresses is not great
> enough to make a substantial difference in where the flows are sent,
> as only 8 of the 16 available queues are being used.  It might be
> useful if you could post a packet capture showing a slice of a few
> thousand frames; if I am not mistaken that should be about 2MB or
> 3MB.  The general idea is to get a sense of what the traffic looks
> like: whether it is a serialized stream that bursts all of the
> traffic for one flow at a time, or whether the flows are well
> interleaved.  It would also tell me why we aren't seeing a good
> distribution, since enabling UDP RSS via
> "ethtool -N <iface> rx-flow-hash udp4 sdfn" should have given us a
> good spread of the traffic, and from the sound of things we just
> aren't seeing that.
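>
> For the capture, something like the following should be enough (the
> interface name and frame count are just examples), and the per-queue
> counters will show how RSS is spreading things today:
>
>   tcpdump -i eth2 -c 2000 -w slice.pcap udp
>   ethtool -S eth2 | grep 'rx_queue_.*_packets'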
>
> Finally, adding additional buffering by increasing the ring size to
> 8192 wouldn't provide any additional throughput; if anything it would
> just slow things down more.  For more information on the effect you
> could search the internet for the term "bufferbloat".  One thing you
> might try is reducing the Rx ring size to 256 descriptors instead of
> 512.  Sometimes that can provide a small bit of improvement, as it
> reduces the descriptor ring to a single 4K page instead of the 8K it
> normally uses.  When multiple rings are active simultaneously this
> reduces the cache footprint, which in turn can improve cache
> utilization since you are less likely to evict data out of the L3
> cache that was placed there by DDIO.
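>
> For reference, the ring change itself would just be something like
> this (the interface name is a placeholder):
>
>   ethtool -g eth2        # show the current ring sizes
>   ethtool -G eth2 rx 256
>
> and watching rx_no_dma_resources in "ethtool -S" afterwards will tell
> you whether the smaller ring starts dropping for lack of descriptors.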
>
> I hope you find some of this information useful.
>
> - Alex
>
> On Thu, Sep 8, 2016 at 8:58 AM, Hank Liu <hank.tz...@gmail.com> wrote:
> > Hi Alex,
> >
> > See attached, thanks!
> >
> > Hank
> >
> >> On Wed, Sep 7, 2016 at 7:32 PM, Alexander Duyck <alexander.du...@gmail.com>
> >> wrote:
> >>
> >> Can you send me an lspci -vvv dump for the card.  The main piece I am
> >> interested in seeing is the link status register output.  I just want
> >> to verify that you are linked at x8 gen2.
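> >>
> >> Something like this pulls out the relevant lines (the bus address
> >> is just an example); for an x8 gen2 link the LnkSta line should
> >> show "Speed 5GT/s, Width x8":
> >>
> >>   lspci -vvv -s 03:00.0 | grep -i -e lnkcap -e lnksta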
> >>
> >> - Alex
> >>
> >> On Wed, Sep 7, 2016 at 4:00 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> >> > Nope, no help. Still seeing pause frames or rx_no_dma_resources
> >> > when BW is up to 8 Gbps...
> >> >
> >> > On Wed, Sep 7, 2016 at 3:42 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> >> >>
> >> >> Hi Alexander,
> >> >>
> >> >> Thanks for your input. Will give it a try.
> >> >>
> >> >>
> >> >> Hank
> >> >>
> >> >> On Wed, Sep 7, 2016 at 3:23 PM, Alexander Duyck
> >> >> <alexander.du...@gmail.com> wrote:
> >> >>>
> >> >>> On Wed, Sep 7, 2016 at 2:19 PM, Rustad, Mark D
> >> >>> <mark.d.rus...@intel.com>
> >> >>> wrote:
> >> >>> > Hank Liu <hank.tz...@gmail.com> wrote:
> >> >>> >
> >> >>> >>> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> >> >>> >>> *Sent:* Wednesday, September 07, 2016 10:20 AM
> >> >>> >>> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> >> >>> >>> *Cc:* e1000-devel@lists.sourceforge.net
> >> >>> >>> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM
> >> >>> >>> cards for 10G SFPs UDP performance issue
> >> >>> >>>
> >> >>> >>>
> >> >>> >>>
> >> >>> >>> Thanks for the quick response and the help. I guess I
> >> >>> >>> didn't make it clear that the application (receiver and
> >> >>> >>> sender) opens 240 connections, and each connection carries
> >> >>> >>> 34 Mbps of traffic.
> >> >>> >
> >> >>> >
> >> >>> > You say that there are 240 connections, but how many threads
> >> >>> > is your app using? One per connection? What does the cpu
> >> >>> > utilization look like on the receiving end?
> >> >>> >
> >> >>> > Also, the current ATR implementation does not support UDP, so
> >> >>> > you are probably better off not pinning the app threads at all
> >> >>> > and trusting that the scheduler will migrate them to the cpu
> >> >>> > that is getting their packets via RSS. You should still set
> >> >>> > the affinity of the interrupts in that case. The default
> >> >>> > number of queues should be fine.
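> >> >>> >
> >> >>> > For example (the interface name and IRQ number below are just
> >> >>> > placeholders, and irqbalance should be stopped so it doesn't
> >> >>> > undo the pinning):
> >> >>> >
> >> >>> >   grep eth2 /proc/interrupts           # list the per-queue vectors
> >> >>> >   echo 4 > /proc/irq/98/smp_affinity   # pin vector 98 to cpu 2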
> >> >>>
> >> >>> If you are running point to point with UDP traffic and are not
> >> >>> fragmenting packets I would recommend enabling RSS for UDP
> >> >>> flows.  You can do that via the following command:
> >> >>> ethtool -N <interface> rx-flow-hash udp4 sdfn
> >> >>>
> >> >>> That should allow the work to spread to more queues than just
> >> >>> the one that is currently being selected based on your source
> >> >>> and destination IP addresses.
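> >> >>>
> >> >>> You can confirm what the hash is using afterwards with:
> >> >>> ethtool -n <interface> rx-flow-hash udp4
> >> >>> which should list the IP addresses and UDP ports as inputs.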
> >> >>>
> >> >>> - Alex
> >> >>
> >> >>
> >> >
> >
> >
>