Hi,

After tuning and debugging, test results show aggregate BW going down and
the flow control xon / rx_no_dma_resources counters going up once the
connection count gets large enough, say 200. A perf trace shows the cores
doing RSS packet handling spending over 40% of their CPU time waiting on a
spin lock.

After upgrading from the CentOS 7 kernel 3.10.0 and ixgbe driver
4.0.1-k-rh7.2 to Linux kernel 4.7.3 and ixgbe 4.4.0, the problem went away
with the same tuning parameters and the same app. I can get the BW I
expected even with 480 sockets. Sounds like there is a bug somewhere further
down the stack?

Special thanks to Alex for his help. Much appreciated.

Hank

On Wed, Sep 7, 2016 at 6:07 PM, Skidmore, Donald C <
donald.c.skidm...@intel.com> wrote:

> Hey Hank,
>
>
>
> I must have misread your ethtool stats, as I only noticed 2 queues in the
> list, but you clearly have 8 queues in play below.  Too much multitasking
> today for me. :)
>
>
>
> That said, with that many sockets it would still be nice to be able to
> spread the load out to more CPUs, since it sounds like your application
> would be capable of having enough threads to service them all.  This makes
> me think that attempting ATR would be in order, once again assuming that
> the individual flows stay around long enough.  If nothing else I would
> experiment with the RSS key to hopefully get a better spread; with your
> current spread it wouldn't be useful to use more than 8 threads.
>
>
>
> It might also be worth considering whether you are reaching some
> application limit, where for some reason the app isn't able to drain those
> queues as fast as the data is coming in.  When you run, say, 8 parallel
> netperf UDP sessions, what kind of throughput do you see then?
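>
> Something along these lines could drive that test (just a rough sketch; the
> target address, session count, and duration below are placeholders for
> whatever your setup uses):
>
>     import subprocess
>
>     TARGET = "192.0.2.10"   # placeholder: receiver's address
>     SESSIONS = 8            # parallel UDP streams
>     DURATION = 30           # seconds per session
>
>     # Launch all sessions in parallel, then collect each one's report.
>     procs = [subprocess.Popen(["netperf", "-H", TARGET, "-t", "UDP_STREAM",
>                                "-l", str(DURATION)], stdout=subprocess.PIPE)
>              for _ in range(SESSIONS)]
>     for i, p in enumerate(procs):
>         out, _ = p.communicate()
>         print("session %d:" % i)
>         print(out.decode())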
>
>
>
> Thanks,
>
> -Don <donald.c.skidm...@intel.com>
>
>
>
> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> *Sent:* Wednesday, September 07, 2016 5:42 PM
>
> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> *Cc:* Rustad, Mark D <mark.d.rus...@intel.com>;
> e1000-devel@lists.sourceforge.net
> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs
> UDP performance issue
>
>
>
> Hi Don,
>
>
>
> Below is a snippet of the full log... How can you tell it's only going into
> 2 queues? I see more than 2 queues with similar packet counts... Can you
> explain more?
>
>
>
> If it were only two queues, that would imply 2 cores handling 2 flows,
> right? But from watch -d -n1 cat /proc/interrupts, I can see the interrupt
> counts increasing at roughly the same rate on all the cores handling
> Ethernet interrupts.
>
>
>
> About our traffic: basically the same 34 Mbps stream is sent to 240
> multicast addresses (225.82.10.0 - 225.82.10.119, 225.82.11.0 -
> 225.82.11.119), roughly 240 x 34 Mbps = 8.2 Gbps aggregate. The receiver
> opens 240 sockets to pull the data out, check the size, then toss it for
> test purposes.
>
>
>
> The test application can run multiple threads, given as a command-line
> input. Each thread handles 240 / N connections, where N is the thread
> count. I don't see much difference in behavior either way.
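>
> For reference, the receive path looks roughly like this (a simplified
> sketch, not the real app; the port number and thread count here are
> placeholders):
>
>     import select, socket, struct, threading
>
>     GROUPS = (["225.82.10.%d" % i for i in range(120)] +
>               ["225.82.11.%d" % i for i in range(120)])
>     PORT = 5000        # placeholder port
>     NTHREADS = 8       # N; each thread services 240 / N groups
>
>     def worker(groups):
>         socks = []
>         for g in groups:
>             s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
>             s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>             s.bind((g, PORT))      # one socket per multicast group
>             mreq = struct.pack("4s4s", socket.inet_aton(g),
>                                socket.inet_aton("0.0.0.0"))
>             s.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
>             socks.append(s)
>         while True:
>             ready, _, _ = select.select(socks, [], [])
>             for s in ready:
>                 data = s.recv(65535)      # pull it out...
>                 assert len(data) > 0      # ...check size, then toss it
>
>     threads = [threading.Thread(target=worker, args=(GROUPS[t::NTHREADS],))
>                for t in range(NTHREADS)]
>     for th in threads:
>         th.start()
>     for th in threads:
>         th.join()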
>
>
>
> Thanks!
>
>
>
> Hank
>
>      rx_queue_0_packets: 1105903
>
>      rx_queue_0_bytes: 1501816274
>
>      rx_queue_0_bp_poll_yield: 0
>
>      rx_queue_0_bp_misses: 0
>
>      rx_queue_0_bp_cleaned: 0
>
>      rx_queue_1_packets: 1108639
>
>      rx_queue_1_bytes: 1505531762
>
>      rx_queue_1_bp_poll_yield: 0
>
>      rx_queue_1_bp_misses: 0
>
>      rx_queue_1_bp_cleaned: 0
>
>      rx_queue_2_packets: 0
>
>      rx_queue_2_bytes: 0
>
>      rx_queue_2_bp_poll_yield: 0
>
>      rx_queue_2_bp_misses: 0
>
>      rx_queue_2_bp_cleaned: 0
>
>      rx_queue_3_packets: 0
>
>      rx_queue_3_bytes: 0
>
>      rx_queue_3_bp_poll_yield: 0
>
>      rx_queue_3_bp_misses: 0
>
>      rx_queue_3_bp_cleaned: 0
>
>      rx_queue_4_packets: 1656985
>
>      rx_queue_4_bytes: 2250185630
>
>      rx_queue_4_bp_poll_yield: 0
>
>      rx_queue_4_bp_misses: 0
>
>      rx_queue_4_bp_cleaned: 0
>
>      rx_queue_5_packets: 1107023
>
>      rx_queue_5_bytes: 1503337234
>
>      rx_queue_5_bp_poll_yield: 0
>
>      rx_queue_5_bp_misses: 0
>
>      rx_queue_5_bp_cleaned: 0
>
>      rx_queue_6_packets: 0
>
>      rx_queue_6_bytes: 0
>
>      rx_queue_6_bp_poll_yield: 0
>
>      rx_queue_6_bp_misses: 0
>
>      rx_queue_6_bp_cleaned: 0
>
>      rx_queue_7_packets: 0
>
>      rx_queue_7_bytes: 0
>
>      rx_queue_7_bp_poll_yield: 0
>
>      rx_queue_7_bp_misses: 0
>
>      rx_queue_7_bp_cleaned: 0
>
>      rx_queue_8_packets: 0
>
>      rx_queue_8_bytes: 0
>
>      rx_queue_8_bp_poll_yield: 0
>
>      rx_queue_8_bp_misses: 0
>
>      rx_queue_8_bp_cleaned: 0
>
>      rx_queue_9_packets: 0
>
>      rx_queue_9_bytes: 0
>
>      rx_queue_9_bp_poll_yield: 0
>
>      rx_queue_9_bp_misses: 0
>
>      rx_queue_9_bp_cleaned: 0
>
>      rx_queue_10_packets: 1668431
>
>      rx_queue_10_bytes: 2265729298
>
>      rx_queue_10_bp_poll_yield: 0
>
>      rx_queue_10_bp_misses: 0
>
>      rx_queue_10_bp_cleaned: 0
>
>      rx_queue_11_packets: 1106051
>
>      rx_queue_11_bytes: 1502017258
>
>      rx_queue_11_bp_poll_yield: 0
>
>      rx_queue_11_bp_misses: 0
>
>      rx_queue_11_bp_cleaned: 0
>
>      rx_queue_12_packets: 0
>
>      rx_queue_12_bytes: 0
>
>      rx_queue_12_bp_poll_yield: 0
>
>      rx_queue_12_bp_misses: 0
>
>      rx_queue_12_bp_cleaned: 0
>
>      rx_queue_13_packets: 0
>
>      rx_queue_13_bytes: 0
>
>      rx_queue_13_bp_poll_yield: 0
>
>      rx_queue_13_bp_misses: 0
>
>      rx_queue_13_bp_cleaned: 0
>
>      rx_queue_14_packets: 1107157
>
>      rx_queue_14_bytes: 1503519206
>
>      rx_queue_14_bp_poll_yield: 0
>
>      rx_queue_14_bp_misses: 0
>
>      rx_queue_14_bp_cleaned: 0
>
>      rx_queue_15_packets: 1107574
>
>      rx_queue_15_bytes: 1504085492
>
>      rx_queue_15_bp_poll_yield: 0
>
>      rx_queue_15_bp_misses: 0
>
>      rx_queue_15_bp_cleaned: 0
>
>      rx_queue_16_packets: 0
>
>      rx_queue_16_bytes: 0
>
>      rx_queue_16_bp_poll_yield: 0
>
>      rx_queue_16_bp_misses: 0
>
>      rx_queue_1
>
>
>
> On Wed, Sep 7, 2016 at 5:04 PM, Skidmore, Donald C <
> donald.c.skidm...@intel.com> wrote:
>
> Hey Hank,
>
>
>
> Well, it looks like all your traffic is just hashing to 2 queues.  You have
> ATR enabled but it isn't being used, due to this being UDP traffic.  That
> isn't a problem, since the RSS hash will occur on anything that doesn't
> match ATR (in your case everything).  All this means is that you only have
> 2 flows and thus all the work is being done with only two queues.  To get a
> better hash spread you could modify the RSS hash key, but I would first
> look at your traffic to see if you even have more than 2 flows operating.
> Maybe something can be done in the application to allow for more
> parallelism, e.g. run four threads (assuming each thread opens its own
> socket)?
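>
> If it does turn out to be a hashing problem, programming a different RSS
> hash key is easy to try.  A rough sketch, assuming your ethtool and driver
> support setting the key via "ethtool -X <dev> hkey" and using "eth2" as a
> placeholder interface name (run as root):
>
>     import os, subprocess
>
>     IFACE = "eth2"   # placeholder interface name
>
>     # Generate a random 40-byte RSS hash key and program it with ethtool -X.
>     key = ":".join("%02x" % b for b in bytearray(os.urandom(40)))
>     subprocess.check_call(["ethtool", "-X", IFACE, "hkey", key])
>     print("programmed new RSS key: " + key)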
>
>
>
> As for the rx_no_dma_resources counter, it is tied directly to one of
> our HW counters.  It gets bumped if the target queue is disabled (unlikely
> in your case) or there are no free descriptors in the target queue.  The
> latter makes sense here, since all of your traffic is going to just two
> queues that appear to not be getting drained fast enough.
>
>
>
> Thanks,
>
> -Don <donald.c.skidm...@intel.com>
>
>
>
>
>
>
>
> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> *Sent:* Wednesday, September 07, 2016 4:51 PM
> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> *Cc:* Rustad, Mark D <mark.d.rus...@intel.com>;
> e1000-devel@lists.sourceforge.net
>
>
> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs
> UDP performance issue
>
>
>
> Hi Don,
>
>
>
> I got log for you to look at. See attached...
>
>
>
> Thanks and let me know. BTW, can anyone tell me what could cause
> rx_no_dma_resource?
>
>
>
> Hank
>
>
>
> On Wed, Sep 7, 2016 at 4:04 PM, Skidmore, Donald C <
> donald.c.skidm...@intel.com> wrote:
>
> ATR is Application Targeted Receive.  It may be useful for you, but the
> flow isn't directed to a CPU until you transmit, and since you mentioned
> you don't do much transmission it would have to be via the ACKs.  Likewise,
> the flows will need to stick around for a while to gain any advantage from
> it.  Still, it wouldn't hurt to test using the ethtool command Alex
> mentioned in another email.
>
>
>
> In general I would like to see you just go with the default of 16 RSS
> queues and not attempt to mess with the affinization of the interrupt
> vectors.  If the performance is still bad, I would be interested in how the
> flows are being distributed between the queues.  You can see this via the
> per-queue packet counts you get out of the ethtool stats.  What I want to
> eliminate is the possibility that RSS is seeing all your traffic as one
> flow.
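>
> A quick script along these lines (with "eth2" as a placeholder interface
> name) will summarize how the packets are spread across the receive queues:
>
>     import re, subprocess
>
>     IFACE = "eth2"   # placeholder interface name
>
>     # Pull the ethtool stats and print the non-empty rx queues.
>     out = subprocess.check_output(["ethtool", "-S", IFACE]).decode()
>     for line in out.splitlines():
>         m = re.match(r"\s*rx_queue_(\d+)_packets:\s*(\d+)", line)
>         if m and int(m.group(2)) > 0:
>             print("rx queue %s: %s packets" % (m.group(1), m.group(2)))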
>
>
>
> Thanks,
>
> -Don <donald.c.skidm...@intel.com>
>
>
>
>
>
> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> *Sent:* Wednesday, September 07, 2016 3:40 PM
> *To:* Rustad, Mark D <mark.d.rus...@intel.com>
> *Cc:* Skidmore, Donald C <donald.c.skidm...@intel.com>;
> e1000-devel@lists.sourceforge.net
> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs
> UDP performance issue
>
>
>
> Mark,
>
>
>
> Thanks!
>
>
>
> The test app lets me specify how many pthreads handle the connections. I
> have tried 4, 8, 16, etc., but none of them make a significant difference.
> CPU usage on the receive end is moderate (50-60%). If I poll aggressively
> to prevent any drops in the UDP layer, it might go up a bit. I did pin the
> network interrupts to a dedicated CPU set, and I can see the interrupt rate
> is pretty even across all the CPUs involved.
>
>
>
> Since I am seeing a lot of rx_no_dma_resources, and this counter is read
> out from the 82599 controller, I would like to know why it happens. Note: I
> already bumped the rx ring size to the maximum (4096) I can set in ethtool.
>
>
>
> BTW, what is ATR? I didn't set up any filter...
>
>
>
>
>
> Hank
>
>
>
> On Wed, Sep 7, 2016 at 2:19 PM, Rustad, Mark D <mark.d.rus...@intel.com>
> wrote:
>
> Hank Liu <hank.tz...@gmail.com> wrote:
>
> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> *Sent:* Wednesday, September 07, 2016 10:20 AM
> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> *Cc:* e1000-devel@lists.sourceforge.net
> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs
> UDP performance issue
>
>
>
> Thanks for the quick response and the help. I guess I didn't make it clear
> that the application (receiver, sender) opens 240 connections, and each
> connection carries 34 Mbps of traffic.
>
>
> You say that there are 240 connections, but how many threads is your app
> using? One per connection? What does the cpu utilization look like on the
> receiving end?
>
> Also, the current ATR implementation does not support UDP, so you are
> probably better off not pinning the app threads at all and trusting that
> the scheduler will migrate them to the CPU that is getting their packets
> via RSS. You should still set the affinity of the interrupts in that case.
> The default number of queues should be fine.
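>
> A rough sketch of pinning the queue interrupts, assuming the usual ixgbe
> vector naming of <iface>-TxRx-<n> and using "eth2" as a placeholder
> interface name (run as root, and stop irqbalance first or it will undo the
> change):
>
>     import re
>
>     IFACE = "eth2"   # placeholder interface name
>
>     # Pin each <iface>-TxRx-<n> vector to CPU n via smp_affinity_list.
>     with open("/proc/interrupts") as f:
>         for line in f:
>             m = re.match(r"\s*(\d+):.*%s-TxRx-(\d+)" % IFACE, line)
>             if m:
>                 irq, queue = m.group(1), int(m.group(2))
>                 path = "/proc/irq/%s/smp_affinity_list" % irq
>                 with open(path, "w") as aff:
>                     aff.write(str(queue))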
>
> --
> Mark Rustad, Networking Division, Intel Corporation
>
>
>
>
>
>
>