Hey Hank,

The reason I was interested in your results with a parallel UDP netperf test is that 1) if we didn't see the problem there, it would be a strong indication that the bottleneck is above the stack, or 2) if we did see the same performance issue, I would be able to recreate it internally here.
I'm not as hopeful about increasing the ring size. Increasing this buffer would not help if our descriptors aren't being drained as fast as data is coming in. With TCP I might expect smaller rings and bursty traffic to lead to retransmits, which would affect bandwidth, but with UDP that would only come into play based on the protocol above UDP. Once again, a netperf test would help demonstrate whether this is happening. If you want to try it, you could hack up the driver and give it a shot. I haven't looked in detail, but it might be as simple as bumping up the IXGBE_MAX_RXD define. It would at least be a good place to start.

Thanks,
-Don

From: Hank Liu [mailto:hank.tz...@gmail.com]
Sent: Wednesday, September 07, 2016 6:40 PM
To: Skidmore, Donald C <donald.c.skidm...@intel.com>
Cc: Rustad, Mark D <mark.d.rus...@intel.com>; e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue

Hi Don,

I have enough CPU budget to run the app aggressively, i.e. use many cores to deal with one 10G card. If the app were not draining socket data quickly enough, I would expect to see UDP-layer packet drops, but I didn't. The reason I started tuning, obviously, is that I saw bad things with the default settings: no DMA resources, or pause frames being sent out. For example, the default rx ring size is 512, and at 512 I can see rx_no_dma_resource even with 4 Gbps of traffic. If I move up to 4096, it can handle up to 7-8 Gbps. Looking at the Intel 82599 controller datasheet, it appears to me that the chip supports up to 8192 descriptors per ring, but the driver does not. I am wondering if you guys can do something about that. Maybe that is the reason we are seeing no DMA resources? I have great concern about the case where we need to deal with 4x10G ports. Any thoughts?

Hank

On Wed, Sep 7, 2016 at 6:07 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:

Hey Hank,

I must have misread your ethtool stats, as I only noticed 2 queues in the list, and you clearly have 8 queues in play below. Too much multitasking today for me. :) That said, with as many sockets as you could have, it would still be nice to spread the load out to more CPUs, since it sounds like your application would be capable of having enough threads to service them all. This makes me think that attempting ATR would be in order, once again assuming that the individual flows stay around long enough. If nothing else, I would mess with the RSS key to hopefully get a better spread; with your current spread it wouldn't be useful to use more than 8 threads. It might also be worth considering whether you are hitting some application limit, such that for some reason it isn't able to drain those queues as fast as the data is coming in. When you run, say, 8 parallel netperf UDP sessions, what kind of throughput do you see?

Thanks,
-Don <donald.c.skidm...@intel.com>

From: Hank Liu [mailto:hank.tz...@gmail.com]
Sent: Wednesday, September 07, 2016 5:42 PM
To: Skidmore, Donald C <donald.c.skidm...@intel.com>
Cc: Rustad, Mark D <mark.d.rus...@intel.com>; e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue

Hi Don,

Below is a snippet of the full log... How do you know it only goes into 2 queues? I see more than 2 queues with similar packet counts... Can you explain more?
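A rough sketch of the driver change Don suggests above (raising the IXGBE_MAX_RXD cap). The constant lives in ixgbe.h in the ixgbe driver source; whether the hardware actually tolerates rings larger than 4096 descriptors is an assumption to verify against the 82599 datasheet, per Hank's reading of 8192:

/* Sketch only: raise the driver's rx ring-size ceiling so ethtool -G can
 * request larger rings.  8192 is taken from Hank's reading of the 82599
 * datasheet and is not a verified limit.
 */
#define IXGBE_MAX_TXD                   4096   /* unchanged */
#define IXGBE_MAX_RXD                   8192   /* was 4096 */

/* After rebuilding and reloading the driver, the larger ring would be
 * requested the usual way, e.g.:
 *     ethtool -G eth2 rx 8192
 * (interface name is illustrative).
 */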
If it is two queues, would that imply 2 cores handle 2 flows? But from watch -d -n1 cat /proc/interrupts, I can see the interrupt rate increasing at about the same pace on all of the cores handling ethernet interrupts. About our traffic: it is basically the same 34 Mbps stream sent to 240 multicast addresses (225.82.10.0 - 225.82.10.119, 225.82.11.0 - 225.82.11.119). The receiver opens 240 sockets to pull the data out, check the size, and then toss it, for test purposes. The test application can run multiple threads, given as a command-line input; each thread handles 240 / N connections, where N is the thread count. I don't see much difference in behavior either way.

Thanks!
Hank

rx_queue_0_packets: 1105903  rx_queue_0_bytes: 1501816274  rx_queue_0_bp_poll_yield: 0  rx_queue_0_bp_misses: 0  rx_queue_0_bp_cleaned: 0
rx_queue_1_packets: 1108639  rx_queue_1_bytes: 1505531762  rx_queue_1_bp_poll_yield: 0  rx_queue_1_bp_misses: 0  rx_queue_1_bp_cleaned: 0
rx_queue_2_packets: 0  rx_queue_2_bytes: 0  rx_queue_2_bp_poll_yield: 0  rx_queue_2_bp_misses: 0  rx_queue_2_bp_cleaned: 0
rx_queue_3_packets: 0  rx_queue_3_bytes: 0  rx_queue_3_bp_poll_yield: 0  rx_queue_3_bp_misses: 0  rx_queue_3_bp_cleaned: 0
rx_queue_4_packets: 1656985  rx_queue_4_bytes: 2250185630  rx_queue_4_bp_poll_yield: 0  rx_queue_4_bp_misses: 0  rx_queue_4_bp_cleaned: 0
rx_queue_5_packets: 1107023  rx_queue_5_bytes: 1503337234  rx_queue_5_bp_poll_yield: 0  rx_queue_5_bp_misses: 0  rx_queue_5_bp_cleaned: 0
rx_queue_6_packets: 0  rx_queue_6_bytes: 0  rx_queue_6_bp_poll_yield: 0  rx_queue_6_bp_misses: 0  rx_queue_6_bp_cleaned: 0
rx_queue_7_packets: 0  rx_queue_7_bytes: 0  rx_queue_7_bp_poll_yield: 0  rx_queue_7_bp_misses: 0  rx_queue_7_bp_cleaned: 0
rx_queue_8_packets: 0  rx_queue_8_bytes: 0  rx_queue_8_bp_poll_yield: 0  rx_queue_8_bp_misses: 0  rx_queue_8_bp_cleaned: 0
rx_queue_9_packets: 0  rx_queue_9_bytes: 0  rx_queue_9_bp_poll_yield: 0  rx_queue_9_bp_misses: 0  rx_queue_9_bp_cleaned: 0
rx_queue_10_packets: 1668431  rx_queue_10_bytes: 2265729298  rx_queue_10_bp_poll_yield: 0  rx_queue_10_bp_misses: 0  rx_queue_10_bp_cleaned: 0
rx_queue_11_packets: 1106051  rx_queue_11_bytes: 1502017258  rx_queue_11_bp_poll_yield: 0  rx_queue_11_bp_misses: 0  rx_queue_11_bp_cleaned: 0
rx_queue_12_packets: 0  rx_queue_12_bytes: 0  rx_queue_12_bp_poll_yield: 0  rx_queue_12_bp_misses: 0  rx_queue_12_bp_cleaned: 0
rx_queue_13_packets: 0  rx_queue_13_bytes: 0  rx_queue_13_bp_poll_yield: 0  rx_queue_13_bp_misses: 0  rx_queue_13_bp_cleaned: 0
rx_queue_14_packets: 1107157  rx_queue_14_bytes: 1503519206  rx_queue_14_bp_poll_yield: 0  rx_queue_14_bp_misses: 0  rx_queue_14_bp_cleaned: 0
rx_queue_15_packets: 1107574  rx_queue_15_bytes: 1504085492  rx_queue_15_bp_poll_yield: 0  rx_queue_15_bp_misses: 0  rx_queue_15_bp_cleaned: 0
rx_queue_16_packets: 0  rx_queue_16_bytes: 0  rx_queue_16_bp_poll_yield: 0  rx_queue_16_bp_misses: 0  rx_queue_1

On Wed, Sep 7, 2016 at 5:04 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:

Hey Hank,

Well, it looks like all your traffic is hashing to just 2 queues. You have ATR enabled, but it isn't being used due to this being UDP traffic. That isn't a problem, since RSS hashing occurs on anything that doesn't match ATR (in your case, everything). All this means is that you effectively have only 2 flows, and thus all the work is being done by only two queues. To get a better hash spread you could modify the RSS hash key, but I would first look at your traffic to see whether you even have more than 2 flows operating.
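A side note on the question running through this thread of whether the sockets are being drained fast enough: per-socket receive-queue drops can be observed directly with the Linux SO_RXQ_OVFL socket option, which attaches the kernel's drop counter to recvmsg() as ancillary data. A minimal sketch, assuming fd is an already-bound UDP socket (function name is illustrative, error handling omitted):

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Enable SO_RXQ_OVFL so recvmsg() reports the socket's drop counter as a
 * control message; a growing value means the receive queue overflowed. */
static void recv_with_drop_count(int fd)
{
    int on = 1;
    setsockopt(fd, SOL_SOCKET, SO_RXQ_OVFL, &on, sizeof(on));

    char data[2048];
    char cbuf[CMSG_SPACE(sizeof(uint32_t))];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };

    if (recvmsg(fd, &msg, 0) < 0)
        return;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c)) {
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SO_RXQ_OVFL) {
            uint32_t drops;
            memcpy(&drops, CMSG_DATA(c), sizeof(drops));
            if (drops)
                fprintf(stderr, "socket %d: drop counter now %u\n", fd, drops);
        }
    }
}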
Maybe something can be done in the application to allow for more parallelism, for instance running four threads (assuming each thread opens its own socket)? As for the rx_no_dma_resource counter, it is tied directly to one of our HW counters. It gets bumped if the target queue is disabled (unlikely in your case) or there are no free descriptors in the target queue. The latter makes sense, since all of your traffic is going to just two queues that appear not to be getting drained fast enough.

Thanks,
-Don <donald.c.skidm...@intel.com>

From: Hank Liu [mailto:hank.tz...@gmail.com]
Sent: Wednesday, September 07, 2016 4:51 PM
To: Skidmore, Donald C <donald.c.skidm...@intel.com>
Cc: Rustad, Mark D <mark.d.rus...@intel.com>; e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue

Hi Don,

I have a log for you to look at. See attached... Thanks, and let me know. BTW, can anyone tell me what could cause rx_no_dma_resource?

Hank

On Wed, Sep 7, 2016 at 4:04 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:

ATR is application targeted receive. It may be useful for you, but a flow isn't directed to a CPU until you transmit, and since you mentioned you don't do much transmission, that would have to happen via the ACKs. Likewise, the flows need to stick around for a while to gain any advantage from it. Still, it wouldn't hurt to test using the ethtool command Alex mentioned in another email. In general I would like to see you just go with the default of 16 RSS queues and not attempt to mess with the affinitization of the interrupt vectors. If the performance is still bad, I would be interested in how the flows are being distributed between the queues. You can see this via the per-queue packet counts you get out of the ethtool stats. What I want to eliminate is the possibility that RSS is seeing all your traffic as one flow.

Thanks,
-Don <donald.c.skidm...@intel.com>

From: Hank Liu [mailto:hank.tz...@gmail.com]
Sent: Wednesday, September 07, 2016 3:40 PM
To: Rustad, Mark D <mark.d.rus...@intel.com>
Cc: Skidmore, Donald C <donald.c.skidm...@intel.com>; e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue

Mark,

Thanks! The test app can specify how many pthreads handle the connections. I have tried 4, 8, 16, etc., but none of them makes a significant difference. CPU usage on the receive end is moderate (50-60%). If I poll aggressively to prevent any drops at the UDP layer, it might go up a bit. For the CPU set that handles network interrupts, I did pin those CPUs, and I can see the interrupt rate is fairly even across all CPUs involved. Since I am seeing a lot of rx_no_dma_resource, and this counter is read out of the 82599 controller, I would like to know why it happens. Note: I already bumped the rx ring size to the maximum (4096) I can set with ethtool. BTW, what is ATR? I didn't set up any filter...
Hank

On Wed, Sep 7, 2016 at 2:19 PM, Rustad, Mark D <mark.d.rus...@intel.com> wrote:

Hank Liu <hank.tz...@gmail.com> wrote:

From: Hank Liu [mailto:hank.tz...@gmail.com]
Sent: Wednesday, September 07, 2016 10:20 AM
To: Skidmore, Donald C <donald.c.skidm...@intel.com>
Cc: e1000-devel@lists.sourceforge.net
Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue

> Thanks for the quick response and the help. What I guess I didn't make clear is that the application (receiver, sender) opens 240 connections, and each connection carries 34 Mbps of traffic.

You say that there are 240 connections, but how many threads is your app using? One per connection? What does the CPU utilization look like on the receiving end?

Also, the current ATR implementation does not support UDP, so you are probably better off not pinning the app threads at all and trusting that the scheduler will migrate them to the CPU that is getting their packets via RSS. You should still set the affinity of the interrupts in that case. The default number of queues should be fine.

--
Mark Rustad, Networking Division, Intel Corporation
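To make the application layout discussed in this thread concrete, below is a minimal sketch of a receiver with one socket per multicast group and the groups split across worker threads. The group range follows Hank's description; the port, group count, and thread count are placeholder assumptions, and error handling is omitted:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

#define NUM_GROUPS   8        /* Hank's test uses 240 groups */
#define NUM_THREADS  4        /* Hank tried 4, 8, 16, ... */
#define UDP_PORT     5000     /* placeholder, not taken from the thread */
#define PER_THREAD   (NUM_GROUPS / NUM_THREADS)

/* Each worker owns PER_THREAD sockets, one per multicast group, and just
 * drains them: receive, check the size, toss the payload. */
static void *drain(void *arg)
{
    long first = (long)arg * PER_THREAD;
    int fds[PER_THREAD];
    char buf[2048];

    for (long i = 0; i < PER_THREAD; i++) {
        char group[32];
        snprintf(group, sizeof(group), "225.82.10.%ld", first + i);

        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        int on = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

        /* Bind to the group address so this socket only sees its group. */
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(UDP_PORT) };
        inet_pton(AF_INET, group, &addr.sin_addr);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        struct ip_mreq mreq;
        memset(&mreq, 0, sizeof(mreq));   /* imr_interface = INADDR_ANY */
        inet_pton(AF_INET, group, &mreq.imr_multiaddr);
        setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq));
        fds[i] = fd;
    }

    /* With every group streaming continuously, round-robin blocking reads
     * are enough for a sketch; a real app would use epoll(). */
    for (;;)
        for (long i = 0; i < PER_THREAD; i++)
            recv(fds[i], buf, sizeof(buf), 0);
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&tid[t], NULL, drain, (void *)t);
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
    return 0;
}

With one socket per group, whether the load then spreads across cores depends on how the hardware hashes those flows, which is exactly what the per-queue counters above were meant to check.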
------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired