Hi Don,

I have enough CPU budget to run the app aggressively - i.e. use many cores to deal with one 1x10G card. If the app were not draining the socket data quickly enough, I would expect to see UDP-layer packet drops, but I don't.
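
(For reference, UDP-layer drops would normally show up in the kernel's per-protocol counters; a quick way to check is something like:)

    netstat -su                 # look for "packet receive errors" / "receive buffer errors" under Udp:
    grep Udp: /proc/net/snmp    # the same counters in raw form (InErrors, RcvbufErrors)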

The reason I started tuning in the first place is that I was seeing bad things with the default settings - no-DMA-resource errors and pause frames being sent out. For example, the default rx ring size is 512, and with 512 descriptors I see no-DMA-resource errors even at 4 Gbps of traffic. If I move up to 4096, it can handle 7-8 Gbps. Looking at the Intel 82599 controller data sheet, it appears to me that the chip supports up to 8192 descriptors per ring, but the driver does not. I am wondering whether you guys can do something about that - maybe that is why we are seeing no-DMA-resource errors? I have serious concerns about the point where we need to deal with 4x10G ports. Any thoughts?

Hank

On Wed, Sep 7, 2016 at 6:07 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:

> Hey Hank,
>
> I must have misread your ethtool stats, as I only noticed 2 in the list, and you clearly have 8 queues in play below. Too much multitasking today for me. :)
>
> That said, with that many sockets it would still be nice to be able to spread the load out to more CPUs, since it sounds like your application would be capable of having enough threads to service them all. This makes me think that attempting ATR would be in order, once again assuming that the individual flows stay around long enough. If nothing else, I would mess with the RSS key to hopefully get a better spread; with your current spread it wouldn't be useful to use more than 8 threads.
>
> It might also be worth thinking about whether you are reaching some application limit, where for some reason it isn't able to drain those queues as fast as the data is coming in. When you run, say, 8 parallel netperf UDP sessions, what kind of throughput do you see then?
>
> Thanks,
> -Don <donald.c.skidm...@intel.com>
>
> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> *Sent:* Wednesday, September 07, 2016 5:42 PM
> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> *Cc:* Rustad, Mark D <mark.d.rus...@intel.com>; e1000-devel@lists.sourceforge.net
> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
>
> Hi Don,
>
> Below is a snippet of the full log... How can you tell it only goes into 2 queues? I see more than 2 queues with similar packet counts... Can you explain more?
>
> If it were two queues, that would imply 2 cores handling 2 flows, right? But from watch -d -n1 cat /proc/interrupts, I can see the interrupt rate increasing at the same rate on all the cores handling ethernet interrupts.
>
> About our traffic: basically the same 34 Mbps stream is sent to 240 multicast addresses (225.82.10.0 - 225.82.10.119, 225.82.11.0 - 225.82.11.119). The receiver opens 240 sockets to pull the data out, check the size, and then toss it, for test purposes.
>
> The test application can run multiple threads, given on the command line. Each thread handles 240 / N connections, where N is the number of threads. I don't see much difference in behavior either way.
>
> Thanks!
>
> Hank
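
(To take the application out of the picture, the parallel netperf UDP test Don suggests above can be scripted roughly like this; the receiver address, stream count, duration, and message size are only placeholders:)

    # run netserver on the receiving box first, then on the sender:
    for i in $(seq 1 8); do
        netperf -H 192.168.10.1 -t UDP_STREAM -l 30 -- -m 1400 &
    done
    wait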
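
(The per-queue counters below come from the ethtool statistics; a convenient way to pull out just the receive-queue packet counts, assuming the port is eth4, is:)

    ethtool -S eth4 | grep -E 'rx_queue_[0-9]+_packets'
    # or show only the queues that actually received traffic:
    ethtool -S eth4 | awk -F: '/rx_queue_[0-9]+_packets/ && $2+0 > 0'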

> rx_queue_0_packets: 1105903
> rx_queue_0_bytes: 1501816274
> rx_queue_0_bp_poll_yield: 0
> rx_queue_0_bp_misses: 0
> rx_queue_0_bp_cleaned: 0
> rx_queue_1_packets: 1108639
> rx_queue_1_bytes: 1505531762
> rx_queue_1_bp_poll_yield: 0
> rx_queue_1_bp_misses: 0
> rx_queue_1_bp_cleaned: 0
> rx_queue_2_packets: 0
> rx_queue_2_bytes: 0
> rx_queue_2_bp_poll_yield: 0
> rx_queue_2_bp_misses: 0
> rx_queue_2_bp_cleaned: 0
> rx_queue_3_packets: 0
> rx_queue_3_bytes: 0
> rx_queue_3_bp_poll_yield: 0
> rx_queue_3_bp_misses: 0
> rx_queue_3_bp_cleaned: 0
> rx_queue_4_packets: 1656985
> rx_queue_4_bytes: 2250185630
> rx_queue_4_bp_poll_yield: 0
> rx_queue_4_bp_misses: 0
> rx_queue_4_bp_cleaned: 0
> rx_queue_5_packets: 1107023
> rx_queue_5_bytes: 1503337234
> rx_queue_5_bp_poll_yield: 0
> rx_queue_5_bp_misses: 0
> rx_queue_5_bp_cleaned: 0
> rx_queue_6_packets: 0
> rx_queue_6_bytes: 0
> rx_queue_6_bp_poll_yield: 0
> rx_queue_6_bp_misses: 0
> rx_queue_6_bp_cleaned: 0
> rx_queue_7_packets: 0
> rx_queue_7_bytes: 0
> rx_queue_7_bp_poll_yield: 0
> rx_queue_7_bp_misses: 0
> rx_queue_7_bp_cleaned: 0
> rx_queue_8_packets: 0
> rx_queue_8_bytes: 0
> rx_queue_8_bp_poll_yield: 0
> rx_queue_8_bp_misses: 0
> rx_queue_8_bp_cleaned: 0
> rx_queue_9_packets: 0
> rx_queue_9_bytes: 0
> rx_queue_9_bp_poll_yield: 0
> rx_queue_9_bp_misses: 0
> rx_queue_9_bp_cleaned: 0
> rx_queue_10_packets: 1668431
> rx_queue_10_bytes: 2265729298
> rx_queue_10_bp_poll_yield: 0
> rx_queue_10_bp_misses: 0
> rx_queue_10_bp_cleaned: 0
> rx_queue_11_packets: 1106051
> rx_queue_11_bytes: 1502017258
> rx_queue_11_bp_poll_yield: 0
> rx_queue_11_bp_misses: 0
> rx_queue_11_bp_cleaned: 0
> rx_queue_12_packets: 0
> rx_queue_12_bytes: 0
> rx_queue_12_bp_poll_yield: 0
> rx_queue_12_bp_misses: 0
> rx_queue_12_bp_cleaned: 0
> rx_queue_13_packets: 0
> rx_queue_13_bytes: 0
> rx_queue_13_bp_poll_yield: 0
> rx_queue_13_bp_misses: 0
> rx_queue_13_bp_cleaned: 0
> rx_queue_14_packets: 1107157
> rx_queue_14_bytes: 1503519206
> rx_queue_14_bp_poll_yield: 0
> rx_queue_14_bp_misses: 0
> rx_queue_14_bp_cleaned: 0
> rx_queue_15_packets: 1107574
> rx_queue_15_bytes: 1504085492
> rx_queue_15_bp_poll_yield: 0
> rx_queue_15_bp_misses: 0
> rx_queue_15_bp_cleaned: 0
> rx_queue_16_packets: 0
> rx_queue_16_bytes: 0
> rx_queue_16_bp_poll_yield: 0
> rx_queue_16_bp_misses: 0
> rx_queue_1
>
> On Wed, Sep 7, 2016 at 5:04 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:
>
> Hey Hank,
>
> Well, it looks like all your traffic is just hashing to 2 queues. You have ATR enabled, but it isn't being used, due to this being UDP traffic. That isn't a problem, since the RSS hash will occur on anything that doesn't match ATR (in your case, everything). All this means is that you only have 2 flows, and thus all the work is being done by only two queues. To get a better hash spread you could modify the RSS hash key, but I would first look at your traffic to see whether you even have more than 2 flows operating. Maybe something can be done in the application to allow for more parallelism - run four threads, for instance (assuming each thread opens its own socket)?
>
> As for the rx_no_dma_resource counter, it is tied directly to one of our HW counters. It gets bumped if the target queue is disabled (unlikely in your case) or there are no free descriptors in the target queue. The latter makes sense, since all of your traffic is going to just two queues that appear not to be getting drained fast enough.
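
(On the RSS spread Don mentions: the indirection table and the header fields used for hashing can be inspected and adjusted with ethtool; eth4 is a placeholder, and driver support for changing the UDP hash fields varies:)

    ethtool -x eth4                          # show the RSS indirection table (and hash key on newer ethtool)
    ethtool -n eth4 rx-flow-hash udp4        # which fields the UDP/IPv4 hash currently uses
    ethtool -N eth4 rx-flow-hash udp4 sdfn   # hash on src/dst IP plus UDP ports instead of IPs only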

> Thanks,
> -Don <donald.c.skidm...@intel.com>
>
> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> *Sent:* Wednesday, September 07, 2016 4:51 PM
> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> *Cc:* Rustad, Mark D <mark.d.rus...@intel.com>; e1000-devel@lists.sourceforge.net
> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
>
> Hi Don,
>
> I have a log for you to look at. See attached...
>
> Thanks, and let me know. BTW, can anyone tell me what could cause rx_no_dma_resource?
>
> Hank
>
> On Wed, Sep 7, 2016 at 4:04 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:
>
> ATR is Application Targeted Receive. It may be useful for you, but a flow isn't directed to a CPU until you transmit, and since you mentioned you don't do much transmission, it would have to be via the ACKs. Likewise, the flows will need to stick around for a while to gain any advantage from it. Still, it wouldn't hurt to test it using the ethtool command Alex mentioned in another email.
>
> In general, I would like to see you just go with the default of 16 RSS queues and not attempt to mess with the affinization of the interrupt vectors. If the performance is still bad, I would be interested in how the flows are being distributed between the queues. You can see this via the per-queue packet counts you get out of the ethtool stats. What I want to eliminate is the possibility that RSS is seeing all your traffic as one flow.
>
> Thanks,
> -Don <donald.c.skidm...@intel.com>
>
> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> *Sent:* Wednesday, September 07, 2016 3:40 PM
> *To:* Rustad, Mark D <mark.d.rus...@intel.com>
> *Cc:* Skidmore, Donald C <donald.c.skidm...@intel.com>; e1000-devel@lists.sourceforge.net
> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
>
> Mark,
>
> Thanks!
>
> The test app can specify how many pthreads handle the connections. I have tried 4, 8, 16, etc., but none of them make a significant difference. CPU usage on the receive end is moderate (50-60%). If I poll aggressively to prevent any drop at the UDP layer, it might go up a bit. On the CPU set that handles the network interrupts, I did pin those CPUs, and I can see the interrupt rate is pretty even on all the CPUs involved.
>
> Since I am seeing a lot of rx_no_dma_resource, and this counter is read out of the 82599 controller, I would like to know why it happens. Note: I already bumped the rx ring size to the maximum (4096) I can set with ethtool.
>
> BTW, what is ATR? I didn't set up any filter...
>
> Hank
>
> On Wed, Sep 7, 2016 at 2:19 PM, Rustad, Mark D <mark.d.rus...@intel.com> wrote:
>
> Hank Liu <hank.tz...@gmail.com> wrote:
>
> > *From:* Hank Liu [mailto:hank.tz...@gmail.com]
> > *Sent:* Wednesday, September 07, 2016 10:20 AM
> > *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
> > *Cc:* e1000-devel@lists.sourceforge.net
> > *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
> >
> > Thanks for the quick response and the help. I guess what I didn't make clear is that the application (receiver, sender) opens 240 connections, and each connection carries 34 Mbps of traffic.
>
> You say that there are 240 connections, but how many threads is your app using? One per connection? What does the CPU utilization look like on the receiving end?
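
(A quick way to answer Mark's questions about CPU usage and interrupt spread on the receiver, with eth4 again standing in for the real port name:)

    mpstat -P ALL 1                             # per-CPU utilization; watch the %soft column for softirq load
    watch -d -n1 'grep eth4 /proc/interrupts'   # per-vector interrupt counts on each CPU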

> Also, the current ATR implementation does not support UDP, so you are probably better off not pinning the app threads at all and trusting that the scheduler will migrate them to the CPU that is getting their packets via RSS. You should still set the affinity of the interrupts in that case. The default number of queues should be fine.
>
> --
> Mark Rustad, Networking Division, Intel Corporation
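
(A minimal sketch of what "set the affinity of the interrupts" usually amounts to; the interface name, IRQ number, and CPU mask below are illustrative:)

    service irqbalance stop               # keep irqbalance from undoing manual affinity settings
    # the out-of-tree ixgbe source also ships a set_irq_affinity helper script;
    # done by hand, it comes down to writing a CPU mask per queue vector, e.g.:
    echo 4 > /proc/irq/98/smp_affinity    # mask 0x4 = CPU 2; IRQ 98 is just an example vector
    grep eth4 /proc/interrupts            # then verify the counts increase on the intended CPUs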