I forgot to ask a question about how to set interrupt moderation and make sure it is actually set. I am used to PROSet in the Windows world. With Linux I use ethtool -C to set it, but it only lets me set rx-usecs; setting other fields such as rx-frames does not take effect.
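For reference, a minimal sketch of setting and then reading back the coalescing parameters with ethtool; the interface name eth2 is a placeholder, and on ixgbe only the parameters the driver implements (such as rx-usecs) will take effect, which matches the behavior described above:

    # set receive interrupt moderation to roughly 100 microseconds per interrupt
    ethtool -C eth2 rx-usecs 100

    # read the values back to confirm what the driver actually accepted
    ethtool -c eth2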
So, what is the right way to set it? Thanks!

Hank

On Thu, Sep 8, 2016 at 12:16 PM, Hank Liu <hank.tz...@gmail.com> wrote:
> Hi Don,
>
> I already hacked the ixgbe driver, but I don't expect that will make much difference. I am still missing something...
>
> Maybe I need to look hard at DPDK down the road, but for now I am hoping there is a simple solution to get my product going...
>
> Hank
>
> On Thu, Sep 8, 2016 at 10:21 AM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:
>> Hey Hank,
>>
>> The reason I was interested in your results with a parallel UDP netperf test is that 1) if we didn't see the problem there, it would be a strong indication that the bottleneck is above the stack, or 2) if we did see the same performance issue, I would be able to recreate it internally here.
>>
>> I'm not as hopeful about increasing the ring size. Increasing this buffer would not be helpful if our descriptors aren't being drained as fast as data is coming in. With TCP I might expect smaller rings and bursty traffic to lead to retransmits, which would affect BW, but with UDP that would only come into play based on the protocol above UDP. Once again, a netperf test would help demonstrate whether this is happening. If you want to try it, you could hack up the driver and give it a shot. I haven't looked in detail, but it might be as simple as bumping up the IXGBE_MAX_RXD define. It would at least be a good place to start.
>>
>> Thanks,
>> -Don
>>
>> From: Hank Liu [mailto:hank.tz...@gmail.com]
>> Sent: Wednesday, September 07, 2016 6:40 PM
>> To: Skidmore, Donald C <donald.c.skidm...@intel.com>
>> Cc: Rustad, Mark D <mark.d.rus...@intel.com>; e1000-devel@lists.sourceforge.net
>> Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
>>
>> Hi Don,
>>
>> I have enough CPU budget to run the app aggressively, i.e. use many cores to deal with a 1x10G card. If the app were not draining socket data quickly enough, I would expect to see UDP-layer packet drops, but I don't.
>>
>> The reason I started tuning, obviously, is that I saw bad things with the default settings - no DMA resources, or pause frames being sent out. For example, the default rx ring size is 512, and I can see rx_no_dma_resources even with 4 Gbps of traffic at a ring size of 512. If I move up to 4096, it can handle up to 7-8 Gbps. Looking at the Intel 82599 controller data sheet, it appears to me that the chip supports rings of up to 8192 descriptors, but the driver does not. I am wondering if you guys can do something about that. Maybe that is the reason why we are seeing no DMA resources?
>>
>> I have great concern about when we need to deal with 4x10G ports. Any thoughts?
>>
>> Hank
>>
>> On Wed, Sep 7, 2016 at 6:07 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:
>>
>> Hey Hank,
>>
>> I must have misread your ethtool stats, as I only noticed 2 in the list and you clearly have 8 queues in play below. Too much multitasking today for me. :)
>>
>> That said, with that many sockets it would still be nice to be able to spread the load out to more CPUs, since it sounds like your application is capable of having enough threads to service them all. This makes me think that attempting ATR would be in order, once again assuming that the individual flows stay around long enough.
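For reference, a rough sketch of the parallel UDP netperf run mentioned earlier in this message; the receiver address 10.0.0.2, the stream count, and the message size are placeholders, and netserver must already be running on the receiving host:

    # receiver side: start the netperf daemon
    netserver

    # sender side: 8 parallel UDP streams for 30 seconds,
    # 1472-byte payloads to stay within a 1500-byte MTU
    for i in $(seq 1 8); do
        netperf -H 10.0.0.2 -t UDP_STREAM -l 30 -- -m 1472 &
    done
    wait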
>> If nothing else, I would mess with the RSS key to hopefully get a better spread; with your current spread it wouldn't be useful to use more than 8 threads.
>>
>> It might also be worth thinking about whether you are reaching some application limit - for some reason it isn't able to drain those queues as fast as the data is coming in. When you run, say, 8 parallel netperf UDP sessions, what kind of throughput do you see then?
>>
>> Thanks,
>> -Don <donald.c.skidm...@intel.com>
>>
>> From: Hank Liu [mailto:hank.tz...@gmail.com]
>> Sent: Wednesday, September 07, 2016 5:42 PM
>> To: Skidmore, Donald C <donald.c.skidm...@intel.com>
>> Cc: Rustad, Mark D <mark.d.rus...@intel.com>; e1000-devel@lists.sourceforge.net
>> Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
>>
>> Hi Don,
>>
>> Below is a snippet of the full log... How do you know the traffic only goes into 2 queues? I see more than 2 queues with similar packet counts... Can you explain more?
>>
>> If it is two queues, that would imply 2 cores handling 2 flows, right? But from watch -d -n1 cat /proc/interrupts I can see the interrupt rate increasing at a similar rate on all of the cores handling Ethernet interrupts.
>>
>> About our traffic: basically the same 34 Mbps stream is sent to 240 multicast addresses (225.82.10.0 - 225.82.10.119, 225.82.11.0 - 225.82.11.119). The receiver opens 240 sockets to pull the data out, check the size, then toss it, for test purposes.
>>
>> The test application can run multiple threads, given as a command-line input. Each thread handles 240 / N connections, where N is the thread count. I don't see much difference in behavior either way.
>>
>> Thanks!
>>
>> Hank
>>
>> _packets: 1105903
>> rx_queue_0_bytes: 1501816274
>> rx_queue_0_bp_poll_yield: 0
>> rx_queue_0_bp_misses: 0
>> rx_queue_0_bp_cleaned: 0
>> rx_queue_1_packets: 1108639
>> rx_queue_1_bytes: 1505531762
>> rx_queue_1_bp_poll_yield: 0
>> rx_queue_1_bp_misses: 0
>> rx_queue_1_bp_cleaned: 0
>> rx_queue_2_packets: 0
>> rx_queue_2_bytes: 0
>> rx_queue_2_bp_poll_yield: 0
>> rx_queue_2_bp_misses: 0
>> rx_queue_2_bp_cleaned: 0
>> rx_queue_3_packets: 0
>> rx_queue_3_bytes: 0
>> rx_queue_3_bp_poll_yield: 0
>> rx_queue_3_bp_misses: 0
>> rx_queue_3_bp_cleaned: 0
>> rx_queue_4_packets: 1656985
>> rx_queue_4_bytes: 2250185630
>> rx_queue_4_bp_poll_yield: 0
>> rx_queue_4_bp_misses: 0
>> rx_queue_4_bp_cleaned: 0
>> rx_queue_5_packets: 1107023
>> rx_queue_5_bytes: 1503337234
>> rx_queue_5_bp_poll_yield: 0
>> rx_queue_5_bp_misses: 0
>> rx_queue_5_bp_cleaned: 0
>> rx_queue_6_packets: 0
>> rx_queue_6_bytes: 0
>> rx_queue_6_bp_poll_yield: 0
>> rx_queue_6_bp_misses: 0
>> rx_queue_6_bp_cleaned: 0
>> rx_queue_7_packets: 0
>> rx_queue_7_bytes: 0
>> rx_queue_7_bp_poll_yield: 0
>> rx_queue_7_bp_misses: 0
>> rx_queue_7_bp_cleaned: 0
>> rx_queue_8_packets: 0
>> rx_queue_8_bytes: 0
>> rx_queue_8_bp_poll_yield: 0
>> rx_queue_8_bp_misses: 0
>> rx_queue_8_bp_cleaned: 0
>> rx_queue_9_packets: 0
>> rx_queue_9_bytes: 0
>> rx_queue_9_bp_poll_yield: 0
>> rx_queue_9_bp_misses: 0
>> rx_queue_9_bp_cleaned: 0
>> rx_queue_10_packets: 1668431
>> rx_queue_10_bytes: 2265729298
>> rx_queue_10_bp_poll_yield: 0
>> rx_queue_10_bp_misses: 0
>> rx_queue_10_bp_cleaned: 0
>> rx_queue_11_packets: 1106051
>> rx_queue_11_bytes: 1502017258
>> rx_queue_11_bp_poll_yield: 0
>> rx_queue_11_bp_misses: 0
>> rx_queue_11_bp_cleaned: 0
>> rx_queue_12_packets: 0
>> rx_queue_12_bytes: 0
>> rx_queue_12_bp_poll_yield: 0
>> rx_queue_12_bp_misses: 0
>> rx_queue_12_bp_cleaned: 0
>> rx_queue_13_packets: 0
>> rx_queue_13_bytes: 0
>> rx_queue_13_bp_poll_yield: 0
>> rx_queue_13_bp_misses: 0
>> rx_queue_13_bp_cleaned: 0
>> rx_queue_14_packets: 1107157
>> rx_queue_14_bytes: 1503519206
>> rx_queue_14_bp_poll_yield: 0
>> rx_queue_14_bp_misses: 0
>> rx_queue_14_bp_cleaned: 0
>> rx_queue_15_packets: 1107574
>> rx_queue_15_bytes: 1504085492
>> rx_queue_15_bp_poll_yield: 0
>> rx_queue_15_bp_misses: 0
>> rx_queue_15_bp_cleaned: 0
>> rx_queue_16_packets: 0
>> rx_queue_16_bytes: 0
>> rx_queue_16_bp_poll_yield: 0
>> rx_queue_16_bp_misses: 0
>> rx_queue_16_bp_cleaned: 0
>> rx_queue_1
>>
>> On Wed, Sep 7, 2016 at 5:04 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:
>>
>> Hey Hank,
>>
>> Well, it looks like all your traffic is just hashing to 2 queues. You have ATR enabled, but it isn't being used due to this being UDP traffic. That isn't a problem, since the RSS hash applies to anything that doesn't match ATR (in your case, everything). All this means is that you only have 2 flows, and thus all the work is being done on only two queues. To get a better hash spread you could modify the RSS hash key, but I would first look at your traffic to see whether you even have more than 2 flows operating. Maybe something can be done in the application to allow for more parallelism - run four threads, for instance (assuming each thread opens its own socket)?
>>
>> As for the rx_no_dma_resources counter, it is tied directly to one of our HW counters. It gets bumped if the target queue is disabled (unlikely in your case) or there are no free descriptors in the target queue. The latter makes sense here, since all of your traffic is going to just two queues that appear not to be getting drained fast enough.
>>
>> Thanks,
>> -Don <donald.c.skidm...@intel.com>
>>
>> From: Hank Liu [mailto:hank.tz...@gmail.com]
>> Sent: Wednesday, September 07, 2016 4:51 PM
>> To: Skidmore, Donald C <donald.c.skidm...@intel.com>
>> Cc: Rustad, Mark D <mark.d.rus...@intel.com>; e1000-devel@lists.sourceforge.net
>> Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
>>
>> Hi Don,
>>
>> I got a log for you to look at. See attached...
>>
>> Thanks, and let me know. BTW, can anyone tell me what could cause rx_no_dma_resources?
>>
>> Hank
>>
>> On Wed, Sep 7, 2016 at 4:04 PM, Skidmore, Donald C <donald.c.skidm...@intel.com> wrote:
>>
>> ATR is Application Targeted Routing. It may be useful for you, but a flow isn't directed to a CPU until you transmit on it, and since you mentioned you don't do much transmission, that would have to happen via the ACKs. Likewise, the flows need to stick around for a while to gain any advantage from it. Still, it wouldn't hurt to test using the ethtool command Alex mentioned in another email.
>>
>> In general I would like to see you just go with the default of 16 RSS queues and not attempt to mess with the affinitization of the interrupt vectors. If the performance is still bad, I would be interested in how the flows are being distributed between the queues.
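For reference, a minimal sketch of checking the two things discussed above - descriptor exhaustion and how the flows spread across queues; eth2 is a placeholder interface name, and the exact counter names can vary slightly between ixgbe versions:

    # descriptor exhaustion counter, and current vs. maximum ring sizes
    ethtool -S eth2 | grep rx_no_dma
    ethtool -g eth2
    ethtool -G eth2 rx 4096

    # per-queue packet counts, to see how RSS is spreading the load
    ethtool -S eth2 | grep -E 'rx_queue_[0-9]+_packets'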
>> You can see this via the per-queue packet counts you get out of the ethtool stats. What I want to eliminate is the possibility that RSS is seeing all your traffic as one flow.
>>
>> Thanks,
>> -Don <donald.c.skidm...@intel.com>
>>
>> From: Hank Liu [mailto:hank.tz...@gmail.com]
>> Sent: Wednesday, September 07, 2016 3:40 PM
>> To: Rustad, Mark D <mark.d.rus...@intel.com>
>> Cc: Skidmore, Donald C <donald.c.skidm...@intel.com>; e1000-devel@lists.sourceforge.net
>> Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
>>
>> Mark,
>>
>> Thanks!
>>
>> The test app can specify how many pthreads handle the connections. I have tried 4, 8, 16, etc., but none of them make a significant difference. CPU usage on the receiving end is moderate (50-60%). If I poll aggressively to prevent any drops at the UDP layer, it might go up a bit. I did pin the CPUs in the set that handles network interrupts, and I can see that the interrupt rate is fairly even across all the CPUs involved.
>>
>> Since I am seeing a lot of rx_no_dma_resources, and this counter is read out of the 82599 controller, I would like to know why it happens. Note: I already bumped the rx ring size to the maximum (4096) I can set with ethtool.
>>
>> BTW, what is ATR? I didn't set up any filters...
>>
>> Hank
>>
>> On Wed, Sep 7, 2016 at 2:19 PM, Rustad, Mark D <mark.d.rus...@intel.com> wrote:
>>
>> Hank Liu <hank.tz...@gmail.com> wrote:
>>
>> From: Hank Liu [mailto:hank.tz...@gmail.com]
>> Sent: Wednesday, September 07, 2016 10:20 AM
>> To: Skidmore, Donald C <donald.c.skidm...@intel.com>
>> Cc: e1000-devel@lists.sourceforge.net
>> Subject: Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs UDP performance issue
>>
>> Thanks for the quick response and the help. I guess what I didn't make clear is that the application (receiver, sender) opens 240 connections, and each connection carries 34 Mbps of traffic.
>>
>> You say that there are 240 connections, but how many threads is your app using? One per connection? What does the CPU utilization look like on the receiving end?
>>
>> Also, the current ATR implementation does not support UDP, so you are probably better off not pinning the app threads at all and trusting that the scheduler will migrate them to the CPU that is getting their packets via RSS. You should still set the affinity of the interrupts in that case. The default number of queues should be fine.
>>
>> --
>> Mark Rustad, Networking Division, Intel Corporation
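For reference, a rough sketch of the ATR and affinity checks touched on in this thread; eth2 and the IRQ number 98 are placeholders, and whether ethtool -x shows the RSS key as well as the indirection table depends on the driver and ethtool versions:

    # ntuple (Flow Director perfect filter) state; with it off, ixgbe uses
    # ATR signature filters for TCP flows and plain RSS for everything else
    ethtool -k eth2 | grep ntuple

    # RSS hash key and indirection table, if the driver exposes them
    ethtool -x eth2

    # watch how receive interrupts spread across cores
    watch -d -n1 'grep eth2 /proc/interrupts'

    # pin one queue's interrupt vector to a specific core, e.g. CPU 2
    echo 4 > /proc/irq/98/smp_affinity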