I forgot to ask a question about how to set interrupt moderation and make
sure it actually takes effect. I'm used to PROSet in the Windows world. With
Linux, I use ethtool -C to set it, but it only lets me set rx-usecs; setting
other fields like rx-frames, etc. is not honored.

So, what is the right way to set it?
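
For reference, here is roughly what I have been running (eth2 is just a
stand-in for the real interface name), and how I check what actually took
effect:

    # request interrupt moderation settings; rx-usecs takes effect,
    # but rx-frames does not appear to be honored by the driver
    ethtool -C eth2 rx-usecs 30 rx-frames 32

    # read back the current coalescing settings to confirm
    ethtool -c eth2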

Thanks!


Hank


On Thu, Sep 8, 2016 at 12:16 PM, Hank Liu <hank.tz...@gmail.com> wrote:

> Hi Don,
>
> I already hacked the ixgbe driver, but I don't expect that to make much
> difference. I am still missing something...
>
> Maybe I need to take a hard look at DPDK down the road, but for now I am
> hoping there is a simple solution to get my product going...
>
>
> Hank
>
> On Thu, Sep 8, 2016 at 10:21 AM, Skidmore, Donald C <
> donald.c.skidm...@intel.com> wrote:
>
>> Hey Hank,
>>
>>
>>
>> The reason I was interested in your results with the parallel UDP netperf
>> test is that 1) if we didn't see the problem there, it would be a strong
>> indication that the bottleneck is above the stack, or 2) if we did see the
>> same performance issue, I would be able to recreate it internally here.
>>
>>
>>
>> I’m not as hopeful about increasing the ring size.  Increasing this
>> buffer would not be helpful if our descriptors aren’t being drained as
>> fast as data is coming in.  With TCP I might expect smaller rings and bursty
>> traffic to lead to retransmits, which would affect BW, but with UDP that
>> would only come into play based on the protocol above UDP.  Once again a
>> netperf test would help demonstrate whether this is happening.  If you want
>> to try it you could hack up the driver and give it a shot.  I haven’t looked
>> in detail but it might be as simple as bumping up the IXGBE_MAX_RXD
>> define.  It would at least be a good place to start.
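>>
>> In case it helps, a rough sketch of what that hack might look like with the
>> out-of-tree driver source (the paths and the exact location of the define
>> are assumptions on my part, so adjust to your tree):
>>
>>     # find where the descriptor-count limit lives in the driver source
>>     grep -rn "IXGBE_MAX_RXD" src/
>>
>>     # after bumping the define, rebuild and reload the module
>>     make -C src/
>>     rmmod ixgbe && insmod src/ixgbe.ko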
>>
>>
>>
>> Thanks,
>>
>> -Don
>>
>>
>>
>> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
>> *Sent:* Wednesday, September 07, 2016 6:40 PM
>>
>> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
>> *Cc:* Rustad, Mark D <mark.d.rus...@intel.com>;
>> e1000-devel@lists.sourceforge.net
>> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G
>> SFPs UDP performance issue
>>
>>
>>
>> Hi Don,
>>
>>
>>
>> I have enough budget to run the app aggressively - i.e. use many cores to
>> deal with one 10G card. If the app were not draining socket data quickly
>> enough, I would expect to see UDP-layer packet drops, but I didn't.
>>
>>
>>
>> The reason I started tuning is that I saw bad things with the default
>> settings - rx_no_dma_resource errors, or pause frames being sent out. For
>> example, the default rx ring size is 512, and I can see rx_no_dma_resource
>> even with 4 Gbps of traffic at 512 descriptors. If I move up to 4096, it can
>> handle up to 7-8 Gbps. Looking at the Intel 82599 controller datasheet, it
>> appears to me that the chip supports up to 8192 descriptors per ring, but
>> the driver does not. I am wondering if you guys can do something about it.
>> Maybe that is the reason we are seeing rx_no_dma_resource?
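>>
>> For reference, this is roughly how I have been checking and bumping the ring
>> size (eth2 stands in for the actual interface):
>>
>>     # show the hardware maximum and the current ring sizes
>>     ethtool -g eth2
>>
>>     # bump the RX ring to the largest value the driver accepts
>>     ethtool -G eth2 rx 4096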
>>
>>
>>
>> I have great concerns about the case where we need to deal with 4x10G ports. Any thoughts?
>>
>>
>>
>>
>>
>> Hank
>>
>>
>>
>> On Wed, Sep 7, 2016 at 6:07 PM, Skidmore, Donald C <
>> donald.c.skidm...@intel.com> wrote:
>>
>> Hey Hank,
>>
>>
>>
>> I must have misread your ethtool stats, as I only noticed 2 in the list
>> and you clearly have 8 queues in play below.  Too much multitasking today
>> for me. :)
>>
>>
>>
>> That said, with that many sockets it would still be nice to be able to
>> spread the load out to more CPUs, since it sounds like your application
>> would be capable of having enough threads to service them all.
>> This makes me think that attempting ATR would be in order, once again
>> assuming that the individual flows stay around long enough.  If nothing
>> else I would mess with the RSS key to hopefully get a better spread; with
>> your current spread it wouldn’t be useful to use more than 8 threads.
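>>
>> If you want to poke at the RSS configuration, something along these lines
>> should work, assuming your ethtool and driver combination supports it (eth2
>> is a placeholder):
>>
>>     # show the current RSS hash key and indirection table
>>     ethtool -x eth2
>>
>>     # spread the indirection table evenly across all 16 queues
>>     ethtool -X eth2 equal 16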
>>
>>
>>
>> It might also be worth thinking about whether you are reaching some
>> application limit - for some reason it isn’t able to drain those queues as
>> fast as the data is coming in.  When you run, say, 8 parallel netperf UDP
>> sessions, what kind of throughput do you see then?
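>>
>> Something along these lines would approximate the test I have in mind (the
>> target host, run length, and message size are just example values):
>>
>>     # launch 8 parallel UDP_STREAM tests against the sender box
>>     for i in $(seq 1 8); do
>>         netperf -H 192.168.1.10 -t UDP_STREAM -l 30 -- -m 1400 &
>>     done
>>     wait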
>>
>>
>>
>> Thanks,
>>
>> -Don <donald.c.skidm...@intel.com>
>>
>>
>>
>> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
>> *Sent:* Wednesday, September 07, 2016 5:42 PM
>>
>>
>> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
>> *Cc:* Rustad, Mark D <mark.d.rus...@intel.com>;
>> e1000-devel@lists.sourceforge.net
>> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G
>> SFPs UDP performance issue
>>
>>
>>
>> Hi Don,
>>
>>
>>
>> Below is a snippet of the full log... How can you tell the traffic only goes
>> into 2 queues? I see more than 2 queues with similar packet counts... Can
>> you explain more?
>>
>>
>>
>> If it really were two queues, that would imply 2 cores handle the 2 flows,
>> right? But from watch -d -n1 cat /proc/interrupts, I can see the interrupt
>> rate increasing at about the same rate on all of the cores handling Ethernet
>> interrupts.
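>>
>> In case it matters, this is how I am watching the interrupt rates (eth2 is a
>> stand-in for the real interface name):
>>
>>     # highlight per-vector interrupt count changes every second
>>     watch -d -n1 'grep eth2 /proc/interrupts'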
>>
>>
>>
>> About our traffic: basically the same 34 Mbps stream is sent to 240
>> multicast addresses (225.82.10.0 - 225.82.10.119, 225.82.11.0 - 225.82.11.119).
>> The receiver opens 240 sockets to pull the data out, checks the size, then
>> tosses it for test purposes.
>>
>>
>>
>> The test application can run multiple threads, given as a command-line
>> input. Each thread handles 240 / N connections, where N is the number of
>> threads. I don't see much difference in terms of behavior.
>>
>>
>>
>> Thanks!
>>
>>
>>
>> Hank
>>
>>      rx_queue_0_packets: 1105903
>>
>>      rx_queue_0_bytes: 1501816274
>>
>>      rx_queue_0_bp_poll_yield: 0
>>
>>      rx_queue_0_bp_misses: 0
>>
>>      rx_queue_0_bp_cleaned: 0
>>
>>      rx_queue_1_packets: 1108639
>>
>>      rx_queue_1_bytes: 1505531762
>>
>>      rx_queue_1_bp_poll_yield: 0
>>
>>      rx_queue_1_bp_misses: 0
>>
>>      rx_queue_1_bp_cleaned: 0
>>
>>      rx_queue_2_packets: 0
>>
>>      rx_queue_2_bytes: 0
>>
>>      rx_queue_2_bp_poll_yield: 0
>>
>>      rx_queue_2_bp_misses: 0
>>
>>      rx_queue_2_bp_cleaned: 0
>>
>>      rx_queue_3_packets: 0
>>
>>      rx_queue_3_bytes: 0
>>
>>      rx_queue_3_bp_poll_yield: 0
>>
>>      rx_queue_3_bp_misses: 0
>>
>>      rx_queue_3_bp_cleaned: 0
>>
>>      rx_queue_4_packets: 1656985
>>
>>      rx_queue_4_bytes: 2250185630
>>
>>      rx_queue_4_bp_poll_yield: 0
>>
>>      rx_queue_4_bp_misses: 0
>>
>>      rx_queue_4_bp_cleaned: 0
>>
>>      rx_queue_5_packets: 1107023
>>
>>      rx_queue_5_bytes: 1503337234
>>
>>      rx_queue_5_bp_poll_yield: 0
>>
>>      rx_queue_5_bp_misses: 0
>>
>>      rx_queue_5_bp_cleaned: 0
>>
>>      rx_queue_6_packets: 0
>>
>>      rx_queue_6_bytes: 0
>>
>>      rx_queue_6_bp_poll_yield: 0
>>
>>      rx_queue_6_bp_misses: 0
>>
>>      rx_queue_6_bp_cleaned: 0
>>
>>      rx_queue_7_packets: 0
>>
>>      rx_queue_7_bytes: 0
>>
>>      rx_queue_7_bp_poll_yield: 0
>>
>>      rx_queue_7_bp_misses: 0
>>
>>      rx_queue_7_bp_cleaned: 0
>>
>>      rx_queue_8_packets: 0
>>
>>      rx_queue_8_bytes: 0
>>
>>      rx_queue_8_bp_poll_yield: 0
>>
>>      rx_queue_8_bp_misses: 0
>>
>>      rx_queue_8_bp_cleaned: 0
>>
>>      rx_queue_9_packets: 0
>>
>>      rx_queue_9_bytes: 0
>>
>>      rx_queue_9_bp_poll_yield: 0
>>
>>      rx_queue_9_bp_misses: 0
>>
>>      rx_queue_9_bp_cleaned: 0
>>
>>      rx_queue_10_packets: 1668431
>>
>>      rx_queue_10_bytes: 2265729298
>>
>>      rx_queue_10_bp_poll_yield: 0
>>
>>      rx_queue_10_bp_misses: 0
>>
>>      rx_queue_10_bp_cleaned: 0
>>
>>      rx_queue_11_packets: 1106051
>>
>>      rx_queue_11_bytes: 1502017258
>>
>>      rx_queue_11_bp_poll_yield: 0
>>
>>      rx_queue_11_bp_misses: 0
>>
>>      rx_queue_11_bp_cleaned: 0
>>
>>      rx_queue_12_packets: 0
>>
>>      rx_queue_12_bytes: 0
>>
>>      rx_queue_12_bp_poll_yield: 0
>>
>>      rx_queue_12_bp_misses: 0
>>
>>      rx_queue_12_bp_cleaned: 0
>>
>>      rx_queue_13_packets: 0
>>
>>      rx_queue_13_bytes: 0
>>
>>      rx_queue_13_bp_poll_yield: 0
>>
>>      rx_queue_13_bp_misses: 0
>>
>>      rx_queue_13_bp_cleaned: 0
>>
>>      rx_queue_14_packets: 1107157
>>
>>      rx_queue_14_bytes: 1503519206
>>
>>      rx_queue_14_bp_poll_yield: 0
>>
>>      rx_queue_14_bp_misses: 0
>>
>>      rx_queue_14_bp_cleaned: 0
>>
>>      rx_queue_15_packets: 1107574
>>
>>      rx_queue_15_bytes: 1504085492
>>
>>      rx_queue_15_bp_poll_yield: 0
>>
>>      rx_queue_15_bp_misses: 0
>>
>>      rx_queue_15_bp_cleaned: 0
>>
>>      rx_queue_16_packets: 0
>>
>>      rx_queue_16_bytes: 0
>>
>>      rx_queue_16_bp_poll_yield: 0
>>
>>      rx_queue_16_bp_misses: 0
>>
>>      rx_queue_1
>>
>>
>>
>> On Wed, Sep 7, 2016 at 5:04 PM, Skidmore, Donald C <
>> donald.c.skidm...@intel.com> wrote:
>>
>> Hey Hank,
>>
>>
>>
>> Well, it looks like all your traffic is just hashing to 2 queues.  You have
>> ATR enabled but it isn’t being used, due to this being UDP traffic.  That
>> isn’t a problem since the RSS hash will occur on anything that doesn’t match
>> ATR (in your case everything).  All this means is that you only have 2 flows
>> and thus all the work is being done with only two queues.  To get a better
>> hash spread you could modify the RSS hash key, but I would first look at
>> your traffic to see if you even have more than 2 flows operating.  Maybe
>> something can be done in the application to allow for more parallelism - run
>> four threads, for instance (assuming each thread opens its own socket)?
>>
>>
>>
>> As for the rx_no_dma_resource counter, it is tied directly to one of
>> our HW counters.  It gets bumped if the target queue is disabled (unlikely
>> in your case) or there are no free descriptors in the target queue.  The
>> latter makes sense since all of your traffic is going to just two queues
>> that appear to not be getting drained fast enough.
>>
>>
>>
>> Thanks,
>>
>> -Don <donald.c.skidm...@intel.com>
>>
>>
>>
>>
>>
>>
>>
>> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
>> *Sent:* Wednesday, September 07, 2016 4:51 PM
>> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
>> *Cc:* Rustad, Mark D <mark.d.rus...@intel.com>;
>> e1000-devel@lists.sourceforge.net
>>
>>
>> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G
>> SFPs UDP performance issue
>>
>>
>>
>> Hi Don,
>>
>>
>>
>> I got a log for you to look at. See attached...
>>
>>
>>
>> Thanks, and let me know. BTW, can anyone tell me what could cause
>> rx_no_dma_resource?
>>
>>
>>
>> Hank
>>
>>
>>
>> On Wed, Sep 7, 2016 at 4:04 PM, Skidmore, Donald C <
>> donald.c.skidm...@intel.com> wrote:
>>
>> ATR is application targeted receive.  It may be useful for you, but the
>> flow isn’t directed to a CPU until you transmit, and since you mentioned you
>> don’t do much transmission it would have to be via the ACKs.  Likewise the
>> flows will need to stick around for a while to gain any advantage from it.
>> Still, it wouldn’t hurt to test using the ethtool command Alex mentioned in
>> another email.
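>>
>> As a sanity check, the Flow Director / ATR counters in the ethtool stats
>> should show whether ATR is matching anything at all (counter names may vary
>> a bit by driver version; eth2 is a placeholder):
>>
>>     ethtool -S eth2 | grep -i fdir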
>>
>>
>>
>> In general I would like to see you just go with the default of 16 RSS
>> queues and not attempt to mess with the affinity of the interrupt
>> vectors.  If the performance is still bad I would be interested in how the
>> flows are being distributed between the queues.  You can see this via the
>> per-queue packet counts you get out of the ethtool stats.  What I want to
>> eliminate is the possibility that RSS is seeing all your traffic as one
>> flow.
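>>
>> For example, something like this would give the per-queue picture (eth2 is a
>> placeholder for your interface):
>>
>>     ethtool -S eth2 | grep -E 'rx_queue_[0-9]+_packets'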
>>
>>
>>
>> Thanks,
>>
>> -Don <donald.c.skidm...@intel.com>
>>
>>
>>
>>
>>
>> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
>> *Sent:* Wednesday, September 07, 2016 3:40 PM
>> *To:* Rustad, Mark D <mark.d.rus...@intel.com>
>> *Cc:* Skidmore, Donald C <donald.c.skidm...@intel.com>;
>> e1000-devel@lists.sourceforge.net
>> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G
>> SFPs UDP performance issue
>>
>>
>>
>> Mark,
>>
>>
>>
>> Thanks!
>>
>>
>>
>> The test app can specify how many pthreads handle the connections. I have
>> tried 4, 8, 16, etc., but none of them make a significant difference. CPU
>> usage on the receive end is moderate (50-60%). If I poll aggressively to
>> prevent any drops at the UDP layer, it might go up a bit. I did pin the CPUs
>> that handle the network interrupts, and I can see the interrupt rate
>> is pretty even on all of the CPUs involved.
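>>
>> For what it's worth, the pinning is done roughly like this (the IRQ number
>> and CPU mask below are just examples taken from /proc/interrupts):
>>
>>     # keep irqbalance from moving the vectors around afterwards
>>     service irqbalance stop
>>
>>     # pin one queue's interrupt vector to a specific core (here IRQ 45 to CPU 2)
>>     echo 4 > /proc/irq/45/smp_affinity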
>>
>>
>>
>> Since I am seeing a lot of rx_no_dma_resource, and this counter is read out
>> of the 82599 controller, I would like to know why it happens. Note: I
>> already bumped the rx ring size to the maximum (4096) that I can set with
>> ethtool.
>>
>>
>>
>> BTW, what is ATR? I didn't set up any filters...
>>
>>
>>
>>
>>
>> Hank
>>
>>
>>
>> On Wed, Sep 7, 2016 at 2:19 PM, Rustad, Mark D <mark.d.rus...@intel.com>
>> wrote:
>>
>> Hank Liu <hank.tz...@gmail.com> wrote:
>>
>> *From:* Hank Liu [mailto:hank.tz...@gmail.com]
>> *Sent:* Wednesday, September 07, 2016 10:20 AM
>> *To:* Skidmore, Donald C <donald.c.skidm...@intel.com>
>> *Cc:* e1000-devel@lists.sourceforge.net
>> *Subject:* Re: [E1000-devel] Intel 82599 AXX10GBNIAIOM cards for 10G SFPs
>> UDP performance issue
>>
>>
>>
>> Thanks for the quick response and the help. I guess what I didn't make clear
>> is that the application (receiver, sender) opens 240 connections, and each
>> connection carries 34 Mbps of traffic.
>>
>>
>> You say that there are 240 connections, but how many threads is your app
>> using? One per connection? What does the CPU utilization look like on the
>> receiving end?
>>
>> Also, the current ATR implementation does not support UDP, so you are
>> probably better off not pinning the app threads at all and trusting that
>> the scheduler will migrate them to the CPU that is getting their packets
>> via RSS. You should still set the affinity of the interrupts in that case.
>> The default number of queues should be fine.
>>
>> --
>> Mark Rustad, Networking Division, Intel Corporation
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired
