Yes, I have all kinds of offloads disabled. I'll ask HP to provide a detailed connection diagram, the one they so carefully avoid putting in the server manual. Supermicro publishes it all. Go figure.

By the way, how should prefetching be configured so that it does not interfere with DCA? Here's what I have. Should I disable the HW prefetcher and Adjacent Sector Prefetch? Anything more?
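For reference, this is roughly how I'm double-checking the offload and prefetch/DCA state from the OS side. rdmsr comes from msr-tools, and 0x1a4 is my assumption for the prefetcher-control MSR on this Xeon E5 v3 generation; worth verifying against Intel's prefetcher-control note before trusting the output.

  # Offloads actually off?
  ethtool -k p1p1 | grep -E 'large-receive-offload|generic-receive-offload|segmentation'

  # Hardware prefetchers (assumed: MSR 0x1a4, read per core; set bits mean a prefetcher is disabled)
  modprobe msr
  rdmsr -a 0x1a4

  # Is DCA actually active?
  dmesg | grep -i dca
  lsmod | grep -E '^dca|^ioatdma'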
> On 25 Sep 2016, at 03:55, Alexander Duyck <alexander.du...@gmail.com> wrote:
>
> On Sat, Sep 24, 2016 at 4:40 PM, Michał Purzyński <michalpurzyns...@gmail.com> wrote:
>> Thank you for being persistent with the answers.
>>
>> So right after sending the previous email I noticed that I had left over some careless IRQ assignments from experimenting with IRQ and process CPU affinity. Both cards were hitting the same core, which (for the second card) was on a different NUMA node, and that core was saturated.
>>
>> The result was around 38% packet loss, calculated by comparing packets received with rx_missed. It's interesting that no other counter was increasing.
>>
>> Right now I have moved card 0 to core 0 and card 1 to the first core of the second CPU.
>>
>> Now rx_missed is around 6-7% for each card. Still way too much.
>>
>> I send a total of 8-11 Gbit/sec to both cards, so each receives around half of that. Packet rate is 1.2 Mpps tops (also in total). All kinds of packet sizes.
>
> So if you are doing packet analysis I assume you don't need LRO or GRO. If not you may want to look into disabling them via "ethtool -K". I know RSC can sometimes cause packet drops due to aggregating a number of frames before finally submitting them to the device. Although that usually required ASPM to be enabled as well.
>
>> I'll lower the rings to 512 as the next step. Good to know about the card's limitations.
>>
>> Given that InterruptThrottleRate has to be given as a number of interrupts per second, what would you recommend I set it to, for a start at least? I have a 2.6 GHz Xeon E5 v3.
>
> So I would recommend a value no less than 12500 for InterruptThrottleRate. Assuming a reasonable packet rate that should give you a decent trade-off in terms of performance versus latency.
>
>> I'll buy a pair of X710 for a test as well. It will be an interesting comparison. Who knows, maybe the RSS implementation and MQ there are good enough for an IDS to be used.
>>
>> Fortunately I don't run it inline; this server receives a copy of the traffic.
>
> I'm not sure if it will get you much more throughput or not. I still find it odd that you're dropping packets even though the device isn't complaining about not having ring buffer resources. Usually that points to a bottleneck somewhere in the PCIe bus. You might want to double check and verify that the devices are connected directly to the root complex and not to some secondary bus on a PCIe switch that is actually downgrading the link between the device and the CPU socket.
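(Inline note: this is roughly what I intend to run to rule out the PCIe path and the aggregation offloads. The 04:00.0 bus address is only an example, and the last lines assume the out-of-tree SourceForge driver has to be reloaded to change InterruptThrottleRate, with each option repeated per port as I did before.)

  # Where do the NICs hang off the PCIe tree, and at what width/speed?
  lspci -tv
  lspci -vvv -s 04:00.0 | grep -E 'LnkCap|LnkSta'   # want Width x8, Speed 5GT/s end to end

  # Drop the aggregation offloads for capture, shrink the rings
  ethtool -K p1p1 lro off gro off
  ethtool -G p1p1 rx 512                            # smaller ring, smaller cache footprint

  # 12500 interrupts/s is roughly an 80 us interrupt interval (interfaces down first)
  rmmod ixgbe
  modprobe ixgbe MQ=0,0 DCA=2,2 RSS=1,1 VMDQ=0,0 InterruptThrottleRate=12500,12500 FCoE=0,0 LRO=0,0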
>> On Sat, Sep 24, 2016 at 3:20 AM, Alexander Duyck <alexander.du...@gmail.com> wrote:
>>>
>>> Well, as a general rule anything over about 80 usecs for InterruptThrottleRate is a waste. One advantage of reducing the interrupt throttle rate is that you can reduce the ring size, and you might see a slight performance improvement. One problem with using 4096 descriptors is that it greatly increases the cache footprint and leads to more buffer bloat and cache thrash, as you have to evict old descriptors to pull in new ones. I'm also sure that if you are doing an intrusion detection system (I'm assuming that is what IDS is in reference to), then the users would appreciate it if you didn't add up to a half dozen extra milliseconds of latency to their network (worst case with an elephant flow of 1514-byte frames).
>>>
>>> What size packets are you working with? One limitation of the 82599 is that it can only handle an upper limit of somewhere around 12 Mpps if you are using something like 6 queues, and only a little over 2 Mpps for a single queue. If you exceed 12 Mpps then the part will start reporting rx_missed, because the PCIe overhead for moving 64-byte packets is great enough that it actually causes us to exceed the limits of the x8 gen2 link. If the memcpy is what I think it is, then it allows us to avoid having to do two different atomic operations that would have been more expensive otherwise.
>>>
>>> On Fri, Sep 23, 2016 at 12:46 PM, Michał Purzyński <michalpurzyns...@gmail.com> wrote:
>>>> Here's what I did:
>>>>
>>>> ethtool -A p1p1 rx off tx off
>>>> ethtool -A p3p1 rx off tx off
>>>>
>>>> Both ethtool -a <interface> and the Arista that's pumping the data show that RX/TX pause are disabled.
>>>>
>>>> I have two cards, each connected to a separate NUMA node, threads pinned, etc.
>>>>
>>>> One non-standard thing is that I use a single queue only, because any form of multiqueue leads to packet reordering and confuses the IDS. An issue that's been hidden for a while in the NSM community.
>>>>
>>>> The driver (from SourceForge) was loaded with MQ=0 DCA=2 RSS=1 VMDQ=0 InterruptThrottleRate=956 FCoE=0 LRO=0 vxvlan_rx=0 (each option's value given enough times so it applies to all cards in this system).
>>>>
>>>> I could see the same issue sending traffic to just one card.
>>>>
>>>> Of course a single core is swamped with ACK-ing the hardware IRQ and then doing softIRQ (which seems to be mostly memcpy?). But then again, I don't see errors about lacking buffers (I run with 4096 descriptors).
>>>>
>>>> On Fri, Sep 23, 2016 at 9:22 PM, Alexander Duyck <alexander.du...@gmail.com> wrote:
>>>>>
>>>>> When you say you disabled flow control, did you disable it on the interface that is dropping packets or on the other end? You might try explicitly disabling it on the interface that is dropping packets; that in turn should enable per-queue drop instead of putting back-pressure onto the Rx FIFO.
>>>>>
>>>>> With flow control disabled on the local port you should see rx_no_dma_resources start incrementing if the issue is that one of the Rx rings is not keeping up.
>>>>>
>>>>> - Alex
>>>>>
>>>>> On Fri, Sep 23, 2016 at 11:09 AM, Michał Purzyński <michalpurzyns...@gmail.com> wrote:
>>>>>> xoff was increasing, so I disabled flow control.
>>>>>>
>>>>>> That's an HP DL360 Gen9, and lspci -vvv tells me the cards are connected to an x8 link, the speed is 5 GT/s, and ASPM is disabled.
>>>>>>
>>>>>> Other error counters are still zero. When I compared rx_packets and rx_missed_errors it looks like 38% (!!) of packets are getting lost.
>>>>>>
>>>>>> Unfortunately the HP documentation is a scam and they actively avoid publishing the motherboard layout.
>>>>>>
>>>>>> Any other place I could look for hints?
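(Inline note: for completeness, this is how the 38% and 6-7% figures above are derived. The counter names are as ethtool -S prints them for ixgbe here, and the formula assumes missed frames are not counted in rx_packets, which is how I read the driver.)

  ethtool -S p1p1 | grep -E 'rx_packets:|rx_missed_errors:|rx_no_buffer_count:|rx_no_dma_resources:|xoff'

  # loss estimate: missed / (received + missed)
  ethtool -S p1p1 | awk -F': *' '
      /^ *rx_packets:/       { rx  = $2 }
      /^ *rx_missed_errors:/ { mis = $2 }
      END { if (rx + mis > 0) printf "missed: %.1f%%\n", 100 * mis / (rx + mis) }'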
>>>>>>
>>>>>> On Fri, Sep 23, 2016 at 7:01 PM, Alexander Duyck <alexander.du...@gmail.com> wrote:
>>>>>>>
>>>>>>> On Fri, Sep 23, 2016 at 1:10 AM, Michał Purzyński <michalpurzyns...@gmail.com> wrote:
>>>>>>>> Hello.
>>>>>>>>
>>>>>>>> On my IDS workload with af_packet I can see rx_missed_errors growing while rx_no_buffer_count does not. Basically every other kind of rx_ error counter is 0, including rx_no_dma_resources. It's an 82599-based card.
>>>>>>>>
>>>>>>>> I don't know what to think about that. I went through the ixgbe source code and the 82599 datasheet, and it seems like rx_missed_errors means a new packet overwrote something already in the packet buffer (the FIFO queue on the card) because there was no more space in it.
>>>>>>>>
>>>>>>>> Now, that would happen if there is no place to DMA packets into - but that counter does not grow.
>>>>>>>>
>>>>>>>> Could you point me to where I should be looking for the problem?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Michal Purzynski
>>>>>>>
>>>>>>> The Rx missed count will increment if you are not able to receive a packet because the Rx FIFO is full. If you are not seeing any rx_no_dma_resources problems it might indicate that the problem is not with providing the DMA resources, but a problem on the bus itself. You might want to double check the slot the device is connected to in order to guarantee that there is an x8 link that supports 5 GT/s all the way through to the root complex.
>>>>>>>
>>>>>>> - Alex
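(Postscript for the archives: a small loop I am using to see when the misses actually occur, so they can be lined up with traffic bursts and the softirq load on the pinned core. The interface name is just an example, and the counter name is as ixgbe's ethtool -S reports it here.)

  iface=p1p1
  read_missed() { ethtool -S "$iface" | awk '/^ *rx_missed_errors:/ { print $2 }'; }
  prev=$(read_missed)
  while sleep 1; do
      cur=$(read_missed)
      printf '%s  rx_missed_errors/s: %d\n' "$(date +%T)" "$((cur - prev))"
      prev=$cur
  done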
------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired