Okay, I'll just paste the bits here that I think are relevant. Specifically the symbols that are at or above 0.5% CPU utilization.
# Overhead       sys       usr  Command      Shared Object      Symbol
# ........  ........  ........  ...........  .................  ..............................
    12.91%    12.91%     0.00%  swapper      [kernel.kallsyms]  [k] tpacket_rcv
    11.22%    11.22%     0.00%  swapper      [kernel.kallsyms]  [k] memcpy_erms
     4.03%     0.00%     4.03%  W#01-p1p1    libhs.so.4.2.0     [.] fdr_engine_exec
     3.17%     3.17%     0.00%  W#01-p1p1    [kernel.kallsyms]  [k] tpacket_rcv
     3.10%     3.10%     0.00%  W#01-p1p1    [kernel.kallsyms]  [k] memcpy_erms
     2.65%     2.65%     0.00%  swapper      [kernel.kallsyms]  [k] __netif_receive_skb_core
     2.61%     0.00%     2.61%  W#01-p1p1    libhs.so.4.2.0     [.] nfaExecMcClellan16_B
     2.41%     2.41%     0.00%  swapper      [kernel.kallsyms]  [k] ixgbe_clean_rx_irq
     2.40%     2.40%     0.00%  swapper      [kernel.kallsyms]  [k] mwait_idle
     1.91%     0.00%     1.91%  W#01-p1p1    libc-2.19.so       [.] memset
     1.52%     1.52%     0.00%  swapper      [kernel.kallsyms]  [k] consume_skb
     1.29%     1.29%     0.00%  swapper      [kernel.kallsyms]  [k] __skb_get_hash
     1.18%     1.18%     0.00%  swapper      [kernel.kallsyms]  [k] prb_fill_curr_block.isra.59
     1.09%     1.09%     0.00%  swapper      [kernel.kallsyms]  [k] __skb_flow_dissect
     1.06%     1.06%     0.00%  swapper      [kernel.kallsyms]  [k] __build_skb
     1.04%     1.04%     0.00%  swapper      [kernel.kallsyms]  [k] packet_rcv
     0.89%     0.89%     0.00%  swapper      [kernel.kallsyms]  [k] irq_entries_start
     0.82%     0.82%     0.00%  ksoftirqd/0  [kernel.kallsyms]  [k] memcpy_erms
     0.78%     0.78%     0.00%  ksoftirqd/0  [kernel.kallsyms]  [k] tpacket_rcv
     0.72%     0.72%     0.00%  swapper      [kernel.kallsyms]  [k] skb_copy_bits
     0.71%     0.00%     0.71%  W#01-p1p1    suricata           [.] SigMatchSignatures
     0.69%     0.00%     0.69%  W#01-p1p1    libc-2.19.so       [.] malloc
     0.66%     0.66%     0.00%  W#01-p1p1    [kernel.kallsyms]  [k] ixgbe_clean_rx_irq
     0.63%     0.63%     0.00%  W#01-p1p1    [kernel.kallsyms]  [k] __netif_receive_skb_core
     0.51%     0.51%     0.00%  swapper      [kernel.kallsyms]  [k] kfree_skb

So looking over what you sent me, this doesn't look so much like a driver issue; the kernel overhead for processing these frames is pretty significant, with at least something like 25% of the CPU time being spent in tpacket_rcv, or in a memcpy done to service tpacket_rcv. I haven't had much experience with Suricata, but you might want to check with experts on it if you haven't already, as it seems like significant CPU time is being consumed in the kernel/userspace handoff. If nothing else, you might try bringing up questions on how to improve raw socket performance on the netdev mailing list.

I just remembered that you disabled RSS. That is the reason why you are not seeing any rx_no_dma_resources errors: for packets to be dropped per ring, you have to have more than one ring enabled.

I did some quick googling on why Suricata might not support RSS, and I gather it has to do with Tx and Rx traffic of the same flow not ending up on the same queue. That is actually pretty easy to fix. All you would need to do is pass the module parameter ATR=0 to disable ATR, and change the RSS key on the device to use a repeating 16-bit value. You can find a paper detailing some of that here:
http://www.ndsl.kaist.edu/~kyoungsoo/papers/TR-symRSS.pdf

Other than these tips I don't know if there is much more info I can provide. It looks like you will need to add more CPU power in order to handle the load, as you are currently maxing out the one thread you are using.

- Alex

On Sun, Sep 25, 2016 at 4:47 PM, Michał Purzyński <michalpurzyns...@gmail.com> wrote:
>
> Sent off list, because files are around a MB.
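The symmetric-RSS key Alex suggests (a repeating 16-bit value, per the symRSS paper) can be sketched with a few lines of shell. This is only a sketch under assumptions: the interface name p1p1 is taken from the thread, the 82599's RSS key is assumed to be 40 bytes, and the `ethtool -X ... hkey` option requires a reasonably recent ethtool. The privileged, hardware-dependent calls are left commented out.

```shell
# Build a 40-byte RSS key consisting of the 16-bit value 0x6d5a repeated,
# which makes the Toeplitz hash symmetric for swapped src/dst pairs.
KEY=$(printf '6d:5a:%.0s' $(seq 1 20))
KEY=${KEY%:}                 # drop the trailing colon
echo "$KEY"
# On the live box (hypothetical invocation -- needs root and driver support):
# modprobe ixgbe ATR=0      # disable ATR so RSS alone steers flows, per Alex
# ethtool -X p1p1 hkey "$KEY"
```

The idea is that with an identical 16-bit pattern in every key position, hashing (src, dst) and (dst, src) yields the same queue, which avoids the reordering problem the IDS is sensitive to.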
>
> On Mon, Sep 26, 2016 at 1:28 AM, Alexander Duyck <alexander.du...@gmail.com> wrote:
>>
>> If you can just send me the output from "perf report" it would be more
>> useful. The problem is that the raw data you sent me doesn't do me any
>> good without the symbol tables and such, and those would be too large to
>> be sending over email.
>>
>> What I am basically looking for is a dump with the symbol names that are
>> taking up the CPU time. From there I can probably start to understand
>> what is going on.
>>
>> - Alex
>>
>> On Sun, Sep 25, 2016 at 4:09 PM, Michał Purzyński
>> <michalpurzyns...@gmail.com> wrote:
>>>
>>> perf record (and perf top) show interesting results indeed. For one,
>>> there was a lock function with _slowpath_ in its name, which perf top -g
>>> quickly traced to cpufreq; I ended up setting the performance governor,
>>> and that slowpath call is gone now.
>>>
>>> Some rx_missed are still here. Much less, but traffic is also far from
>>> what it is on weekdays. Below you will find links to perf.data and the
>>> results of perf script -D (let me know if I got it wrong):
>>>
>>> https://drive.google.com/file/d/0B4XJBHc9i84dRXU5eE5FRFBsVUU/view?usp=sharing
>>> https://drive.google.com/file/d/0B4XJBHc9i84dd2ZSREUtN2Z4dDQ/view?usp=sharing
>>>
>>> I made triple sure that VT-d is disabled, so the IOMMU is gone with it,
>>> from day one I received this server.
>>>
>>> On Sun, Sep 25, 2016 at 8:21 PM, Alexander Duyck
>>> <alexander.du...@gmail.com> wrote:
>>>>
>>>> You probably don't need to bother with disabling any other prefetchers
>>>> or anything like that.
>>>>
>>>> One thing that did occur to me is that when you are running your test
>>>> you might try to capture a perf trace on the core that the interrupt is
>>>> running on. All you need to do to capture that is run "perf record -C
>>>> <cpu num> sleep 20" while your test is running. Then dump perf report
>>>> to a logfile of your choice and send us the results.
>>>> That should help us identify any hot spots that might be eating up
>>>> extra CPU time.
>>>>
>>>> Also, when you are in the BIOS you might try looking to see if you have
>>>> an IOMMU or VT-d feature enabled. If you do, you might want to try
>>>> disabling it to see if that gives you any performance boost. If so, you
>>>> could try booting with the kernel parameter iommu=pt, which should
>>>> switch the system over to identity-mapping the device and would save
>>>> you some considerable time.
>>>>
>>>> On Sun, Sep 25, 2016 at 9:40 AM, Michał Purzyński
>>>> <michalpurzyns...@gmail.com> wrote:
>>>>>
>>>>> Yes, I have all kinds of offloads disabled. I'll ask HP to provide a
>>>>> detailed connection scheme, the one they avoid so carefully in the
>>>>> server manual. Supermicro publishes it all. Go figure.
>>>>>
>>>>> Btw, how should prefetching be configured so it does not interfere
>>>>> with DCA? Here's what I have. Should I disable the HW prefetcher and
>>>>> Adjacent Sector Prefetch? Anything more?
>>>>>
>>>>> On 25 Sep 2016, at 03:55, Alexander Duyck <alexander.du...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> On Sat, Sep 24, 2016 at 4:40 PM, Michał Purzyński
>>>>> <michalpurzyns...@gmail.com> wrote:
>>>>>
>>>>> Thank you for being persistent with your answers.
>>>>>
>>>>> Right after sending the previous email I noticed that I had left over
>>>>> some careless IRQ assignments from experimenting with IRQ and process
>>>>> CPU affinity. Both cards were hitting the same core, which (for the
>>>>> second card) was on a different NUMA node, plus that core was
>>>>> saturated.
>>>>>
>>>>> The result was around 38% of packets lost, calculated by comparing
>>>>> packets received with rx_missed. It's interesting that no other
>>>>> counter was increasing.
>>>>>
>>>>> Right now I have moved card 0 to core 0, and card 1 to the first core
>>>>> of the second CPU.
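The affinity pinning described above, together with Alex's earlier "perf record -C <cpu num> sleep 20" suggestion, can be sketched as follows. The core and IRQ numbers are placeholders (the real IRQ number comes from /proc/interrupts on the box), and the privileged, hardware-dependent commands are left commented out.

```shell
IRQ_CPU=0                                  # core the card's IRQ should land on
MASK=$(printf '%x' $((1 << IRQ_CPU)))      # hex bitmask form used by smp_affinity
echo "$MASK"
# On the live box (needs root; 123 is a placeholder IRQ number):
# echo "$MASK" > /proc/irq/123/smp_affinity
# Then profile that core for 20 seconds while traffic is flowing:
# perf record -C "$IRQ_CPU" -g sleep 20 && perf report --stdio > report.txt
```

Pinning the IRQ and the worker to cores on the NUMA node the card is attached to is the point of the exercise; the bitmask is just how the kernel expects the core set to be expressed.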
>>>>>
>>>>> Now rx_missed is around 6-7% for each card. Still way too much.
>>>>>
>>>>> I send a total of 8-11 Gbit/sec to both cards, so each receives around
>>>>> half of that. Packet rate is 1.2Mpps top (also in total). All kinds of
>>>>> packet sizes.
>>>>>
>>>>> So if you are doing packet analysis I assume you don't need LRO or
>>>>> GRO. If you don't, you may want to look into disabling them via
>>>>> "ethtool -K". I know RSC can sometimes cause packet drops due to
>>>>> aggregating a number of frames before finally submitting them to the
>>>>> device, although that usually requires ASPM to be enabled as well.
>>>>>
>>>>> I'll lower the rings to 512 as the next step. Good to know about the
>>>>> card's limitations.
>>>>>
>>>>> Given that InterruptThrottleRate has to be given as a number of
>>>>> interrupts per second, what would you recommend I set it to, for a
>>>>> start at least? I have a 2.6GHz Xeon E5 v3.
>>>>>
>>>>> So I would recommend a value no less than 12500 for
>>>>> InterruptThrottleRate. Assuming a reasonable packet rate, that should
>>>>> give you a decent trade-off in terms of performance versus latency.
>>>>>
>>>>> I'll buy a pair of X710 cards for a test as well. It will be an
>>>>> interesting comparison. Who knows, maybe the RSS implementation and MQ
>>>>> there are good enough for IDS use.
>>>>>
>>>>> Fortunately I don't run it inline; this server receives a copy of the
>>>>> traffic.
>>>>>
>>>>> I'm not sure if it will get you much more throughput or not. I still
>>>>> find it odd that you're dropping packets even though the device isn't
>>>>> complaining about not having ring buffer resources. Usually that
>>>>> points to a bottleneck somewhere in the PCIe bus.
>>>>> You might want to double check and verify that the devices are
>>>>> connected directly to the root complex, and not to some secondary bus
>>>>> on a PCIe switch that is actually downgrading the link between the
>>>>> device and the CPU socket.
>>>>>
>>>>> On Sat, Sep 24, 2016 at 3:20 AM, Alexander Duyck
>>>>> <alexander.du...@gmail.com> wrote:
>>>>>
>>>>> Well, as a general rule anything over about 80 usecs for
>>>>> InterruptThrottleRate is a waste. One advantage to reducing the
>>>>> interrupt throttle rate is that you can reduce the ring size, and you
>>>>> might see a slight performance improvement. One problem with using
>>>>> 4096 descriptors is that it greatly increases the cache footprint and
>>>>> leads to more buffer bloat and cache thrash, as you have to evict old
>>>>> descriptors to pull in new ones. I'm also sure that if you are doing
>>>>> an intrusion detection system (I'm assuming that is what IDS is in
>>>>> reference to), the users would appreciate it if you didn't add up to
>>>>> half a dozen extra milliseconds of latency to their network (worst
>>>>> case with an elephant flow of 1514-byte frames).
>>>>>
>>>>> What size packets are you working with? One limitation of the 82599 is
>>>>> that it can only handle an upper limit of somewhere around 12Mpps if
>>>>> you are using something like 6 queues, and only a little over 2Mpps
>>>>> for a single queue. If you exceed 12Mpps then the part will start
>>>>> reporting rx_missed, because the PCIe overhead for moving 64-byte
>>>>> packets is great enough that it actually causes us to exceed the
>>>>> limits of the x8 gen2 link. If the memcpy is what I think it is, then
>>>>> it allows us to avoid having to do two different atomic operations
>>>>> that would have been more expensive otherwise.
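Two of the numbers in Alex's replies can be checked with quick shell arithmetic: the equivalence between 80 usecs and 12500 interrupts per second for InterruptThrottleRate, and the "half a dozen extra milliseconds" worst case of draining a full 4096-entry ring of 1514-byte frames over a 10Gb/s link. A back-of-the-envelope sketch:

```shell
# 12500 interrupts/sec is one interrupt every 80 microseconds:
echo $((1000000 / 12500))          # usecs between interrupts

# Worst-case latency hiding in a full ring: 4096 descriptors x 1514 bytes
# x 8 bits, drained at 10 Gb/s (i.e. 10,000 bits per microsecond):
BITS=$((4096 * 1514 * 8))
echo $((BITS / 10000))             # usecs, on the order of 5 ms
```

This is why a smaller ring plus a modest throttle rate can be a better trade: the 4096-descriptor ring buys headroom against bursts at the cost of several milliseconds of buffered latency under sustained load.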
>>>>> On Fri, Sep 23, 2016 at 12:46 PM, Michał Purzyński
>>>>> <michalpurzyns...@gmail.com> wrote:
>>>>>
>>>>> Here's what I did:
>>>>>
>>>>> ethtool -A p1p1 rx off tx off
>>>>> ethtool -A p3p1 rx off tx off
>>>>>
>>>>> Both ethtool -a <interface> and the Arista that's pumping the data
>>>>> show that RX/TX pause are disabled.
>>>>>
>>>>> I have two cards, each connected to a separate NUMA node, threads
>>>>> pinned, etc.
>>>>>
>>>>> One non-standard thing is that I use a single queue only, because any
>>>>> form of multiqueue leads to packet reordering and confuses the IDS. An
>>>>> issue that's been hidden for a while in the NSM community.
>>>>>
>>>>> The driver (from SourceForge) was loaded with MQ=0 DCA=2 RSS=1 VMDQ=0
>>>>> InterruptThrottleRate=956 FCoE=0 LRO=0 vxlan_rx=0 (each option's value
>>>>> given enough times so it applies to all cards in this system).
>>>>>
>>>>> I could see the same issue sending traffic to just one card.
>>>>>
>>>>> Of course a single core is swamped with ACK-ing the hardware IRQ and
>>>>> then doing the softIRQ work (which seems to be mostly memcpy?). But
>>>>> then again, I don't see errors about lacking buffers (I run with 4096
>>>>> descriptors).
>>>>>
>>>>> On Fri, Sep 23, 2016 at 9:22 PM, Alexander Duyck
>>>>> <alexander.du...@gmail.com> wrote:
>>>>>
>>>>> When you say you disabled flow control, did you disable it on the
>>>>> interface that is dropping packets or on the other end? You might try
>>>>> explicitly disabling it on the interface that is dropping packets;
>>>>> that in turn should enable per-queue drops instead of putting
>>>>> back-pressure onto the Rx FIFO.
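With pause frames off, the distinction Alex draws can be watched directly in the driver statistics. A sketch, with the interface name taken from the thread; since the real ethtool call needs the live NIC, the filtered output is mocked here purely for illustration (the counter values are invented):

```shell
# On the live box: ethtool -S p1p1 | grep -E 'rx_missed_errors|rx_no_dma_resources'
# Mocked sample of what the filtered statistics look like:
sample='rx_packets: 987654321
rx_missed_errors: 12345
rx_no_dma_resources: 0'
echo "$sample" | grep -E 'rx_missed_errors|rx_no_dma_resources'
```

rx_missed_errors climbing while rx_no_dma_resources stays at zero is exactly the pattern discussed in this thread: the FIFO on the card overflows even though descriptors are available.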
>>>>> With flow control disabled on the local port, you should see
>>>>> rx_no_dma_resources start incrementing if the issue is that one of the
>>>>> Rx rings is not keeping up.
>>>>>
>>>>> - Alex
>>>>>
>>>>> On Fri, Sep 23, 2016 at 11:09 AM, Michał Purzyński
>>>>> <michalpurzyns...@gmail.com> wrote:
>>>>>
>>>>> xoff was increasing, so I disabled flow control.
>>>>>
>>>>> That's an HP DL360 Gen9, and lspci -vvv tells me the cards are
>>>>> connected to an x8 link, the speed is 5GT/s, and ASPM is disabled.
>>>>>
>>>>> The other error counters are still zero. When I compared rx_packets
>>>>> and rx_missed_errors, it looks like 38% (!!) of packets are getting
>>>>> lost.
>>>>>
>>>>> Unfortunately the HP documentation is a scam; they actively avoid
>>>>> publishing the motherboard layout.
>>>>>
>>>>> Any other place I could look for hints?
>>>>>
>>>>> On Fri, Sep 23, 2016 at 7:01 PM, Alexander Duyck
>>>>> <alexander.du...@gmail.com> wrote:
>>>>>
>>>>> On Fri, Sep 23, 2016 at 1:10 AM, Michał Purzyński
>>>>> <michalpurzyns...@gmail.com> wrote:
>>>>>
>>>>> Hello.
>>>>>
>>>>> On my IDS workload with af_packet I can see rx_missed_errors growing
>>>>> while rx_no_buffer_count does not. Basically every other kind of rx_
>>>>> error counter is 0, including rx_no_dma_resources. It's an 82599-based
>>>>> card.
>>>>>
>>>>> I don't know what to think about that. I went through the ixgbe source
>>>>> code and the 82599 datasheet, and it seems like rx_missed_errors means
>>>>> a new packet overwrote something already in the packet buffer (the
>>>>> FIFO queue on the card) because there was no more space in it.
>>>>> Now, that would happen if there were no place to DMA packets into,
>>>>> but that counter does not grow.
>>>>>
>>>>> Could you point me to where I should be looking for the problem?
>>>>>
>>>>> --
>>>>> Michal Purzynski
>>>>>
>>>>> The Rx missed count will increment if you are not able to receive a
>>>>> packet because the Rx FIFO is full. If you are not seeing any
>>>>> rx_no_dma_resources problems, it might indicate that the problem is
>>>>> not with providing the DMA resources, but a problem on the bus itself.
>>>>> You might want to double check the slot the device is connected to, in
>>>>> order to guarantee that there is an x8 link that supports 5GT/s all
>>>>> the way through to the root complex.
>>>>>
>>>>> - Alex

------------------------------------------------------------------------------
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired