Thank you a lot! I think there's value in making sure the driver/card/BIOS/kernel level is tuned correctly. I have learned a lot in the process.
A gold standard for Suricata configuration will follow, so that this knowledge is not forgotten. I'm in touch with the Suricata developers. This whole thread started with me being surprised that despite a growing rx_missed_errors counter (growing to something like 38% of all packets received) there were no DMA errors or anything like that. Now we are at 1% rx_missed and at least know how to troubleshoot it. Excellent.

What helped most was making sure that each card sends interrupts to a separate CPU, that the NUMA configuration is correct, that processes are pinned, and that the cpufreq governor is set to performance. Smaller things, like disabling ASPM (so the PCIe link does not go away from under the card at the worst moment) and keeping the CPUs somewhere near C0/C1, also helped. (I'll put a short command summary at the bottom of this mail so it ends up in the archives.)

af_packet copies data out of the skbuff, so there are technically two memcpy() calls if I understand that correctly: driver buffers -> skbuff -> af_packet buffers. The af_packet buffers are later mapped into userspace, so no additional copying occurs. I'll ask if that can be optimized even further.

I need to find some time to check out the changed RSS hash key and see how it performs, also in terms of packet reordering. There were two problems with RSS:

1. a non-symmetric hash (easy to change)
2. packet reordering introduced with RSS, even when the hash is symmetric

What does ATR=0 do and why is it necessary? Also, that's the last question in this series.

Thank you a lot, the Suricata community really appreciates Intel's help :-)

On Mon, Sep 26, 2016 at 3:46 AM, Alexander Duyck <alexander.du...@gmail.com> wrote:
> Okay, I'll just paste the bits here that I think are relevant.
> Specifically the symbols that are at or above 0.5% CPU utilization.
>
> # Overhead       sys       usr  Command      Shared Object        Symbol
> # ........  ........  ........  ...........  ...................  ...............................
> #
>     12.91%    12.91%     0.00%  swapper      [kernel.kallsyms]    [k] tpacket_rcv
>     11.22%    11.22%     0.00%  swapper      [kernel.kallsyms]    [k] memcpy_erms
>      4.03%     0.00%     4.03%  W#01-p1p1    libhs.so.4.2.0       [.] fdr_engine_exec
>      3.17%     3.17%     0.00%  W#01-p1p1    [kernel.kallsyms]    [k] tpacket_rcv
>      3.10%     3.10%     0.00%  W#01-p1p1    [kernel.kallsyms]    [k] memcpy_erms
>      2.65%     2.65%     0.00%  swapper      [kernel.kallsyms]    [k] __netif_receive_skb_core
>      2.61%     0.00%     2.61%  W#01-p1p1    libhs.so.4.2.0       [.] nfaExecMcClellan16_B
>      2.41%     2.41%     0.00%  swapper      [kernel.kallsyms]    [k] ixgbe_clean_rx_irq
>      2.40%     2.40%     0.00%  swapper      [kernel.kallsyms]    [k] mwait_idle
>      1.91%     0.00%     1.91%  W#01-p1p1    libc-2.19.so         [.] memset
>      1.52%     1.52%     0.00%  swapper      [kernel.kallsyms]    [k] consume_skb
>      1.29%     1.29%     0.00%  swapper      [kernel.kallsyms]    [k] __skb_get_hash
>      1.18%     1.18%     0.00%  swapper      [kernel.kallsyms]    [k] prb_fill_curr_block.isra.59
>      1.09%     1.09%     0.00%  swapper      [kernel.kallsyms]    [k] __skb_flow_dissect
>      1.06%     1.06%     0.00%  swapper      [kernel.kallsyms]    [k] __build_skb
>      1.04%     1.04%     0.00%  swapper      [kernel.kallsyms]    [k] packet_rcv
>      0.89%     0.89%     0.00%  swapper      [kernel.kallsyms]    [k] irq_entries_start
>      0.82%     0.82%     0.00%  ksoftirqd/0  [kernel.kallsyms]    [k] memcpy_erms
>      0.78%     0.78%     0.00%  ksoftirqd/0  [kernel.kallsyms]    [k] tpacket_rcv
>      0.72%     0.72%     0.00%  swapper      [kernel.kallsyms]    [k] skb_copy_bits
>      0.71%     0.00%     0.71%  W#01-p1p1    suricata             [.] SigMatchSignatures
>      0.69%     0.00%     0.69%  W#01-p1p1    libc-2.19.so         [.] malloc
>      0.66%     0.66%     0.00%  W#01-p1p1    [kernel.kallsyms]    [k] ixgbe_clean_rx_irq
>      0.63%     0.63%     0.00%  W#01-p1p1    [kernel.kallsyms]    [k] __netif_receive_skb_core
>      0.51%     0.51%     0.00%  swapper      [kernel.kallsyms]    [k] kfree_skb
>
> So looking over what you sent me, it doesn't look so much like this is a
> driver issue as the kernel overhead for processing these frames is pretty
> significant, with at least something like 25% of the CPU time being spent
> handling tpacket_rcv or a memcpy in order to service tpacket_rcv. I haven't
> had much experience with Suricata, but you might want to try checking with
> experts on that if you haven't already, as it seems like some significant
> CPU time is getting consumed in the kernel/userspace handoff. If nothing
> else you might try bringing up questions on how to improve raw socket
> performance on the netdev mailing list.
>
> I just remembered that you disabled RSS. That is the reason why you are not
> seeing any rx_no_dma_resources errors. In order for packets to be dropped
> per ring you have to have more than 1 ring enabled. I did some quick
> googling on why Suricata might not support RSS and I guess it has to do
> with Tx and Rx traffic not ending up on the same queue. That is actually
> pretty easy to fix. All you would need to do is pass the module parameter
> ATR=0 in order to disable ATR and change the RSS key on the device to use a
> 16 bit repeating value. You can find a paper detailing some of that here:
> http://www.ndsl.kaist.edu/~kyoungsoo/papers/TR-symRSS.pdf
>
> Other than these tips I don't know if there is much more info I can
> provide. It looks like you will need to add more CPU power in order to be
> able to handle the load, as you are currently maxing out the one thread you
> are using.
>
> - Alex
>
> On Sun, Sep 25, 2016 at 4:47 PM, Michał Purzyński
> <michalpurzyns...@gmail.com> wrote:
> >
> > Sent off list, because files are around a MB.
> >
> > On Mon, Sep 26, 2016 at 1:28 AM, Alexander Duyck
> > <alexander.du...@gmail.com> wrote:
> >>
> >> If you can just send me the output from "perf report" it would be more
> >> useful. The problem is the raw data you sent me doesn't do me any good
> >> without the symbol tables and such, and those would be too large to be
> >> sending over email.
> >>
> >> What I am basically looking for is a dump with the symbol names that are
> >> taking up the CPU time. From there I can probably start to understand
> >> what is going on.
> >>
> >> - Alex
> >>
> >> On Sun, Sep 25, 2016 at 4:09 PM, Michał Purzyński
> >> <michalpurzyns...@gmail.com> wrote:
> >>>
> >>> perf record (and perf top) shows interesting results indeed. For one,
> >>> there was some lock function with _slowpath_ in its name; with perf top
> >>> -g I quickly traced it to cpufreq, ended up setting the performance
> >>> governor, and that slowpath call is gone now.
> >>>
> >>> Some rx_missed are still here. Much less, but traffic is also far from
> >>> what it is on weekdays. Below you will find links to perf.data and the
> >>> results of perf script -D (let me know if I got it wrong).
> >>>
> >>> https://drive.google.com/file/d/0B4XJBHc9i84dRXU5eE5FRFBsVUU/view?usp=sharing
> >>> https://drive.google.com/file/d/0B4XJBHc9i84dd2ZSREUtN2Z4dDQ/view?usp=sharing
> >>>
> >>> I made triple sure that VT-d is disabled, so the IOMMU is gone with it,
> >>> from day one I received this server.
> >>> On Sun, Sep 25, 2016 at 8:21 PM, Alexander Duyck
> >>> <alexander.du...@gmail.com> wrote:
> >>>>
> >>>> You probably don't need to bother with disabling any other prefetchers
> >>>> or anything like that.
> >>>>
> >>>> One thing that did occur to me is that when you are running your test
> >>>> you might try to capture a perf trace on the core that the interrupt is
> >>>> running on. All you need to do to capture that is just run
> >>>> "perf record -C <cpu num> sleep 20" while your test is running. Then
> >>>> dump perf report to a logfile of your choice and send us the results.
> >>>> That should help us to identify any hot spots that might be eating any
> >>>> extra CPU time.
> >>>>
> >>>> Also, when you are in the BIOS you might try looking to see if you have
> >>>> an IOMMU or VT-d feature enabled. If you do, you might want to try
> >>>> disabling it to see if that gives you any performance boost. If so, you
> >>>> could try booting with the kernel parameter iommu=pt, which should
> >>>> switch the system over to identity mapping the device onto the system,
> >>>> which would save you some considerable time.
> >>>>
> >>>> On Sun, Sep 25, 2016 at 9:40 AM, Michał Purzyński
> >>>> <michalpurzyns...@gmail.com> wrote:
> >>>>>
> >>>>> Yes, I have all kinds of offloads disabled. I'll ask HP to provide a
> >>>>> detailed connection scheme, the one they are avoiding so much in the
> >>>>> server manual.
> >>>>>
> >>>>> Supermicro publishes it all. Go figure.
> >>>>>
> >>>>> Btw, how should the prefetching be configured so as not to interfere
> >>>>> with DCA?
> >>>>>
> >>>>> Here's what I have. Should I disable the HW prefetcher and Adjacent
> >>>>> Sector Prefetch? Anything more?
> >>>>>
> >>>>> On 25 Sep 2016, at 03:55, Alexander Duyck <alexander.du...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>> On Sat, Sep 24, 2016 at 4:40 PM, Michał Purzyński
> >>>>> <michalpurzyns...@gmail.com> wrote:
> >>>>>
> >>>>> Thank you for being persistent with answers.
> >>>>>
> >>>>> So right after sending the previous email I noticed that I had left
> >>>>> over some careless IRQ assignments after experimenting with IRQ and
> >>>>> process CPU affinity. Both cards were hitting the same core, which
> >>>>> (for the second card) was on a different NUMA node, plus that core was
> >>>>> saturated.
> >>>>>
> >>>>> The result was around 38% packets lost, calculated by comparing
> >>>>> packets received with rx_missed. It's interesting that no other
> >>>>> counter was increasing.
> >>>>>
> >>>>> Right now I have moved card 0 to core 0 and card 1 to the first core
> >>>>> of the second CPU.
> >>>>>
> >>>>> Now the rx_missed is around 6-7% for each card. Still way too much.
> >>>>>
> >>>>> I send a total of 8-11 Gbit/sec to both cards, so each receives around
> >>>>> half of that. Packet rate is 1.2 Mpps top (also total). All kinds of
> >>>>> packet sizes.
> >>>>>
> >>>>> So if you are doing packet analysis I assume you don't need LRO or
> >>>>> GRO. If not, you may want to look into disabling them via "ethtool
> >>>>> -K". I know RSC can sometimes cause packet drops due to aggregating a
> >>>>> number of frames before finally submitting them to the device,
> >>>>> although that usually required ASPM to be enabled as well.
> >>>>>
> >>>>> I'll lower rings to 512 as the next step. Good to know about the
> >>>>> card's limitations.
> >>>>> Given that InterruptThrottleRate has to be given in a 'number of
> >>>>> interrupts / second', what would you recommend I set it to, for a
> >>>>> start at least? I have a 2.6 GHz Xeon E5 v3.
> >>>>>
> >>>>> So I would recommend a value no less than 12500 for
> >>>>> InterruptThrottleRate. Assuming a reasonable packet rate, that should
> >>>>> give you a decent trade-off in terms of performance versus latency.
> >>>>>
> >>>>> I'll buy a pair of X710 for a test as well. It will be an interesting
> >>>>> comparison. Who knows, maybe the RSS implementation and MQ there are
> >>>>> good enough to be used for IDS.
> >>>>>
> >>>>> Fortunately I don't run it inline; this server receives a copy of the
> >>>>> traffic.
> >>>>>
> >>>>> I'm not sure if it will get you much more throughput or not. I still
> >>>>> find it odd that you're dropping packets even though the device isn't
> >>>>> complaining about not having ring buffer resources. Usually that
> >>>>> points to a bottleneck somewhere in the PCIe bus. You might want to
> >>>>> double check and verify that the devices are connected directly to the
> >>>>> root complex and not some secondary bus on a PCIe switch that is
> >>>>> actually downgrading the link between the device and the CPU socket.
> >>>>>
> >>>>> On Sat, Sep 24, 2016 at 3:20 AM, Alexander Duyck
> >>>>> <alexander.du...@gmail.com> wrote:
> >>>>>
> >>>>> Well, as a general rule anything over about 80 usecs for
> >>>>> InterruptThrottleRate is a waste. One advantage to reducing the
> >>>>> interrupt throttle rate is you can reduce the ring size, and you might
> >>>>> see a slight performance improvement. One problem with using 4096
> >>>>> descriptors is that it greatly increases the cache footprint and leads
> >>>>> to more buffer bloat and cache thrash, as you have to evict old
> >>>>> descriptors to pull in new ones. I'm also sure that if you are doing
> >>>>> an intrusion detection system (I'm assuming that is what IDS is in
> >>>>> reference to), then the users would appreciate it if you didn't add up
> >>>>> to a half dozen extra milliseconds of latency to their network (worst
> >>>>> case with an elephant flow of 1514 byte frames).
> >>>>>
> >>>>> What size packets is it you are working with? One limitation of the
> >>>>> 82599 is that it can only handle an upper limit of somewhere around
> >>>>> 12 Mpps if you are using something like 6 queues, and only a little
> >>>>> over 2 for a single queue. If you exceed 12 Mpps then the part will
> >>>>> start reporting rx_missed, because the PCIe overhead for moving 64
> >>>>> byte packets is great enough that it actually causes us to exceed the
> >>>>> limits of the x8 gen2 link. If the memcpy is what I think it is, then
> >>>>> it allows us to avoid having to do two different atomic operations
> >>>>> that would have been more expensive otherwise.
> >>>>>
> >>>>> On Fri, Sep 23, 2016 at 12:46 PM, Michał Purzyński
> >>>>> <michalpurzyns...@gmail.com> wrote:
> >>>>>
> >>>>> Here's what I did:
> >>>>>
> >>>>> ethtool -A p1p1 rx off tx off
> >>>>> ethtool -A p3p1 rx off tx off
> >>>>>
> >>>>> Both ethtool -a <interface> and the Arista that's pumping data show
> >>>>> that RX/TX pause are disabled.
> >>>>> I have two cards, each connected to a separate NUMA node, threads
> >>>>> pinned, etc.
> >>>>>
> >>>>> One non-standard thing is that I use a single queue only, because any
> >>>>> form of multiqueue leads to packet reordering and confuses the IDS. An
> >>>>> issue that's been hidden for a while in the NSM community.
> >>>>>
> >>>>> The driver (from SourceForge) was loaded with MQ=0 DCA=2 RSS=1 VMDQ=0
> >>>>> InterruptThrottleRate=956 FCoE=0 LRO=0 vxvlan_rx=0 (each option's
> >>>>> value given enough times so it applies to all cards in this system).
> >>>>>
> >>>>> I could see the same issue sending traffic to just one card.
> >>>>>
> >>>>> Of course a single core is swamped with ACK-ing the hardware IRQ and
> >>>>> then doing softIRQ work (which seems to be mostly memcpy?). But then
> >>>>> again, I don't see errors about lacking buffers (I run with 4096
> >>>>> descriptors).
> >>>>>
> >>>>> On Fri, Sep 23, 2016 at 9:22 PM, Alexander Duyck
> >>>>> <alexander.du...@gmail.com> wrote:
> >>>>>
> >>>>> When you say you disabled flow control, did you disable it on the
> >>>>> interface that is dropping packets or the other end? You might try
> >>>>> explicitly disabling it on the interface that is dropping packets;
> >>>>> that in turn should enable per-queue drop instead of putting
> >>>>> back-pressure onto the Rx FIFO.
> >>>>>
> >>>>> With flow control disabled on the local port you should see
> >>>>> rx_no_dma_resources start incrementing if the issue is that one of the
> >>>>> Rx rings is not keeping up.
> >>>>>
> >>>>> - Alex
> >>>>>
> >>>>> On Fri, Sep 23, 2016 at 11:09 AM, Michał Purzyński
> >>>>> <michalpurzyns...@gmail.com> wrote:
> >>>>>
> >>>>> xoff was increasing, so I disabled flow control.
> >>>>>
> >>>>> That's an HP DL360 Gen9, and lspci -vvv tells me the cards are
> >>>>> connected to an x8 link, speed is 5 GT/s, and ASPM is disabled.
> >>>>>
> >>>>> Other error counters are still zero. When I compared rx_packets and
> >>>>> rx_missed_errors it looks like 38% (!!) of packets are getting lost.
> >>>>>
> >>>>> Unfortunately the HP documentation is a scam and they actively avoid
> >>>>> publishing the motherboard layout.
> >>>>>
> >>>>> Any other place I could look for hints?
> >>>>>
> >>>>> On Fri, Sep 23, 2016 at 7:01 PM, Alexander Duyck
> >>>>> <alexander.du...@gmail.com> wrote:
> >>>>>
> >>>>> On Fri, Sep 23, 2016 at 1:10 AM, Michał Purzyński
> >>>>> <michalpurzyns...@gmail.com> wrote:
> >>>>>
> >>>>> Hello.
> >>>>>
> >>>>> On my IDS workload with af_packet I can see rx_missed_errors growing
> >>>>> while rx_no_buffer_count does not. Basically every other kind of rx_
> >>>>> error counter is 0, including rx_no_dma_resources. It's an 82599 based
> >>>>> card.
> >>>>>
> >>>>> I don't know what to think about that.
> >>>>> I went through the ixgbe source code and the 82599 datasheet, and it
> >>>>> seems like rx_missed_error means a new packet overwrote something
> >>>>> already in the packet buffer (the FIFO queue on the card) because
> >>>>> there was no more space in it.
> >>>>>
> >>>>> Now, that would happen if there is no place to DMA packets into - but
> >>>>> that counter does not grow.
> >>>>>
> >>>>> Could you point me to where I should be looking for a problem?
> >>>>>
> >>>>> --
> >>>>> Michal Purzynski
> >>>>>
> >>>>> The Rx missed count will increment if you are not able to receive a
> >>>>> packet because the Rx FIFO is full. If you are not seeing any
> >>>>> rx_no_dma_resources problems, it might indicate that the problem is
> >>>>> not with providing the DMA resources, but a problem on the bus itself.
> >>>>> You might want to double check the slot the device is connected to in
> >>>>> order to guarantee that there is an x8 link that supports 5 GT/s all
> >>>>> the way through to the root complex.
> >>>>>
> >>>>> - Alex
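PS, mostly for whoever digs this thread out of the archives later: here is roughly what the host-side tuning above boils down to on my boxes. Interface names, IRQ numbers and CPU numbers are placeholders for my setup, so treat it as a sketch to adapt rather than a recipe.

  # find the IRQ number each card is using
  grep -E 'p1p1|p3p1' /proc/interrupts

  # pin each card's interrupt to a core on its own NUMA node
  # (123 and 145 are made-up IRQ numbers; 0 and 14 are the cores I happen to use)
  echo 0  > /proc/irq/123/smp_affinity_list
  echo 14 > /proc/irq/145/smp_affinity_list

  # run every core at full clock
  for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
      echo performance > "$g"
  done

  # keep ASPM and deep C-states out of the way (kernel command line)
  #   pcie_aspm=off intel_idle.max_cstate=1 processor.max_cstate=1

The worker processes are pinned the same way, to cores on the same NUMA node as the card they read from (via the cpu-affinity settings in suricata.yaml).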
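The symmetric-RSS experiment Alex suggests should look roughly like the lines below once I go back to more than one queue, assuming the driver/ethtool combination lets me set the key from userspace (otherwise it has to be changed in the driver source). The 6d:5a pattern is the 16-bit repeating key from the symRSS paper, the RSS/MQ values are placeholders, and the ATR parameter spelling may differ between driver versions, so check modinfo ixgbe first.

  # reload the out-of-tree driver with multiple queues and ATR disabled
  # (parameter name as Alex gave it; some versions call it AtrSampleRate)
  rmmod ixgbe
  modprobe ixgbe RSS=4,4 MQ=1,1 ATR=0,0 InterruptThrottleRate=12500,12500

  # program the symmetric key: 0x6d5a repeated over the 40-byte key the 82599 uses
  # (ethtool -x shows the expected key length)
  ethtool -X p1p1 hkey 6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a:6d:5a

  # verify key and indirection table
  ethtool -x p1p1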
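And for completeness, the sanity checks I keep rerunning while testing (the PCI address and CPU number are placeholders for my setup):

  # drop counters per card: rx_missed vs rx_no_dma_resources vs rx_no_buffer
  ethtool -S p1p1 | grep -E 'rx_missed|rx_no_dma|rx_no_buffer'

  # pause frames really off on the receiving side?
  ethtool -a p1p1

  # link width/speed negotiated all the way to the root complex
  lspci -vvv -s 04:00.0 | grep -E 'LnkCap|LnkSta'

  # 20-second profile of the core servicing the card's interrupt
  perf record -C 0 -g sleep 20
  perf report --stdio > perf-cpu0.txt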