Thanks for the answer!

We checked the PCIe possibility, but our current environment does not cross
the bus, as we use only one NUMA node. We do also see that if we cross it
(using remote NUMA nodes), the performance is slightly lower.

For now, we will reduce the number of queues to support 10G. But do you
know what kinds of processing overhead exist for RSS? Or does processing
RSS itself for multiple queues carry the overhead?
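In case it helps frame the question: as I understand it, per packet RSS computes a Toeplitz hash over the flow tuple and uses the low bits to index a 128-entry redirection table that picks the queue. Below is a minimal software sketch of that computation; the key and table contents are arbitrary placeholders, not what the 82599 is actually programmed with, and the NIC of course does all of this in hardware.

```c
#include <stdint.h>
#include <stddef.h>

#define RSS_KEY_LEN 40   /* the 82599 uses a 40-byte RSS hash key */
#define RETA_SIZE  128   /* and a 128-entry redirection table (RETA) */

/* Toeplitz hash over the flow-tuple bytes: for every set input bit,
 * XOR in the 32-bit window of the key starting at that bit offset. */
static uint32_t toeplitz_hash(const uint8_t key[RSS_KEY_LEN],
                              const uint8_t *data, size_t len)
{
    uint64_t win = 0;                /* sliding 64-bit window over the key */
    for (int i = 0; i < 8; i++)
        win = (win << 8) | key[i];

    uint32_t hash = 0;
    size_t next_key_byte = 8;
    for (size_t i = 0; i < len; i++) {
        for (int bit = 7; bit >= 0; bit--) {
            if (data[i] & (1u << bit))
                hash ^= (uint32_t)(win >> 32);
            win <<= 1;
        }
        if (next_key_byte < RSS_KEY_LEN)
            win |= key[next_key_byte++];   /* refill the emptied low byte */
    }
    return hash;
}

/* Queue selection: the low 7 bits of the hash index the RETA. */
static unsigned rss_queue(uint32_t hash, const uint8_t reta[RETA_SIZE])
{
    return reta[hash & (RETA_SIZE - 1)];
}

/* Self-check with placeholder key/RETA values: the hash must be
 * deterministic, a one-bit change in the tuple must change it, and the
 * chosen queue must stay within the configured queue count. */
static int rss_selftest(void)
{
    uint8_t key[RSS_KEY_LEN], reta[RETA_SIZE];
    for (int i = 0; i < RSS_KEY_LEN; i++) key[i]  = 0x6d;            /* placeholder */
    for (int i = 0; i < RETA_SIZE;  i++) reta[i] = (uint8_t)(i % 6); /* 6 queues */

    /* 12-byte IPv4 tuple: src ip, dst ip, src port, dst port */
    uint8_t t1[12] = {10,0,0,1, 10,0,0,2, 0x1f,0x90, 0x00,0x50};
    uint8_t t2[12] = {10,0,0,1, 10,0,0,3, 0x1f,0x90, 0x00,0x50};

    uint32_t h1 = toeplitz_hash(key, t1, sizeof t1);
    if (h1 != toeplitz_hash(key, t1, sizeof t1)) return 0;
    if (h1 == toeplitz_hash(key, t2, sizeof t2)) return 0;
    if (rss_queue(h1, reta) >= 6) return 0;
    return 1;
}
```

So the hash computation itself should be free on the host; what we are trying to understand is where the host-side cost of spreading one 10G stream across queues comes from.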

We ran another test which showed that calling usleep() between packet reads
reduces the packet drop ratio.
For example, we receive up to 64 packets at a time.
If we add a small delay such as usleep() or some other time-consuming
operation between reads, packet drops become almost zero.
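For concreteness, the loop we are testing looks roughly like the sketch below. rx_loop() and stub_full_batch() are made-up names standing in for the real PSIO receive path; the only point is where the usleep() pacing goes relative to the batched reads.

```c
#define _DEFAULT_SOURCE          /* for usleep() */
#include <unistd.h>

#define BATCH_MAX 64             /* up to 64 packets per read, as in our test */

/* Callback standing in for the real batch-receive call (in PSIO, the
 * call that fills a chunk); returns the number of packets delivered. */
typedef int (*recv_batch_fn)(void);

/* Receive loop with optional pacing: after each batch, sleep pace_us
 * microseconds before the next read. pace_us = 0 is the original tight
 * loop; a small nonzero value reproduces the "usleep() between reads"
 * experiment that made drops go to almost zero. */
static long rx_loop(int iterations, int pace_us, recv_batch_fn recv_batch)
{
    long total = 0;
    for (int i = 0; i < iterations; i++) {
        total += recv_batch();            /* up to BATCH_MAX packets */
        if (pace_us > 0)
            usleep((useconds_t)pace_us);  /* pacing between reads */
    }
    return total;
}

/* Deterministic stub in place of the NIC: always a full batch. */
static int stub_full_batch(void) { return BATCH_MAX; }
```

With the stub, rx_loop(10, 0, stub_full_batch) counts 640 packets either way; on real hardware, only the paced variant avoids drops, which is the part we cannot explain.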

I have spent almost a month desperately trying to find the answer, but my
efforts were in vain.
Do you think that this problem can be solved at the software level, or is
it unsolvable considering hardware characteristics?

Regards,
Shinae

On Fri, Mar 22, 2013 at 12:21 AM, Fujinaka, Todd <[email protected]> wrote:

> The simple answer is that you are running a stress test and there is some
> finite processing being done separating traffic into queues which is
> showing some performance impact on your setup. At that rate you could also
> be running into limitations of the PCIe bus and the added latency of going
> across the QPI bus if you're using cores on remote CPUs.
>
> Depending on the driver and kernel version, we've seen best small packet
> performance at around 3 queues. Adding queues increases the processing
> required to separate the traffic into queues which, at some point, will not
> gain you anything. Since you're able to do 10G, it sounds like most of the
> system is properly working and the trick is to tune the system for whatever
> it is you're planning to use it for.
>
> Thanks.
>
> Todd Fujinaka
> Software Application Engineer
> Networking Division (ND)
> Intel Corporation
> [email protected]
> (503) 712-4565
>
>
> -----Original Message-----
> From: Shinae Woo [mailto:[email protected]]
> Sent: Wednesday, March 20, 2013 8:59 PM
> To: [email protected]
> Subject: [E1000-devel] Low receive performance with multiple RSS queue
>
> Hello, all,
>
>  We're observing a weird problem with receive-side scaling (RSS) on a
> 82599-based Intel NIC. In a nutshell, we do not see 10Gbps for 64B packet
> RX if we configure the NIC to use multiple (>1) RSS hardware queues while
> we _do_ see 10Gbps with a single RSS queue (with one CPU core). For a
> packet size larger than 80 bytes, we achieve a line rate for packet RX
> (e.g., 10Gbps) regardless of the number of RSS queues. I'm wondering if
> this is a hardware problem or if we missed anything in the driver. We use
> a modified ixgbe driver (the PacketShader IO engine) to bypass severe
> kernel-level memory-management overhead at small packet sizes, and use
> batch processing of received packets (like NAPI). What we observe is
> summarized as follows:
>
>  * With 1 RSS queue on 1 CPU core, we do not see a single packet loss
> at an input rate of 10Gbps with all packets 64 bytes
>  * With 6 RSS queues with 6 CPU cores, we see up to 10% packet drops at
> the NIC (64B packets)
>      - The loss rate increases as we increase # of RSS queues from 2 to 6
>      - However, even when we see packet drops, rx_descriptor is almost
> always empty (not full).
>
>
> We use a machine with two Intel Xeon X5690 CPUs and an Intel NIC with
> the 82599 chipset (Linux 2.6.32-42-server, Ubuntu 12.04). The PacketShader
> IO engine is based on ixgbe 2.0.38.2 (PSIO:
> http://shader.kaist.edu/packetshader/io_engine/index.html), and even after
> porting PSIO to ixgbe 3.12 we see a similar problem. Also, similar IO
> libraries like netmap (http://info.iet.unipi.it/~luigi/netmap/), and
> PF_RING (http://www.ntop.org/products/pf_ring/) show the same trend: the
> packet drop rate increases as we increase the number of RSS queues and use
> more CPU cores, which is counterintuitive since more CPU cores should
> improve the performance. This is why we suspect that the problem is
> somewhat related to the hardware (RSS) when the packet size is small (< 80
> bytes).
>
>
>  We have attached a file that shows the packet RX performance of PSIO,
> NetMap, and PF_RING with various packet sizes and numbers of RSS queues.
>
>     PSIO-0.2 (based on ixgbe 2.0.38.2)
>     PF_RING-5.5.2 (based on ixgbe-3.11.33)
>     NetMap-20120813 (based on ixgbe in kernel 3.2.9-k2)
>
>  Please let us know if you have experienced a similar problem or have any
> clue what's going on. We'll greatly appreciate your help.
>
>  Regards,
>  Shinae Woo
>
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
