Hi Eric,

We have been doing some more research, and we finally managed to get a fast 
kernel with a new minimal configuration using the latest release of the 2.6.32 
series (2.6.32.60). We compiled what we think is a kernel with the options that 
could affect network performance removed. We used our general 
networking/computing knowledge and reviewed every option, so I'm sure it's not 
perfect, but it should be a better starting point than the one we had. Then we 
did a "make oldconfig" to adapt this ".config" to a 3.4.41 kernel, reviewing 
every new option.
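In case it's useful, the migration steps were roughly like this (the paths below are just illustrative, not our exact layout):

```shell
# Start from the new kernel tree (paths are illustrative)
cd /usr/src/linux-3.4.41

# Reuse the minimal 2.6.32.60 config as the starting point
cp /usr/src/linux-2.6.32.60/.config .

# Prompt only for options that are new in 3.4.41,
# keeping all the answers from the old config
make oldconfig

# Build as usual
make -j"$(nproc)" bzImage modules
```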

But again we have the same situation: the 2.6.32 kernel seems to be extremely 
fast (we have successfully routed more than 2Mpps) while 3.4 is not able to 
route more than 500Kpps, and again when we do a "perf top" we see 
_raw_spin_lock_irqsave consuming a lot of CPU.

BTW: Remember that for this test we use just one Xeon 5620 in a dual-Xeon 
environment, so 50% total CPU usage means we are using 100% of the cores 
assigned to RSS queues.
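For reference, we pin the RSS queues roughly like this (the IRQ number and queue name below are just examples, not our exact setup); the helper computes the hex bitmask that /proc/irq/*/smp_affinity expects for a single CPU:

```shell
# Turn a CPU number into the hex bitmask format used by smp_affinity.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Example: pin IRQ 65 (a hypothetical eth0 RSS queue) to CPU 4
#   echo "$(cpu_mask 4)" > /proc/irq/65/smp_affinity

cpu_mask 4   # -> 10  (CPU 4 = bit 4 = 0x10)
cpu_mask 7   # -> 80  (CPU 7 = bit 7 = 0x80)
```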

Kernel 3.4.41 CLEAN 

Packet generator: Two bonesi instances, each sending about 500Kpps from 50K 
sources. 
The machine reaches 100% CPU usage on the cores with RSS queues assigned, and 
doesn't route more than 500Kpps. 

perf top output: http://pastebin.com/xByZnxL1
perf record output: http://pastebin.com/2idhM3V1

Kernel 2.6.32.60 CLEAN

Packet generator: Two bonesi instances, each sending about 500Kpps from 50K 
sources. 
The machine routes about 1Mpps (bonesi doesn't generate exactly 500Kpps per 
instance, so a small difference is normal).

perf record output: http://pastebin.com/gfvcQNZv
perf top output: http://pastebin.com/8Qpp604p

Note: We use a Cisco 3750G to measure the pps on each port, so we can see the 
number of packets coming in and out (flow control is disabled on the switch 
ports). 
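Just as a sketch of how the numbers could also be cross-checked on the router itself (the interface name and counter values below are illustrative), pps can be estimated from the kernel's own interface counters:

```shell
# Average pps from two packet-counter samples taken <interval> seconds apart.
pps_delta() {
    local before=$1 after=$2 interval=$3
    echo $(( (after - before) / interval ))
}

# Usage against a real interface ("eth0" is illustrative):
#   a=$(cat /sys/class/net/eth0/statistics/rx_packets)
#   sleep 10
#   b=$(cat /sys/class/net/eth0/statistics/rx_packets)
#   pps_delta "$a" "$b" 10

pps_delta 1000000 11000000 10   # -> 1000000 (i.e. 1Mpps)
```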

As you can see, the main difference is that on kernel 3.4.41 
_raw_spin_lock_irqsave is using 46.4% of the CPU (so 92.8% of the cores 
assigned to RSS queues).

Also, I'm starting to have doubts about this being related to the igb driver... 
but anyway, if you have any idea that could lead us in the right direction it 
would be extremely useful :)

P.S.: I'm not familiar at all with mailing lists, so I just did a "reply all" 
from my e-mail client, but I see you'll get a direct copy of this mail... 
Should it work like that, or should I just reply to the 
[email protected] address? If I did it wrong, please let me know.

And... Thanks for your help :)

Best regards,
Xavier Trilla P.
Silicon Hosting

Don't know Bare Metal Cloud yet?
The evolution of VPS servers has arrived!

More information at: siliconhosting.com/cloud


-----Original Message-----
From: Eric Dumazet [mailto:[email protected]] 
Sent: Tuesday, April 16, 2013 17:42
To: Xavier Trilla
CC: [email protected]; Arnau Marcé
Subject: Re: [E1000-devel] Small UDP packets routing performance...

On Tue, 2013-04-16 at 06:16 +0000, Xavier Trilla wrote:
> Hi,
> 
> This is the first time I post here, because I like to find solutions by 
> myself. But this time I'm running out of ideas. (Well, the reality is 
> that we are running out of time, as at some point our boss will run 
> out of patience if we don't manage to deliver some results :P )
> 
> Our problem is that we are not able at all to replicate the performance we 
> got with a specific kernel one of my colleagues built once. Actually he built 
> that kernel without paying much attention to the options he was using (it was 
> a "fast and dirty" build, and now we are paying the consequences!) and it 
> seems he was extremely lucky (or inspired) that day, as we cannot reproduce 
> the performance that specific kernel delivers. 
> 
> So after 3 weeks running tests we decided that maybe it was about time 
> to ask, so here we are :)
> 
> Ok, so let's begin with the test lab we have set up (I will give you some 
> hardware details, but keep in mind that one kernel delivers about 3x the 
> performance of the others with the exact same HW configuration):
> 
> Router: 
> MB: SuperMicro X8DTN+F
> CPUs: 2 x Xeon 5620
> LANs: Integrated Intel 82576 Dual-Port Gigabit Ethernet Controller
> HyperThreading disabled
> igb driver load parameters: IntMode=2 InterruptThrottleRate=0,0 QueuePairs=0,0 RSS=4,4
> IRQ balance disabled (SMP affinity changed for RSS queues)
> All queues bound to the second CPU (one RSS queue of each adapter bound to each core)
> rp_filter disabled
> ip_forwarding enabled
> iptables modules are NOT loaded
> Machine is just doing IP forwarding across two interfaces
> 
> And basically all the rest is almost default, as we wanted to remove as many 
> variables as possible.
> 
> Receivers/Generators: 
> Xeon 5620 machines using Bonesi as Packet generator (UDP 64 with 50k 
> source addresses)
> 
> And here comes the interesting part: in this scenario, using kernel 2.6.32.27 
> with igb driver 4.1.2 we manage to get around 1.5 Mpps, but with all the other 
> kernels we tried the maximum we get is less than 750 Kpps. So far we have 
> tried kernels 3.0.73, 3.2.43 and 3.4.40 (we still need to try 2.6.34.14, and 
> we are solving a problem with 2.6.32.60 because it doesn't boot; it's probably 
> a problem related to our LSI RAID controller), with no success.
> 
> While investigating this issue (keep in mind that we are more 
> networking/sysadmin guys, and yes, we may have quite good knowledge of 
> Linux, but we are really far away from you guys when it comes to the kernel 
> and networking drivers), the only way we managed to find a difference has 
> been using "perf top" on the machine while running 2.6.32.27 and the other 
> 3.x kernels, and the main difference we found has been:
> 
> Kernel 2.6.32.27: 
> 
> The top consuming function is "igb_poll". As I understand it, as the network 
> is under heavy load the kernel starts operating the interface in NAPI polling 
> mode, so everything seems to be normal and performance is really good.
> 
> Kernel 3.4.40 (we have seen similar behaviour on other kernels):
> 
> Here things look completely different, and _raw_spin_lock_irqsave is 
> consuming 58% of the resources. (Quite big, isn't it?)
> 
> With my really limited understanding, I guess this is a lock that spins, 
> and that might be the reason for the performance difference among 
> kernels. But as _raw_spin_lock_irqsave is a commonly called function, 
> we are not close at all to identifying the real reason for the 
> performance degradation or how to avoid it.
> 
> So, does anybody have any idea about why we see this massive 
> difference in performance? (Or at least an idea that could lead us to 
> the answer...)
> 
> And a few more questions (just in case nobody knows the answer to the 
> previous one): 
> 
> - Do you think it is kernel- or driver-related? (We realized the igb driver 
> configures itself depending on the kernel version, so we are not sure.)
> - Any extremely important parameters we might be forgetting when compiling 
> the kernel? 
> - Any documentation you consider we should read? (BTW, we have seen 
> Intel's results when forwarding packets with Nehalem CPUs... but some 
> information about how you achieve those astonishing results would be 
> really appreciated :))
> 
> Thanks for your time!

Use "perf record -a -g sleep 10 ; perf report" instead of "perf top":
we'll catch the call graphs.




_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit 
http://communities.intel.com/community/wired
