Andrew,
on the other 710 box (the one from the user with problems) the numbers are:

[khobdeh@wrshrk-do4 Scripts]$ numademo 128M memcpy
2 nodes available
memory with no policy memcpy              Avg 5762.69 MB/s Max 5769.58 MB/s Min 
5758.19 MB/s
local memory memcpy                       Avg 5564.86 MB/s Max 5571.51 MB/s Min 
5558.13 MB/s
memory interleaved on all nodes memcpy    Avg 5542.73 MB/s Max 5551.23 MB/s Min 
5537.03 MB/s
memory on node 0 memcpy                   Avg 11055.01 MB/s Max 11065.85 MB/s 
Min 11044.91 MB/s
memory on node 1 memcpy                   Avg 5416.99 MB/s Max 5421.19 MB/s Min 
5411.35 MB/s
memory interleaved on 0 1 memcpy          Avg 6657.30 MB/s Max 6666.55 MB/s Min 
6648.72 MB/s
setting preferred node to 0
memory without policy memcpy              Avg 6712.97 MB/s Max 6722.31 MB/s Min 
6702.84 MB/s
setting preferred node to 1
memory without policy memcpy              Avg 5619.31 MB/s Max 5625.93 MB/s Min 
5614.63 MB/s
manual interleaving to all nodes memcpy   Avg 6615.03 MB/s Max 6641.81 MB/s Min 
6473.63 MB/s
manual interleaving on node 0/1 memcpy    Avg 6633.18 MB/s Max 6636.56 MB/s Min 
6628.36 MB/s
current interleave node 0
running on node 0, preferred node 0
local memory memcpy                       Avg 6654.09 MB/s Max 6660.27 MB/s Min 
6649.05 MB/s
memory interleaved on all nodes memcpy    Avg 6657.69 MB/s Max 6664.57 MB/s Min 
6651.69 MB/s
memory interleaved on node 0/1 memcpy     Avg 6633.05 MB/s Max 6639.51 MB/s Min 
6624.44 MB/s
alloc on node 1 memcpy                    Avg 5420.20 MB/s Max 5423.82 MB/s Min 
5417.03 MB/s
local allocation memcpy                   Avg 6641.88 MB/s Max 6649.71 MB/s Min 
6635.24 MB/s
setting wrong preferred node memcpy       Avg 5623.92 MB/s Max 5627.58 MB/s Min 
5620.27 MB/s
setting correct preferred node memcpy     Avg 6653.27 MB/s Max 6663.24 MB/s Min 
6642.80 MB/s
running on node 1, preferred node 0
local memory memcpy                       Avg 7789.23 MB/s Max 7801.09 MB/s Min 
7769.48 MB/s
memory interleaved on all nodes memcpy    Avg 6357.53 MB/s Max 6361.03 MB/s Min 
6352.60 MB/s
memory interleaved on node 0/1 memcpy     Avg 6430.76 MB/s Max 6434.83 MB/s Min 
6427.74 MB/s
alloc on node 0 memcpy                    Avg 5422.83 MB/s Max 5427.54 MB/s Min 
5417.90 MB/s
local allocation memcpy                   Avg 7782.09 MB/s Max 7788.87 MB/s Min 
7771.28 MB/s
setting wrong preferred node memcpy       Avg 5436.43 MB/s Max 5441.85 MB/s Min 
5432.16 MB/s
setting correct preferred node memcpy     Avg 7779.38 MB/s Max 7788.87 MB/s Min 
7763.64 MB/s

So my conclusions are:
- these two Dell R710s are very different;
- the issues show up on the boxes with low MB/s, and not on the faster boxes.
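
A quick way to compare firmware across the two boxes (just a sketch, assuming
dmidecode is available on both):

    dmidecode -s bios-version            # BIOS revision, to compare between the boxes
    dmidecode -s bios-release-date
    dmidecode -s system-product-name     # confirms which PowerEdge model it is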

Donald: what do you think?

Luca


On Sep 15, 2011, at 5:30 PM, <andrew_leh...@agilent.com> wrote:

> Hi Donald, Luca
> 
> OK, could this be it???
> 
> R810 first
> 
> # numademo 128M memcpy
> 4 nodes available
> memory with no policy memcpy              Avg 6103.61 MB/s Max 6114.70 MB/s 
> Min 6020.62 MB/s
> local memory memcpy                       Avg 6112.45 MB/s Max 6113.59 MB/s 
> Min 6109.69 MB/s
> memory interleaved on all nodes memcpy    Avg 4596.04 MB/s Max 4598.86 MB/s 
> Min 4592.72 MB/s
> memory on node 0 memcpy                   Avg 4298.76 MB/s Max 4299.65 MB/s 
> Min 4297.44 MB/s
> memory on node 1 memcpy                   Avg 4311.93 MB/s Max 4318.46 MB/s 
> Min 4263.59 MB/s
> memory on node 2 memcpy                   Avg 4224.10 MB/s Max 4230.39 MB/s 
> Min 4174.73 MB/s
> memory on node 3 memcpy                   Avg 6103.11 MB/s Max 6115.82 MB/s 
> Min 6009.03 MB/s
> memory interleaved on 0 1 memcpy          Avg 4272.03 MB/s Max 4274.59 MB/s 
> Min 4270.51 MB/s
> memory interleaved on 0 2 memcpy          Avg 4229.02 MB/s Max 4232.00 MB/s 
> Min 4227.33 MB/s
> memory interleaved on 1 2 memcpy          Avg 4238.95 MB/s Max 4241.36 MB/s 
> Min 4235.47 MB/s
> memory interleaved on 0 1 2 memcpy        Avg 4254.17 MB/s Max 4255.88 MB/s 
> Min 4251.84 MB/s
> memory interleaved on 0 3 memcpy          Avg 5007.41 MB/s Max 5008.31 MB/s 
> Min 5006.44 MB/s
> memory interleaved on 1 3 memcpy          Avg 5015.63 MB/s Max 5017.49 MB/s 
> Min 5014.11 MB/s
> memory interleaved on 0 1 3 memcpy        Avg 4737.35 MB/s Max 4746.87 MB/s 
> Min 4677.06 MB/s
> memory interleaved on 2 3 memcpy          Avg 4966.48 MB/s Max 4967.53 MB/s 
> Min 4965.69 MB/s
> memory interleaved on 0 2 3 memcpy        Avg 4693.85 MB/s Max 4710.88 MB/s 
> Min 4636.51 MB/s
> memory interleaved on 1 2 3 memcpy        Avg 4716.13 MB/s Max 4718.17 MB/s 
> Min 4714.19 MB/s
> memory interleaved on 0 1 2 3 memcpy      Avg 4583.39 MB/s Max 4597.28 MB/s 
> Min 4530.56 MB/s
> setting preferred node to 0
> memory without policy memcpy              Avg 4294.01 MB/s Max 4300.47 MB/s 
> Min 4243.50 MB/s
> setting preferred node to 1
> memory without policy memcpy              Avg 4318.67 MB/s Max 4319.43 MB/s 
> Min 4315.13 MB/s
> setting preferred node to 2
> memory without policy memcpy              Avg 4225.77 MB/s Max 4231.19 MB/s 
> Min 4180.33 MB/s
> setting preferred node to 3
> memory without policy memcpy              Avg 6103.52 MB/s Max 6115.82 MB/s 
> Min 6007.69 MB/s
> manual interleaving to all nodes memcpy   Avg 4590.92 MB/s Max 4599.65 MB/s 
> Min 4531.93 MB/s
> manual interleaving on node 0/1 memcpy    Avg 4267.89 MB/s Max 4275.27 MB/s 
> Min 4216.57 MB/s
> current interleave node 0
> running on node 0, preferred node 0
> local memory memcpy                       Avg 6055.75 MB/s Max 6058.12 MB/s 
> Min 6052.39 MB/s
> memory interleaved on all nodes memcpy    Avg 4620.07 MB/s Max 4634.59 MB/s 
> Min 4576.44 MB/s
> memory interleaved on node 0/1 memcpy     Avg 4991.07 MB/s Max 5009.43 MB/s 
> Min 4869.84 MB/s
> alloc on node 1 memcpy                    Avg 4328.33 MB/s Max 4329.74 MB/s 
> Min 4319.57 MB/s
> alloc on node 2 memcpy                    Avg 4338.42 MB/s Max 4356.30 MB/s 
> Min 4269.15 MB/s
> alloc on node 3 memcpy                    Avg 4318.31 MB/s Max 4326.39 MB/s 
> Min 4302.68 MB/s
> local allocation memcpy                   Avg 6057.60 MB/s Max 6058.94 MB/s 
> Min 6055.12 MB/s
> setting wrong preferred node memcpy       Avg 4326.59 MB/s Max 4329.60 MB/s 
> Min 4317.35 MB/s
> setting correct preferred node memcpy     Avg 6045.33 MB/s Max 6058.67 MB/s 
> Min 5944.89 MB/s
> running on node 1, preferred node 0
> local memory memcpy                       Avg 6069.85 MB/s Max 6070.73 MB/s 
> Min 6068.53 MB/s
> memory interleaved on all nodes memcpy    Avg 4621.28 MB/s Max 4624.05 MB/s 
> Min 4618.80 MB/s
> memory interleaved on node 0/1 memcpy     Avg 5005.25 MB/s Max 5006.82 MB/s 
> Min 4996.56 MB/s
> alloc on node 0 memcpy                    Avg 4314.46 MB/s Max 4321.52 MB/s 
> Min 4256.83 MB/s
> alloc on node 2 memcpy                    Avg 4291.12 MB/s Max 4291.81 MB/s 
> Min 4290.44 MB/s
> alloc on node 3 memcpy                    Avg 4336.58 MB/s Max 4342.77 MB/s 
> Min 4288.24 MB/s
> local allocation memcpy                   Avg 6070.04 MB/s Max 6072.10 MB/s 
> Min 6068.53 MB/s
> setting wrong preferred node memcpy       Avg 4285.92 MB/s Max 4291.40 MB/s 
> Min 4239.75 MB/s
> setting correct preferred node memcpy     Avg 6069.96 MB/s Max 6071.00 MB/s 
> Min 6068.81 MB/s
> running on node 2, preferred node 0
> local memory memcpy                       Avg 6109.00 MB/s Max 6122.51 MB/s 
> Min 5995.61 MB/s
> memory interleaved on all nodes memcpy    Avg 4588.44 MB/s Max 4592.41 MB/s 
> Min 4585.66 MB/s
> memory interleaved on node 0/1 memcpy     Avg 4257.87 MB/s Max 4258.99 MB/s 
> Min 4255.07 MB/s
> alloc on node 0 memcpy                    Avg 4314.82 MB/s Max 4321.66 MB/s 
> Min 4259.66 MB/s
> alloc on node 1 memcpy                    Avg 4262.60 MB/s Max 4269.01 MB/s 
> Min 4216.84 MB/s
> alloc on node 3 memcpy                    Avg 4228.30 MB/s Max 4229.86 MB/s 
> Min 4224.93 MB/s
> local allocation memcpy                   Avg 6109.92 MB/s Max 6122.79 MB/s 
> Min 6011.72 MB/s
> setting wrong preferred node memcpy       Avg 4223.70 MB/s Max 4229.73 MB/s 
> Min 4175.90 MB/s
> setting correct preferred node memcpy     Avg 6109.78 MB/s Max 6122.51 MB/s 
> Min 6002.58 MB/s
> running on node 3, preferred node 0
> local memory memcpy                       Avg 6113.31 MB/s Max 6114.15 MB/s 
> Min 6111.36 MB/s
> memory interleaved on all nodes memcpy    Avg 4595.63 MB/s Max 4597.44 MB/s 
> Min 4594.45 MB/s
> memory interleaved on node 0/1 memcpy     Avg 4269.70 MB/s Max 4271.32 MB/s 
> Min 4262.50 MB/s
> alloc on node 0 memcpy                    Avg 4292.58 MB/s Max 4299.23 MB/s 
> Min 4236.14 MB/s
> alloc on node 1 memcpy                    Avg 4312.16 MB/s Max 4318.60 MB/s 
> Min 4263.45 MB/s
> alloc on node 2 memcpy                    Avg 4224.20 MB/s Max 4230.26 MB/s 
> Min 4178.37 MB/s
> local allocation memcpy                   Avg 6113.42 MB/s Max 6114.42 MB/s 
> Min 6112.20 MB/s
> setting wrong preferred node memcpy       Avg 4292.88 MB/s Max 4300.75 MB/s 
> Min 4228.53 MB/s
> setting correct preferred node memcpy     Avg 6102.86 MB/s Max 6116.10 MB/s 
> Min 5999.90 MB/s
> 
> Now R710
> 
> # numademo 128M memcpy
> 2 nodes available
> memory with no policy memcpy              Avg 16900.16 MB/s Max 17843.36 MB/s 
> Min 13960.65 MB/s
> local memory memcpy                       Avg 17831.27 MB/s Max 17840.98 MB/s 
> Min 17772.47 MB/s
> memory interleaved on all nodes memcpy    Avg 13256.20 MB/s Max 13335.09 MB/s 
> Min 12613.26 MB/s
> memory on node 0 memcpy                   Avg 17838.38 MB/s Max 17843.36 MB/s 
> Min 17831.50 MB/s
> memory on node 1 memcpy                   Avg 10849.47 MB/s Max 10855.53 MB/s 
> Min 10843.25 MB/s
> memory interleaved on 0 1 memcpy          Avg 13330.99 MB/s Max 13333.77 MB/s 
> Min 13324.50 MB/s
> setting preferred node to 0
> memory without policy memcpy              Avg 17717.58 MB/s Max 17840.98 MB/s 
> Min 16712.46 MB/s
> setting preferred node to 1
> memory without policy memcpy              Avg 10852.45 MB/s Max 10856.40 MB/s 
> Min 10846.75 MB/s
> manual interleaving to all nodes memcpy   Avg 13331.78 MB/s Max 13333.77 MB/s 
> Min 13329.80 MB/s
> manual interleaving on node 0/1 memcpy    Avg 13306.01 MB/s Max 13333.77 MB/s 
> Min 13082.93 MB/s
> current interleave node 0
> running on node 0, preferred node 0
> local memory memcpy                       Avg 17603.71 MB/s Max 17840.98 MB/s 
> Min 16708.29 MB/s
> memory interleaved on all nodes memcpy    Avg 13327.68 MB/s Max 13333.77 MB/s 
> Min 13295.47 MB/s
> memory interleaved on node 0/1 memcpy     Avg 13331.92 MB/s Max 13333.77 MB/s 
> Min 13329.80 MB/s
> alloc on node 1 memcpy                    Avg 10734.41 MB/s Max 10855.53 MB/s 
> Min 10188.85 MB/s
> local allocation memcpy                   Avg 17838.14 MB/s Max 17840.98 MB/s 
> Min 17836.24 MB/s
> setting wrong preferred node memcpy       Avg 10467.28 MB/s Max 10855.53 MB/s 
> Min 7928.27 MB/s
> setting correct preferred node memcpy     Avg 17836.95 MB/s Max 17840.98 MB/s 
> Min 17831.50 MB/s
> running on node 1, preferred node 0
> local memory memcpy                       Avg 17358.28 MB/s Max 17843.36 MB/s 
> Min 13969.37 MB/s
> memory interleaved on all nodes memcpy    Avg 13332.18 MB/s Max 13335.09 MB/s 
> Min 13313.93 MB/s
> memory interleaved on node 0/1 memcpy     Avg 13334.56 MB/s Max 13336.42 MB/s 
> Min 13332.45 MB/s
> alloc on node 0 memcpy                    Avg 10852.10 MB/s Max 10854.65 MB/s 
> Min 10851.14 MB/s
> local allocation memcpy                   Avg 17837.43 MB/s Max 17843.36 MB/s 
> Min 17833.87 MB/s
> setting wrong preferred node memcpy       Avg 10853.24 MB/s Max 10855.53 MB/s 
> Min 10850.26 MB/s
> setting correct preferred node memcpy     Avg 17839.09 MB/s Max 17840.98 MB/s 
> Min 17833.87 MB/s
> 
> 
> That's quite a difference!
> 
> Luca, could you run the same on your test box, so we can compare?
> 
> Andrew
> 
> 
> -----Original Message-----
> From: Skidmore, Donald C [mailto:donald.c.skidm...@intel.com]
> Sent: Thursday, September 15, 2011 4:19 PM
> To: Luca Deri; LEHANE,ANDREW (A-Scotland,ex1)
> Cc: e1000-devel@lists.sourceforge.net
> Subject: RE: Problems with Dell R810.
> 
> Hey Luca,
> 
> Sounds like your memory may be a fair amount slower on the larger system.
> This isn't unusual, as these systems also support higher memory limits.  One
> quick way to test would be to run numademo:
> 
> numademo 128M memcpy
> 
> to see the differences between the two systems.
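> 
> A minimal comparison sketch, assuming numactl is installed alongside numademo;
> pinning the copy makes the local vs. remote bandwidth difference explicit:
> 
>     numactl --hardware                                          # node count, per-node memory, distances
>     numactl --cpunodebind=0 --membind=0 numademo 128M memcpy    # all-local copy
>     numactl --cpunodebind=0 --membind=1 numademo 128M memcpy    # forced remote copy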
> 
> Thanks,
> -Don Skidmore <donald.c.skidm...@intel.com>
> 
> 
> 
>> -----Original Message-----
>> From: Luca Deri [mailto:d...@ntop.org]
>> Sent: Thursday, September 15, 2011 7:52 AM
>> To: andrew_leh...@agilent.com
>> Cc: Skidmore, Donald C; e1000-devel@lists.sourceforge.net
>> Subject: Re: Problems with Dell R810.
>> 
>> Andrew,
>> just to be precise (not to tease you, of course), on an X3440 we can send
>> 14.88 Mpps (~26 Mpps on two ports), so we're quite close now. As for the 710
>> problem I reported, I will ask the 710 user who reported the issue.
>> 
>> Now the question is: where are all these issues coming from? Why does an 810
>> (more powerful than a 710) report much poorer performance? Could you read the
>> BIOS revision of your 710 so I can compare it with the one from the other
>> user who has issues?
>> 
>> That said: great news.
>> 
>> Cheers Luca
>> 
>> On Sep 15, 2011, at 4:45 PM, <andrew_leh...@agilent.com> wrote:
>> 
>>> Hi Donald and Luca,
>>> 
>>> I have managed to obtain the loan of an R710, and using the Silicom card
>>> and Luca's code I can send in excess of 14 million packets per second, so
>>> whatever problem Luca has reported with the R710, it is not the same as my
>>> issue with the R810! Unless, of course, my R810 has suffered the same fault
>>> as the R710 listed below and both are now broken in the same way. Luca, does
>>> a reboot clear your other user's problem, or is it permanent?
>>> 
>>> Luca, here are the details...
>>> 
>>> ./pfsend -i dna:eth4 -g 1 -l 60 -n 0 -r 10
>>> 
>>> TX rate: [current 14'238'148.23 pps/9.57 Gbps][average 14'223'555.75 pps/9.56 Gbps][total 2'147'799'248.00 pkts]
>>> TX rate: [current 14'240'502.43 pps/9.57 Gbps][average 14'223'667.24 pps/9.56 Gbps][total 2'162'040'021.00 pkts]
>>> TX rate: [current 14'239'155.21 pps/9.57 Gbps][average 14'223'768.47 pps/9.56 Gbps][total 2'176'279'461.00 pkts]
>>> TX rate: [current 14'238'531.22 pps/9.57 Gbps][average 14'223'864.33 pps/9.56 Gbps][total 2'190'518'277.00 pkts]
>>> 
>>> Thanks
>>> 
>>> Andrew
>>> 
>>> -----Original Message-----
>>> From: Luca Deri [mailto:d...@ntop.org]
>>> Sent: Thursday, September 15, 2011 3:05 PM
>>> To: Skidmore, Donald C
>>> Cc: LEHANE,ANDREW (A-Scotland,ex1); e1000-devel@lists.sourceforge.net
>>> Subject: Re: Problems with Dell R810.
>>> 
>>> Donald,
>>> another PF_RING user (Dell 710 and Intel 82576) has reported the following
>>> problem:
>>> 
>>> Wed Sep 14 2011 06:00:11 An OEM diagnostic event has occurred.
>>> Critical 0.000009 Wed Sep 14 2011 06:00:11 A bus fatal error was detected on a component at bus 0 device 6 function 0.
>>> Critical 0.000008 Wed Sep 14 2011 06:00:11 A bus fatal error was detected on a component at slot 1.
>>> Normal 0.000007 Wed Sep 14 2011 06:00:11 An OEM diagnostic event has occurred.
>>> Critical 0.000006 Wed Sep 14 2011 06:00:11 A bus fatal error was detected on a component at bus 0 device 5 function 0.
>>> Critical 0.000005 Wed Sep 14 2011 06:00:10 A bus fatal error was detected on a component at slot 2.
>>> Normal 0.000004 Wed Sep 14 2011 06:00:08 An OEM diagnostic event has occurred.
>>> Critical 0.000003 Wed Sep 14 2011 06:00:08 A bus fatal error was detected on a component at bus 0 device 6 function 0.
>>> Critical 0.000002 Wed Sep 14 2011 06:00:08 A bus fatal error was detected on a component at slot 1.
>>> Normal 0.000001 Wed Sep 14 2011 06:00:08 An OEM diagnostic event has occurred.
>>> 
>>> 
>>> Additionally, we captured the following logs as well:
>>> alloc kstat_irqs on node -1
>>> pcieport 0000:00:09.0: irq 62 for MSI/MSI-X
>>> pcieport 0000:00:09.0: setting latency timer to 64
>>> aer 0000:00:01.0:pcie02: PCIe errors handled by platform firmware.
>>> aer 0000:00:03.0:pcie02: PCIe errors handled by platform firmware.
>>> aer 0000:00:04.0:pcie02: PCIe errors handled by platform firmware.
>>> aer 0000:00:05.0:pcie02: PCIe errors handled by platform firmware.
>>> aer 0000:00:06.0:pcie02: PCIe errors handled by platform firmware.
>>> aer 0000:00:07.0:pcie02: PCIe errors handled by platform firmware.
>>> aer 0000:00:09.0:pcie02: PCIe errors handled by platform firmware.
>>> 
>>> I believe there's a BIOS issue on these Dells. What do you think?
>>> 
>>> Regards Luca
>>> 
>>> 
>>> On Sep 4, 2011, at 1:25 PM, Luca Deri wrote:
>>> 
>>>> Donald,
>>>> thanks for the reply. I don't think this is a PF_RING issue (even with the
>>>> vanilla ixgbe driver we observe the same behavior) but rather a Dell/Intel
>>>> issue. From what I see in dmesg, DCA is disabled and we have no way to
>>>> enable it; I'm not sure whether this is due to BIOS limitations. What I can
>>>> tell you is that a low-end Core 2 Duo is much faster than this
>>>> multiprocessor machine, which indicates that something is wrong with this
>>>> setup.
>>>> 
>>>> Regards Luca
>>>> 
>>>> On Sep 3, 2011, at 2:33 AM, Skidmore, Donald C wrote:
>>>> 
>>>>>> -----Original Message-----
>>>>>> From: andrew_leh...@agilent.com [mailto:andrew_leh...@agilent.com]
>>>>>> Sent: Thursday, September 01, 2011 2:17 AM
>>>>>> To: e1000-devel@lists.sourceforge.net
>>>>>> Cc: d...@ntop.org
>>>>>> Subject: [E1000-devel] Problems with Dell R810.
>>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I recently purchased a Dell R810 for use with Luca Deri's PF_RING
>>>>>> networking driver (built on the 10 Gigabit PCI Express network driver)
>>>>>> and a Silicom 10 Gig card that uses the 82599EB chipset; the machine is
>>>>>> running Fedora Core 14.
>>>>>> 
>>>>>> Luca's driver is described here:
>>>>>> http://www.ntop.org/blog/pf_ring/introducing-the-10-gbit-pf_ring-dna-driver/
>>>>>> 
>>>>>> Only the machine doesn't seem to want to play ball. We have tried a
>>>>>> number of things, and eventually Luca suggested this mailing list; I do
>>>>>> hope someone can help.
>>>>>> 
>>>>>> The machine spec is as follows:
>>>>>> 
>>>>>> - 2x Intel Xeon L7555 Processor (1.86GHz, 8C, 24M Cache, 5.86 GT/s QPI,
>>>>>>   95W TDP, Turbo, HT)
>>>>>> - DDR3-980MHz 128GB Memory for 2/4 CPU (16x 8GB Quad Rank LV RDIMMs), 1066MHz
>>>>>> - Additional 2x Intel Xeon L7555 Processor (1.86GHz, 8C, 24M Cache,
>>>>>>   5.86 GT/s QPI, 95W TDP, Turbo, HT), upgrade to 4 CPU
>>>>>> - 2x 600GB SAS 6Gbps 10k 2.5" HD
>>>>>> - Silicom 82599EB 10 Gigabit Ethernet NIC
>>>>>> 
>>>>>> According to Luca's experiments on his test machine, which is not an R810
>>>>>> (actually quite a low-spec machine by comparison), we should be getting
>>>>>> the following results. Unfortunately, the R810's performance is very
>>>>>> poor: it struggles at less than 8% of a 10 Gig link's capacity on one
>>>>>> core, whereas Luca's test application (byte and packet counts only) on
>>>>>> his machine can process 100% of a 10 Gig link on one core.
>>>>>> 
>>>>>> http://www.ntop.org/blog/pf_ring/how-to-sendreceive-26mpps-using-pf_ring-on-commodity-hardware/
>>>>>> 
>>>>>> Importantly, Luca also seems to be getting excellent CPU usage figures
>>>>>> (see the bottom of that page), indicating that both DCA and IOATDMA are
>>>>>> operating correctly. My problem is that even under light network load my
>>>>>> CPU hits 100% and packets are dropped, which suggests to me that
>>>>>> DCA/IOATDMA isn't working.
>>>>>> 
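>>>>>> A hedged spot-check (generic commands, not specific to this box) of whether
>>>>>> the dca and ioatdma drivers think they are active:
>>>>>> 
>>>>>>     dmesg | grep -iE 'dca|ioatdma'    # driver messages about DCA / I/OAT state
>>>>>>     lsmod | grep -E 'dca|ioatdma'     # confirm both modules are loaded
>>>>>>     ls /sys/class/dma/                # ioatdma channels appear as dma0chan0, dma1chan0, ...
>>>>>> 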
>>>>>> I have switched on IOATDMA in the Dell's BIOS (it's off by default) and
>>>>>> discovered the following site, which talks about configuring a machine to
>>>>>> use DCA and IOATDMA. I even found a chap who reported similar performance
>>>>>> problems, but with a Dell R710, and how he fixed it. I tried all this but
>>>>>> still no improvement!
>>>>>> 
>>>>>> http://www.mail-archive.com/ntop-m...@listgateway.unipi.it/msg01185.html
>>>>>> 
>>>>>> The R810 seems to use a 7500 chipset.
>>>>>> 
>>>>>> 
>>>>>> http://www.dell.com/downloads/global/products/pedge/pedge_r810_specsheet_en.pdf
>>>>>> 
>>>>>> So, I think this is the R810 chipset reference (see page 453):
>>>>>> http://www-techdoc.intel.com/content/dam/doc/datasheet/7500-chipset-datasheet.pdf
>>>>>> 
>>>>>> The program sets the bit (offset 0x8C, bit 0), but it doesn't seem to stay
>>>>>> set, so consecutive runs of "dca_probe" always say "DCA disabled, enabling
>>>>>> now."
>>>>>> 
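>>>>>> A hedged way to spot-check that register from a shell (offset 0x8c is the
>>>>>> assumption carried over from the program below; substitute your bridge's
>>>>>> bus:device.function for 00:00.0):
>>>>>> 
>>>>>>     lspci -d 8086: | grep -i bridge                           # list the Intel bridges
>>>>>>     val=$(setpci -s 00:00.0 0x8c.l)                           # read the dword holding the enable bit
>>>>>>     setpci -s 00:00.0 0x8c.l=$(printf '%x' $((0x$val | 1)))   # write it back with bit 0 set
>>>>>>     setpci -s 00:00.0 0x8c.l                                  # re-read to see whether it stuck
>>>>>> 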
>>>>>> I commented out some of the defines in the original code, as they are
>>>>>> already set in the Linux kernel, and, of course, changed the registers to
>>>>>> point to the ones on page 453 - I hope they are correct.
>>>>>> 
>>>>>> Still no luck: the CPU usage is way too high.
>>>>>> 
>>>>>> #define _XOPEN_SOURCE 500
>>>>>> 
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> #include <pci/pci.h>
>>>>>> #include <sys/io.h>
>>>>>> #include <fcntl.h>
>>>>>> #include <sys/stat.h>
>>>>>> #include <sys/types.h>
>>>>>> #include <unistd.h>
>>>>>> 
>>>>>> #define INTEL_BRIDGE_DCAEN_OFFSET    0x8c
>>>>>> #define INTEL_BRIDGE_DCAEN_BIT       0
>>>>>> /* #define PCI_HEADER_TYPE_BRIDGE    1      */
>>>>>> /* #define PCI_VENDOR_ID_INTEL       0x8086 */ /* lol @ intel */
>>>>>> /* #define PCI_HEADER_TYPE           0x0e   */
>>>>>> #define MSR_P6_DCA_CAP               0x000001f8
>>>>>> #define NUM_CPUS                     64
>>>>>> 
>>>>>> /* Read the DCA enable dword on an Intel bridge and set bit 0 if it is clear. */
>>>>>> void check_dca(struct pci_dev *dev)
>>>>>> {
>>>>>>     u32 dca = pci_read_long(dev, INTEL_BRIDGE_DCAEN_OFFSET);
>>>>>> 
>>>>>>     printf("DCA old value %d.\n", dca);
>>>>>>     if (!(dca & (1 << INTEL_BRIDGE_DCAEN_BIT))) {
>>>>>>         printf("DCA disabled, enabling now.\n");
>>>>>>         dca |= 1 << INTEL_BRIDGE_DCAEN_BIT;
>>>>>>         printf("DCA new value %d.\n", dca);
>>>>>>         pci_write_long(dev, INTEL_BRIDGE_DCAEN_OFFSET, dca);
>>>>>>     } else {
>>>>>>         printf("DCA already enabled!\n");
>>>>>>     }
>>>>>> }
>>>>>> 
>>>>>> /* Set bit 0 of MSR 0x1f8 (DCA capability) on every CPU via /dev/cpu/N/msr. */
>>>>>> void msr_dca_enable(void)
>>>>>> {
>>>>>>     char msr_file_name[64];
>>>>>>     int fd = 0, i = 0;
>>>>>>     u64 data;
>>>>>> 
>>>>>>     for (; i < NUM_CPUS; i++) {
>>>>>>         sprintf(msr_file_name, "/dev/cpu/%d/msr", i);
>>>>>>         fd = open(msr_file_name, O_RDWR);
>>>>>>         if (fd < 0) {
>>>>>>             perror("open failed!");
>>>>>>             exit(1);
>>>>>>         }
>>>>>>         if (pread(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
>>>>>>             perror("reading msr failed!");
>>>>>>             exit(1);
>>>>>>         }
>>>>>> 
>>>>>>         printf("got msr value: %*llx\n", 1, (unsigned long long)data);
>>>>>>         if (!(data & 1)) {
>>>>>>             data |= 1;
>>>>>>             if (pwrite(fd, &data, sizeof(data), MSR_P6_DCA_CAP) != sizeof(data)) {
>>>>>>                 perror("writing msr failed!");
>>>>>>                 exit(1);
>>>>>>             }
>>>>>>         } else {
>>>>>>             printf("msr already enabled for CPU %d\n", i);
>>>>>>         }
>>>>>>     }
>>>>>> }
>>>>>> 
>>>>>> /* Walk the PCI bus, enable DCA on every Intel bridge, then set the MSR bit. */
>>>>>> int main(void)
>>>>>> {
>>>>>>     struct pci_access *pacc;
>>>>>>     struct pci_dev *dev;
>>>>>>     u8 type;
>>>>>> 
>>>>>>     pacc = pci_alloc();
>>>>>>     pci_init(pacc);
>>>>>> 
>>>>>>     pci_scan_bus(pacc);
>>>>>>     for (dev = pacc->devices; dev; dev = dev->next) {
>>>>>>         pci_fill_info(dev, PCI_FILL_IDENT | PCI_FILL_BASES);
>>>>>>         if (dev->vendor_id == PCI_VENDOR_ID_INTEL) {
>>>>>>             type = pci_read_byte(dev, PCI_HEADER_TYPE);
>>>>>>             if (type == PCI_HEADER_TYPE_BRIDGE) {
>>>>>>                 check_dca(dev);
>>>>>>             }
>>>>>>         }
>>>>>>     }
>>>>>> 
>>>>>>     msr_dca_enable();
>>>>>>     return 0;
>>>>>> }
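>>>>>> 
>>>>>> A hedged build/run sketch (the file name dca_probe.c is an assumption; it
>>>>>> needs the pciutils development headers, root privileges, and the msr module
>>>>>> so that /dev/cpu/*/msr exists):
>>>>>> 
>>>>>>     gcc -Wall -o dca_probe dca_probe.c -lpci
>>>>>>     modprobe msr
>>>>>>     ./dca_probe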
>>>>>> 
>>>>>> As you can see, the ixgbe, dca and ioatdma modules are loaded.
>>>>>> 
>>>>>> # lsmod
>>>>>> 
>>>>>> Module                  Size  Used by
>>>>>> ixgbe                 200547  0
>>>>>> pf_ring               327754  4
>>>>>> tcp_lp                  2111  0
>>>>>> fuse                   61934  3
>>>>>> sunrpc                201569  1
>>>>>> ip6t_REJECT             4263  2
>>>>>> nf_conntrack_ipv6      18078  4
>>>>>> ip6table_filter         1687  1
>>>>>> ip6_tables             17497  1 ip6table_filter
>>>>>> ipv6                  286505  184 ip6t_REJECT,nf_conntrack_ipv6
>>>>>> uinput                  7368  0
>>>>>> ioatdma                51376  72
>>>>>> i7core_edac            16210  0
>>>>>> dca                     5590  2 ixgbe,ioatdma
>>>>>> bnx2                   65569  0
>>>>>> mdio                    3934  0
>>>>>> ses                     6319  0
>>>>>> dcdbas                  8540  0
>>>>>> edac_core              41336  1 i7core_edac
>>>>>> iTCO_wdt               11256  0
>>>>>> iTCO_vendor_support     2610  1 iTCO_wdt
>>>>>> power_meter             9545  0
>>>>>> hed                     2206  0
>>>>>> serio_raw               4640  0
>>>>>> microcode              18662  0
>>>>>> enclosure               7518  1 ses
>>>>>> megaraid_sas           37653  2
>>>>>> 
>>>>>> # uname -a
>>>>>> Linux test 2.6.35.14-95.fc14.x86_64 #1 SMP Tue Aug 16 21:01:58 UTC
>>>>>> 2011
>>>>>> x86_64 x86_64 x86_64 GNU/Linux
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Andrew
>>>>> 
>>>>> Hey Andrew,
>>>>> 
>>>>> Sorry you're having issues with the 82599 and ixgbe.  I haven't done much
>>>>> with the PF_RING networking driver, but maybe we can see what is going on
>>>>> with the ixgbe driver.  It would help to have a little more information,
>>>>> such as:
>>>>> 
>>>>> - Were there any interesting system log messages of note?
>>>>> 
>>>>> - How are your interrupts being divided among your queues (cat
>>>>> /proc/interrupts)?  I know you're testing with just one CPU; are you also
>>>>> using just one queue, or affinitizing one to that CPU? (See the sketch
>>>>> after this list.)
>>>>> 
>>>>> - Could you provide the lspci -vvv output, to verify your NIC is getting a
>>>>> PCIe x8 connection?
>>>>> 
>>>>> - What kind of CPU usage are you seeing if you use just the base driver
>>>>> running at line rate with something like netperf/iperf?
>>>>> 
>>>>> - Have you attempted this without DCA? Like I said above, I don't have
>>>>> much experience with PF_RING, so I may be missing some fundamental
>>>>> advantage it is supposed to gain from operating with DCA in this mode.
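>>>>> 
>>>>> A minimal sketch of those checks (eth4, IRQ 62 and the PCI address 01:00.0
>>>>> are assumptions; substitute your own device):
>>>>> 
>>>>>     grep eth4 /proc/interrupts                       # one row per queue vector, per-CPU counts
>>>>>     echo 1 > /proc/irq/62/smp_affinity               # pin one vector to CPU 0 (value is a hex CPU mask)
>>>>>     lspci -vvv -s 01:00.0 | grep -E 'LnkCap|LnkSta'  # confirm the card negotiated a x8 link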
>>>>> 
>>>>> These are just off the top of my head; if I think of anything else, I'll
>>>>> let you know.
>>>>> 
>>>>> Thanks,
>>>>> -Don Skidmore <donald.c.skidm...@intel.com>
>>>> 
>>>> ---
>>>> 
>>>> "Debugging is twice as hard as writing the code in the first place.
>>>> Therefore, if you write the code as cleverly as possible, you are,
>>>> by definition, not smart enough to debug it. - Brian W. Kernighan
>>>> 
