Re: [E1000-devel] 64B Aligned DMA writes with 82599

Alexander Duyck Tue, 01 May 2012 08:49:30 -0700

On 04/30/2012 03:46 PM, Moris Bangoura wrote:
> On 30.4.2012 19:56, Alexander Duyck wrote:
>> On 04/30/2012 02:02 AM, Moris Bangoura wrote:
>>> Hi,
>>>
>>> we are working with modified ixgbe drivers (packetshader - ixgbe
>>> 2.0.38.2, netmap - ixgbe 3.9.15) that allow receiving/sending small
>>> frames at wire speed.
>>>
>>> In our lab  we use 2 CPU NUMA architecture (Xeon CPU, Intel 5520
>>> chipset), 2x dual 10GbE 82599 cards.
>>>
>>> There is a problem with receiving small frames whose length is not a
>>> multiple of 64B (without or with the 4B CRC, depending on whether the
>>> RDRXCTL.CRCStrip and HLREG0.RXCRCSTRP bits are set to 1 or 0).
>>>
>>> We suspect that the problem is somewhere in the 82599 DMA engine, the
>>> Intel 5520 IOH, QPI, or the CPU cache line.
>>>
>>> What we discovered:
>>>
>>> 1. If CRCStrip reg is set to 1:
>>> - RX of 60B(+4B CRC) frame is 6.9 Mpps (PCIe TLP payload is 60B)
>>> - RX of 64B(+4B CRC) frame is 14.2 Mpps (PCIe TLP payload is 64B)
>>>   -> OK, wirespeed.
>>>
>>> 2. If CRCStrip reg is set to 0:
>>> - RX of 60B(+4B CRC) is < 14.8 Mpps (PCIe TLP payload is 64B)
>>>   -> OK, wirespeed.
>>> - RX of 61B(+4B CRC) is < 6.9 Mpps (PCIe TLP payload is 65B)
>>>
>>> Is there a possible workaround so that the 82599 DMA engine always pads
>>> the length of the Memory Write Request payload to a multiple of 64B?
>>>
>>> Example:
>>> 0. 64B frame is received on Rx MAC with CRCStrip reg set to 1.
>>> 1. The receive DMA fetches the next RX descriptor from the appropriate
>>> host memory ring to be used for the next
>>> received packet.
>>> 2. The receive DMA posts the packet appended with 4B (so the Memory Write
>>> Request payload length is a multiple of 64B) to the location indicated by
>>> the RX descriptor through the PCIe interface.
>>> 3. When the packet is placed into host memory, the receive DMA updates
>>> all the RX descriptor(s) that were used by the
>>> packet data (real non-appended packet length is reported via PKT_LEN).
>>> 4. The receive DMA writes back the RX descriptor content along with
>>> status bits that indicate the packet information
>>> including what offloads were done on that packet.
>>> 5. The 82599 initiates an interrupt indicating that a new packet is ready
>>> in host memory. The host reads the packet data (only the PKT_LEN bytes
>>> indicated by the RX descriptor).
>>>
>>> Maybe there is some 82599 RX DMA register/bit that is not covered by
>>> 82599 datasheet (version 2.75).
>>>
>>> Regards,
>>>
>> Moris,
>>
>> Are you seeing this issue with both the 2.0.38.2 and 3.9.15 drivers, or
>> is this mainly with 2.0.38.2?  I just want to clarify since the 3.9.15
>> driver should be significantly more optimized than 2.0.38.2 driver.
>>
>> The behaviour you are describing sounds like an issue with partial cache
>> line writes.  This is an issue for most architectures because it
>> typically requires a read/modify/write cycle to write the cache line
>> instead of being just a direct write as in the case of a full cache line
>> write.  The 3.9.15 driver contains several updates since the 2.0.38.2 in
>> regards to partial cache line writes and will likely show much better
>> performance.  Specifically it will cut the number of partial cache line
>> writes in half by aligning the buffers with the start of a cache line.
>>
>> The hardware itself doesn't contain any workarounds for this, but I
>> would recommend testing with the 3.9.15 driver instead of the 2.0.38.2
>> driver as it will contain several software improvements that may help to
>> improve the performance.
>>
>> Thanks,
>>
>> Alex
> Hi,
>
> thank you for the quick answer.
>
> Yes, the issue could be seen in both versions - same results.
>
> Each address for an 82599 DMA write is aligned... each cell of the RX
> packet data buffer ring in RAM has a fixed size (2048B) with 64B alignment.
> The modified driver also uses 64B-aligned memcpy and prefetching...
>
> Do I understand correctly that partial cache line writes cannot occur
> with a 64B-aligned address and a packet of, for example, 60B (a 64B
> Ethernet packet with the 4B CRC stripped)?
> See ftp://download.intel.com/design/intarch/PAPERS/321071.pdf.
>
> Maybe there is some register/bit for 82599 DMA RX, that is not covered
> by 82599 datasheet... and could alter 82599 DMA write.
>
> For example IXGBE_RDRXCTL_AGGDIS... are DMA RX writes somehow aggregated?
>
> Thank you,
>
Hi,


A partial cache line write occurs any time you are not writing the
complete cache line.  In the case of your 60B packet with the 4B CRC not
being stripped, you do not trigger a partial cache line write because the
write starts cache line aligned and covers exactly one full cache line.
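
To make the rule concrete, here is a minimal sketch that counts how many
cache lines a single DMA write only partly covers. It assumes a 64B cache
line; the function name `partial_lines` is hypothetical and exists only for
illustration, it is not part of ixgbe or any Intel tool.

```python
CACHE_LINE = 64  # bytes; assumed cache line size for this platform


def partial_lines(addr, length, line=CACHE_LINE):
    """Count cache lines only partly covered by a write of `length` bytes at `addr`."""
    if length <= 0:
        return 0
    end = addr + length                 # one past the last byte written
    first = addr // line                # index of the first line touched
    last = (end - 1) // line            # index of the last line touched
    head_partial = addr % line != 0     # write starts mid-line
    tail_partial = end % line != 0      # write ends mid-line
    if first == last:
        # The whole write sits inside a single line.
        return 1 if (head_partial or tail_partial) else 0
    return int(head_partial) + int(tail_partial)
```

With an aligned buffer, `partial_lines(0, 64)` is 0 and `partial_lines(0, 60)`
and `partial_lines(0, 65)` are both 1, which matches the payload sizes in
your measurements. An unaligned start such as `partial_lines(2, 64)` gives 2,
which is why aligning the buffers halves the number of partial writes.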

The problem with a partial cache line write is that it triggers a read
from system memory in order for the data to be merged.  The performance
difference you are seeing is likely due to this memory read.  One thing
you may want to look into is the configuration of your system memory. 
You will want to verify that all memory channels are populated with at
least 1 DIMM so that you can get the maximum throughput.
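
For a rough sense of scale, the sketch below estimates the extra memory read
traffic at 10GbE wire speed. It is a simplification that assumes exactly one
64B memory read per partially written cache line, which real chipsets need
not follow; both function names are hypothetical.

```python
def wire_rate_pps(frame_bytes, link_bps=10e9):
    """Packets per second at line rate for a frame of `frame_bytes` (CRC included)."""
    # Each frame also occupies 8B of preamble and 12B of inter-frame gap on the wire.
    return link_bps / ((frame_bytes + 20) * 8)


def extra_read_bytes_per_sec(pps, partial_lines_per_packet, line=64):
    """Assumed model: every partial cache line write costs one full-line read."""
    return pps * partial_lines_per_packet * line
```

For 64B frames, `wire_rate_pps(64)` is about 14.88 Mpps, and one partial line
per packet at that rate would add roughly 950 MB/s of reads under this model.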

There are no registers/bits on the 82599 that allow it to force a full
cache line write.  The AGGDIS bit you are referring to is a bit for
disabling receive side coalescing.  It is controlled through the ethtool
LRO flag.

Thanks,

Alex


