On Aug 22, 2013, at 6:41 AM, Andrew Fish <af...@apple.com> wrote:
>
>
> On Aug 22, 2013, at 5:58 AM, "Cohen, Eugene" <eug...@hp.com> wrote:
>
>> Thanks for the responses Siyuan and Andrew.
>>
>> I think I understand your explanation -- to get the payload aligned properly
>> so higher layers can get the best performance and not necessarily align the
>> start of the frame itself. Do you have some data you can share on how much
>> improvement aligning the payload has? I would assume network performance in
>> UEFI would be limited more by the latency of timer tick polling (since we
>> don’t get real interrupts) rather than payload alignment.
>>
>> DMA double-buffering is not happening. The UEFI network driver we’re using
>> (from one of the big networking guys) uses common buffer mappings instead.
>> Because of the maturity of the network driver I don’t think it’s reasonable
>> to ask the vendor to change their driver’s DMA scheme to use BusMasterRead
>> and BusMasterWrite instead of common buffers (it could even be impossible
>> because of HW limitations).
>
> I don't quite understand how this works. If the CommonBuffer becomes the user
> data, where does Unmap() happen? It seems like this driver would leak uncached
> memory. CommonBuffer means you are going to keep doing DMA into the buffer,
> not that you have passed it off to the consumer?
>
And on a system with an IOMMU this would be a potential security issue.
Thanks,
Andrew Fish
> Thanks,
>
> Andrew Fish
>
>> For our systems, which do not support cache-coherent DMA (ARM), the common
>> buffers must be uncached. The common buffers themselves are accessed in an
>> aligned manner, but the caller's (cached) buffer is unaligned for the reasons
>> we're discussing. So this forces a CopyMem from an aligned uncached
>> location to an unaligned cached location. The memory copy code must
>> downshift to a byte copy because of this misalignment and we get horrible
>> performance (byte accesses to uncached memory regions are the worst possible
>> workload). I experimented with changing the padding size from 6 to 8, and
>> performance improved significantly since the CopyMem could operate
>> efficiently.
>>
>> So it looks like we have two competing optimizations. As you can imagine,
>> on my platform the slow down from the uncached byte copy is far worse than
>> the misaligned accesses to the cached IP protocol fields. Is there some way
>> we can address both concerns? Here are some options I can think of:
>>
>> 1. Add some parameter (PCD value) to configure MNP to either optimize
>> for aligned payload or aligned frame
>> 2. Add the option to double-buffer so the first CopyMem (from uncached
>> to cached) is frame-aligned and then do a second CopyMem to a buffer that is
>> payload-aligned.
>> a. This is really no different from BusMasterRead/BusMasterWrite
>> double-buffering; it would just need to be done somewhere above the
>> driver, maybe in the SNP driver on top of UNDI. Unfortunately there is no
>> DMA Unmap() call in this common-buffer case where we could add the
>> additional CopyMem, so it would have to be explicit.
>> 3. Analyze the performance benefit of the aligned payload and if it’s
>> not significant enough, abandon that approach and just use frame-aligned
>> buffers (we need data)
>> 4. Extend some protocol interfaces so that higher layers can ask lower
>> layers what the required alignment is (like IoAlign in BLOCK_IO). So on our
>> platform we would say that frame alignment on 4 bytes is required. Perhaps
>> on X64 it would be payload alignment on 4 bytes instead.
>>
>> 1, 3, and 4 are the best-performing options since they avoid the need for an
>> additional CopyMem, so those would be my preference. #1 has the downside
>> that we're tuning for a particular DMA and driver scheme with a PCD value
>> for a hardware-independent service (not the greatest architectural
>> approach). Even if we decide to pursue #4 in the long term, it would still
>> be helpful to me to do #1 in the short term.
>>
>> Do you have other options or preferences for which approach is used?
>>
>> Eugene
>>
>> From: Andrew Fish [mailto:af...@apple.com]
>> Sent: Thursday, August 22, 2013 1:38 AM
>> To: edk2-devel@lists.sourceforge.net
>> Cc: Cohen, Eugene; edk2-devel@lists.sourceforge.net
>> Subject: Re: [edk2] MNP PaddingSize Question
>>
>>
>>
>> Sent from my iPhone
>>
>> On Aug 22, 2013, at 12:15 AM, "Fu, Siyuan" <siyuan...@intel.com> wrote:
>>
>> Hi, Eugene
>>
>> The PaddingSize exists to make the packet data (excluding the media header)
>> 4-byte aligned when we receive a packet.
>> When the MNP driver calls the Snp.Receive() interface, both the media header
>> and the data are placed into *Buffer*. Take an IP packet over Ethernet as an
>> example: the media header is 14 bytes long (2 * 6-byte MAC addresses + a
>> 2-byte protocol type), and the IP4 header immediately follows the media
>> header. The EFI network stack is designed to minimize the number of memory
>> copies, so most of the upper-layer drivers operate on this buffer
>> directly.
>> Thus we have two choices:
>> (1) If the *Buffer* passed to Snp.Receive() is 4-byte aligned, the packet
>> data will start at a non-dword-aligned address. Since most network protocols
>> are designed with alignment in mind, the upper-layer protocol fields, like
>> the IP, UDP, and TCP data items, will also start at non-dword-aligned
>> addresses. I think parsing these fields at unaligned addresses will also
>> have a performance cost.
>> (2) If we make the packet data aligned, the *Buffer* is unaligned, which
>> brings the performance issue you described. Fortunately this unaligned
>> memory copy only happens once per packet (only in the SNP or UNDI driver).
>> I think that's why the MNP driver tries to align a later part of the
>> Ethernet packet. I have tested PXE boot and TCP download on my side and do
>> not see a clear difference between the two approaches (maybe because my
>> UNDI driver does not use DMA?).
>>
>>
>> ARM platforms have to do DMA into uncached buffers. This is why it is so
>> important to follow the EFI DMA rules.
>>
>> Eugene, have you tried double buffering the data into a cached buffer? I
>> wonder if you have a lot of small misaligned accesses to uncached memory,
>> and a single copy to a cached buffer would be less overhead. Or maybe you
>> could enable caching on the buffer after DMA completes?
>>
>>
>> Hope my explanation is helpful.
>>
>> Fu, Siyuan
>> From: Cohen, Eugene [mailto:eug...@hp.com]
>> Sent: Thursday, August 22, 2013 11:46 AM
>> To: edk2-devel@lists.sourceforge.net
>> Subject: Re: [edk2] MNP PaddingSize Question
>>
>> Ruth,
>>
>> The performance impact is related to unaligned copies to uncached buffers.
>> So I suppose any machine that must make use of uncached buffers for DMA
>> coherency would have the same slowdown, although I have not had a reason to
>> measure this on other platforms.
>>
>> The code seems strange since, for a normal driver (UNDI, SNP), the receive
>> buffer address passed down is no longer 4-byte aligned. Apparently this
>> code is trying to align a later part of the Ethernet packet (the payload,
>> not the header), but I can't think of a reason for this.
>>
>> Eugene
>>
>> From: Li, Ruth [mailto:ruth...@intel.com]
>> Sent: Wednesday, August 21, 2013 7:55 PM
>> To: edk2-devel@lists.sourceforge.net
>> Subject: Re: [edk2] MNP PaddingSize Question
>>
>> Hi Eugene,
>>
>> The piece of code below has been there for a long time. We need some time
>> to evaluate it and assess the possible impact.
>>
>> By the way, do you see the performance impact only on your machine, or on
>> all machines generally?
>>
>> Thanks,
>> Ruth
>> From: Cohen, Eugene [mailto:eug...@hp.com]
>> Sent: Tuesday, August 20, 2013 3:56 AM
>> To: edk2-devel@lists.sourceforge.net
>> Subject: [edk2] MNP PaddingSize Question
>>
>> I’ve been tracking down a performance issue and have isolated it to this
>> piece of MNP initialization code:
>>
>> //
>> // Make sure the protocol headers immediately following the media header
>> // are 4-byte aligned, and also preserve additional space for the VLAN tag
>> //
>> MnpDeviceData->PaddingSize = ((4 - SnpMode->MediaHeaderSize) & 0x3) +
>>                              NET_VLAN_TAG_LEN;
>>
>> On my system this comes up with '6' (MediaHeaderSize = 0xE), which causes
>> performance problems since some of the memory copies to the resulting
>> non-dword-aligned addresses are slower. As an experiment I tried bumping
>> this number to '8' and things worked well.
>>
>> This value is used later when NET_BUFs are being allocated:
>>
>> if (MnpDeviceData->PaddingSize > 0) {
>>   //
>>   // Pad padding bytes before the media header
>>   //
>>   NetbufAllocSpace (Nbuf, MnpDeviceData->PaddingSize, NET_BUF_TAIL);
>>   NetbufTrim (Nbuf, MnpDeviceData->PaddingSize, NET_BUF_HEAD);
>> }
>>
>> Can someone explain the purpose of PaddingSize and how that affects the
>> later processing of packets? Is this number a minimum value that is OK to
>> be larger?
>>
>> Thanks,
>>
>> Eugene
>>
>> ------------------------------------------------------------------------------
>> Introducing Performance Central, a new site from SourceForge and
>> AppDynamics. Performance Central is your source for news, insights,
>> analysis and resources for efficient Application Performance Management.
>> Visit us today!
>> http://pubads.g.doubleclick.net/gampad/clk?id=48897511&iu=/4140/ostg.clktrk
>> _______________________________________________
>> edk2-devel mailing list
>> edk2-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/edk2-devel
>