On Aug 22, 2013, at 6:41 AM, Andrew Fish <af...@apple.com> wrote:

> 
> 
> On Aug 22, 2013, at 5:58 AM, "Cohen, Eugene" <eug...@hp.com> wrote:
> 
>> Thanks for the responses, Siyuan and Andrew. 
>>  
>> I think I understand your explanation -- the goal is to get the payload 
>> aligned properly so higher layers get the best performance, not necessarily 
>> to align the start of the frame itself.  Do you have any data you can share 
>> on how much improvement aligning the payload gives?  I would assume network 
>> performance in UEFI is limited more by the latency of timer-tick polling 
>> (since we don’t get real interrupts) than by payload alignment.
>>  
>> DMA double-buffering is not happening.  The UEFI network driver we’re using 
>> (from one of the big networking guys) uses common buffer mappings instead.  
>> Because of the maturity of the network driver I don’t think it’s reasonable 
>> to ask the vendor to change their driver’s DMA scheme to use BusMasterRead 
>> and BusMasterWrite instead of common buffers (it could even be impossible 
>> because of HW limitations).
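>>
>> For contrast, a Map()-based receive path would double-buffer inside the DMA 
>> layer, roughly like this (a rough sketch with error handling omitted, not 
>> the vendor’s actual code; RxBuffer and FrameLength are illustrative names):
>>
>>   EFI_PHYSICAL_ADDRESS  DeviceAddress;
>>   UINTN                 NumberOfBytes;
>>   VOID                  *Mapping;
>>   EFI_STATUS            Status;
>>
>>   NumberOfBytes = FrameLength;
>>   //
>>   // BusMasterWrite = the device writes to system memory (a receive).
>>   // Map() is allowed to substitute a bounce buffer here.
>>   //
>>   Status = PciIo->Map (
>>                     PciIo,
>>                     EfiPciIoOperationBusMasterWrite,
>>                     RxBuffer,           // the caller’s (cached) buffer
>>                     &NumberOfBytes,
>>                     &DeviceAddress,
>>                     &Mapping
>>                     );
>>   // ... program the NIC with DeviceAddress and wait for the frame ...
>>   //
>>   // Unmap() is where any bounce buffer gets copied back -- exactly the
>>   // hook that does not exist in the common-buffer scheme.
>>   //
>>   Status = PciIo->Unmap (PciIo, Mapping);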
> 
> I don't quite understand how this works. If the CommonBuffer becomes the user 
> data, where does Unmap() happen?  It seems like this driver would leak 
> uncached memory.  CommonBuffer means you are going to keep doing DMA into the 
> buffer, not that you have passed it off to the consumer.
> 
> 

And on a system with an IOMMU this would be a potential security issue. 

Thanks,

Andrew Fish

> Thanks,
> 
> Andrew Fish
> 
>> For our systems, which do not support cache-coherent DMA (ARM), the common 
>> buffers must be uncached.  The common buffers themselves are accessed in an 
>> aligned manner, but the caller’s (cached) buffer is unaligned for the reasons 
>> we’re discussing.  So this forces a CopyMem from an aligned, uncached 
>> location to an unaligned, cached location.  The memory copy code must 
>> downshift to a byte copy because of this misalignment, and we get horrible 
>> performance (byte accesses to uncached memory regions are the worst possible 
>> workload).  I experimented with changing the padding size from 6 to 8, and 
>> performance improved significantly since the CopyMem could then operate 
>> efficiently.
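>>
>> To illustrate the downshift, a copy loop along these lines (a simplified 
>> sketch, not the actual BaseMemoryLib code) can only stay on its fast path 
>> when source and destination are mutually aligned:
>>
>>   //
>>   // Simplified sketch of an alignment-sensitive copy (not the real CopyMem).
>>   // With PaddingSize = 6 one side of the copy is dword aligned and the other
>>   // is not, so the word path is unsafe and every access becomes a byte
>>   // access -- the worst case when the source is uncached.
>>   //
>>   VOID
>>   SketchCopy (VOID *Dst, CONST VOID *Src, UINTN Len)
>>   {
>>     if ((((UINTN) Dst | (UINTN) Src) & 3) == 0) {
>>       UINT32        *D = (UINT32 *) Dst;
>>       CONST UINT32  *S = (CONST UINT32 *) Src;
>>       for (; Len >= 4; Len -= 4) {
>>         *D++ = *S++;              // fast path: 32-bit accesses
>>       }
>>       Dst = D;
>>       Src = S;
>>     }
>>     UINT8        *Db = (UINT8 *) Dst;
>>     CONST UINT8  *Sb = (CONST UINT8 *) Src;
>>     while (Len-- > 0) {
>>       *Db++ = *Sb++;              // fallback: byte-at-a-time
>>     }
>>   }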
>>  
>> So it looks like we have two competing optimizations.  As you can imagine, 
>> on my platform the slowdown from the uncached byte copy is far worse than 
>> the misaligned accesses to the cached IP protocol fields.  Is there some way 
>> we can address both concerns?  Here are some options I can think of:
>>  
>> 1. Add a parameter (PCD value) to configure MNP to optimize either for an 
>> aligned payload or for an aligned frame.
>> 2. Add the option to double-buffer, so the first CopyMem (from uncached 
>> to cached) is frame-aligned and a second CopyMem goes to a buffer that is 
>> payload-aligned.
>>    a. This is really no different than if BusMasterRead/BusMasterWrite 
>> double-buffering were used; it would just need to be done somewhere above the 
>> driver, maybe in the SNP driver on top of UNDI.  Unfortunately there is no 
>> DMA Unmap() call in this common-buffer case that we could use to add the 
>> additional CopyMem, so it would have to be explicit.
>> 3. Analyze the performance benefit of the aligned payload and, if it’s 
>> not significant enough, abandon that approach and just use frame-aligned 
>> buffers (we need data).
>> 4. Extend some protocol interfaces so that higher layers can ask lower 
>> layers what the required alignment is (like IoAlign in BLOCK_IO); a sketch of 
>> what I mean follows this list.  On our platform we would say that frame 
>> alignment on 4 bytes is required; perhaps on X64 it would be payload 
>> alignment on 4 bytes instead.
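>>
>> To make #4 concrete, I am imagining something along these lines (purely 
>> hypothetical; the type and field names are invented, and nothing like this 
>> exists in the spec today):
>>
>>   //
>>   // Hypothetical alignment descriptor, by analogy with IoAlign in
>>   // EFI_BLOCK_IO_MEDIA.  Invented for illustration only.
>>   //
>>   typedef struct {
>>     UINT32  FrameAlignment;    // required alignment of the frame start; 0 = don't care
>>     UINT32  PayloadAlignment;  // preferred alignment of the payload; 0 = don't care
>>   } HYPOTHETICAL_NET_ALIGNMENT_INFO;
>>
>> Our ARM platform would report { 4, 0 }; an implementation that benefits most 
>> from aligned payload parsing might report { 0, 4 } instead.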
>>  
>> 1, 3, and 4 are the best-performing options since they avoid the need for an 
>> additional CopyMem, so those would be my preference.  #1 has the downside 
>> that we’d be tuning a hardware-independent service for a particular DMA and 
>> driver scheme via a PCD value (not the greatest architectural approach).  
>> Even if we decide to pursue #4 in the long term, it would still be helpful to 
>> me to do #1 in the short term.
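>>
>> For #1, I imagine a single feature PCD along these lines (the PCD name is 
>> invented for illustration; the VLAN reservation is ignored for brevity):
>>
>>   //
>>   // Hypothetical PCD: choose which edge MNP aligns.
>>   //
>>   if (FeaturePcdGet (PcdMnpAlignPayload)) {
>>     // Current behavior: pad so the payload after the media header is aligned.
>>     MnpDeviceData->PaddingSize = ((4 - SnpMode->MediaHeaderSize) & 0x3) + NET_VLAN_TAG_LEN;
>>   } else {
>>     // Keep the frame start aligned instead (what our platform wants).
>>     MnpDeviceData->PaddingSize = 0;
>>   }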
>>  
>> Do you have other options or preferences for which approach is used?
>>  
>> Eugene
>>  
>> From: Andrew Fish [mailto:af...@apple.com] 
>> Sent: Thursday, August 22, 2013 1:38 AM
>> To: edk2-devel@lists.sourceforge.net
>> Cc: Cohen, Eugene; edk2-devel@lists.sourceforge.net
>> Subject: Re: [edk2] MNP PaddingSize Question
>>  
>> 
>> 
>> Sent from my iPhone
>> 
>> On Aug 22, 2013, at 12:15 AM, "Fu, Siyuan" <siyuan...@intel.com> wrote:
>> 
>> Hi, Eugene
>>  
>> PaddingSize exists to make the packet data (excluding the media header) 
>> 4-byte aligned when we receive a packet.
>> When the MNP driver calls the Snp.Receive() interface, both the media header 
>> and the data are placed into *Buffer*.  Take an IP packet over Ethernet as an 
>> example: the media header is 14 bytes long (2 * 6-byte MAC addresses + a 
>> 2-byte protocol type), and the IP4 header immediately follows the media 
>> header.  The EFI network stack is designed to minimize the number of memory 
>> copies, so most upper-layer drivers operate on this buffer directly.
>> Thus we have two choices:
>> (1) If the *Buffer* passed to Snp.Receive() is 4-byte aligned, the packet 
>> data will start at a non-dword-aligned address.  Since most network protocols 
>> are designed with alignment in mind, the upper-layer protocol data items 
>> (IP, UDP, TCP) will also start at non-dword-aligned addresses.  I think 
>> parsing this data at unaligned addresses would also cause a performance issue.
>> (2) If we make the packet data aligned, the *Buffer* is unaligned, which 
>> causes the performance issue you described.  Fortunately this unaligned 
>> memory copy only happens once per packet (only in the SNP or UNDI driver).
>> I think that’s why the MNP driver tries to align a later part of the Ethernet 
>> packet.  I have tested PXE boot and TCP download on my side and do not see a 
>> clear difference between the two (maybe because my UNDI driver does not use 
>> DMA?).
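>>
>> The offsets work out like this:
>>
>>   //
>>   // Choice (1): frame start 4-byte aligned.
>>   //   offset  0: destination MAC  (6 bytes)
>>   //   offset  6: source MAC       (6 bytes)
>>   //   offset 12: EtherType        (2 bytes)
>>   //   offset 14: IP4 header       <-- 14 % 4 == 2, not dword aligned
>>   //
>>   // Choice (2): 2 bytes of padding before the frame (plus 4 more reserved
>>   // for a possible VLAN tag, hence PaddingSize = 6).  The IP4 header then
>>   // lands on a dword boundary, but the frame start itself is misaligned.
>>   //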
>>  
>>  
>> ARM platforms have to do DMA into uncached buffers. This is why it is so 
>> important to follow the EFI DMA rules.
>>  
>> Eugene, have you tried double-buffering the data into a cached buffer?  I 
>> wonder if you have a lot of small misaligned accesses to uncached memory, and 
>> whether a single copy to a cached buffer would be less overhead.  Or maybe 
>> you could enable caching on the buffer after the DMA completes?
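>>
>> Something like this is what I have in mind (a rough sketch; DeliverToStack 
>> is a stand-in for whatever consumes the frame):
>>
>>   //
>>   // One bulk, mutually aligned copy out of the uncached DMA buffer into a
>>   // cached bounce buffer; the later (possibly misaligned) parsing and
>>   // copying then hits cached memory instead of uncached memory.
>>   //
>>   CopyMem (CachedBounce, UncachedDmaBuffer, FrameLength);
>>   DeliverToStack (CachedBounce, FrameLength);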
>> 
>> 
>> Hope my explanation is helpful.
>>  
>> Fu, Siyuan
>> From: Cohen, Eugene [mailto:eug...@hp.com] 
>> Sent: Thursday, August 22, 2013 11:46 AM
>> To: edk2-devel@lists.sourceforge.net
>> Subject: Re: [edk2] MNP PaddingSize Question
>>  
>> Ruth,
>>  
>> The performance impact is related to unaligned copies to uncached buffers.  
>> So I suppose any machine that must make use of uncached buffers for DMA 
>> coherency would have the same slowdown, although I have not had a reason to 
>> measure this on other platforms.
>>  
>> The code seems strange since, for a normal driver (UNDI, SNP), the receive 
>> buffer address passed down is no longer 4-byte aligned.  Apparently this 
>> code is trying to align a later part of the Ethernet packet (the payload, 
>> not the header), but I can’t think of a reason for this.
>>  
>> Eugene
>>  
>> From: Li, Ruth [mailto:ruth...@intel.com] 
>> Sent: Wednesday, August 21, 2013 7:55 PM
>> To: edk2-devel@lists.sourceforge.net
>> Subject: Re: [edk2] MNP PaddingSize Question
>>  
>> Hi Eugene,
>>  
>> The piece of code below has been there for a long time.  We need some time 
>> to evaluate it and assess the possible impact.
>>  
>> By the way, do you see the performance impact only on your machine, or 
>> generally on all machines?
>>  
>> Thanks,
>> Ruth
>> From: Cohen, Eugene [mailto:eug...@hp.com] 
>> Sent: Tuesday, August 20, 2013 3:56 AM
>> To: edk2-devel@lists.sourceforge.net
>> Subject: [edk2] MNP PaddingSize Question
>>  
>> I’ve been tracking down a performance issue and have isolated it to this 
>> piece of MNP initialization code:
>>  
>>   //
>>   // Make sure the protocol headers immediately following the media header
>>   // are 4-byte aligned, and also preserve additional space for the VLAN tag
>>   //
>>   MnpDeviceData->PaddingSize = ((4 - SnpMode->MediaHeaderSize) & 0x3) + NET_VLAN_TAG_LEN;
>>  
>> On my system this comes up with ‘6’ (MediaHeaderSize = 0xE), which is 
>> causing performance issues since some of the memory copies to the resulting 
>> non-dword-aligned addresses are slower.  As an experiment I tried bumping 
>> this number to ‘8’ and things worked well.
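>>
>> For reference, with MediaHeaderSize = 14 (0xE) and NET_VLAN_TAG_LEN = 4, the 
>> computation (in unsigned 32-bit arithmetic) works out to:
>>
>>   ((4 - 14) & 0x3) + NET_VLAN_TAG_LEN  =  (0xFFFFFFF6 & 0x3) + 4  =  2 + 4  =  6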
>>  
>> This value is used later when NET_BUFs are being allocated:
>>  
>>     if (MnpDeviceData->PaddingSize > 0) {
>>       //
>>       // Pad padding bytes before the media header
>>       //
>>       NetbufAllocSpace (Nbuf, MnpDeviceData->PaddingSize, NET_BUF_TAIL);
>>       NetbufTrim (Nbuf, MnpDeviceData->PaddingSize, NET_BUF_HEAD);
>>     }
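>>
>> As far as I can tell, this pair of calls nets out to sliding the NET_BUF’s 
>> data window forward by PaddingSize bytes: NetbufAllocSpace() grows the used 
>> space at the tail and NetbufTrim() removes the same amount from the head, so 
>> the frame is later written starting PaddingSize bytes into an otherwise 
>> aligned allocation -- which is what puts the payload on a dword boundary and 
>> the frame start off of one.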
>>  
>> Can someone explain the purpose of PaddingSize and how it affects the later 
>> processing of packets?  Is this number a minimum value, and is it OK for it 
>> to be larger?
>>  
>> Thanks,
>>  
>> Eugene
>>  
> 
