On 31 October 2018 at 05:38, Zeng, Star <[email protected]> wrote:
> Good feedback.
>
> On 2018/10/30 20:50, Leif Lindholm wrote:
>>
>> On Tue, Oct 30, 2018 at 09:39:24AM -0300, Ard Biesheuvel wrote:
>>>
>>> (add back the list)
>>
>>
>> Oi! Go back on holiday!
>>
>>> On 30 October 2018 at 09:07, Cohen, Eugene <[email protected]> wrote:
>>>>
>>>> Has this patch been tested on a system that does not have coherent DMA?
>>>>
>>>> It's not clear that this change would actually be faster on a system of
>>>> that type, since using common buffers implies access to uncached memory.
>>>> Depending on the access patterns, the uncached memory access could be
>>>> more time-consuming than cache maintenance operations.
>
>
> The change/idea was based on the statement below.
>   ///
>   /// Provides both read and write access to system memory by both the
>   /// processor and a bus master. The buffer is coherent from both the
>   /// processor's and the bus master's point of view.
>   ///
>   EfiPciIoOperationBusMasterCommonBuffer,
>
> Thanks for raising the case about uncached memory access. But after checking
> the code, for the Intel VTd case
> https://github.com/tianocore/edk2/blob/master/IntelSiliconPkg/Feature/VTd/IntelVTdDxe/BmDma.c#L460
> (or the no-IOMMU case
> https://github.com/tianocore/edk2/blob/master/MdeModulePkg/Bus/Pci/PciHostBridgeDxe/PciRootBridgeIo.c#L1567),
> the common buffer is just a normal memory buffer.
> If someone can help run some tests / collect some data on a system where
> using common buffers implies access to uncached memory, that would be great.
>
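For reference, the common-buffer path Star describes boils down to roughly
the following at the PCI I/O protocol level (a minimal sketch, not the
actual patch; error handling trimmed):

#include <Uefi.h>
#include <Protocol/PciIo.h>

EFI_STATUS
AllocateCommonBufferSketch (
  IN  EFI_PCI_IO_PROTOCOL   *PciIo,
  IN  UINTN                 Bytes,
  OUT VOID                  **HostAddress,
  OUT EFI_PHYSICAL_ADDRESS  *DeviceAddress,
  OUT VOID                  **Mapping
  )
{
  EFI_STATUS  Status;
  UINTN       NumberOfBytes;

  //
  // Allocate memory that is suitable for a BusMasterCommonBuffer mapping.
  //
  Status = PciIo->AllocateBuffer (
                    PciIo,
                    AllocateAnyPages,
                    EfiBootServicesData,
                    EFI_SIZE_TO_PAGES (Bytes),
                    HostAddress,
                    0
                    );
  if (EFI_ERROR (Status)) {
    return Status;
  }

  //
  // Map it once as a common buffer; CPU and device then share it for the
  // lifetime of the async transfer, so no per-poll Map/Unmap is required.
  //
  NumberOfBytes = Bytes;
  Status        = PciIo->Map (
                           PciIo,
                           EfiPciIoOperationBusMasterCommonBuffer,
                           *HostAddress,
                           &NumberOfBytes,
                           DeviceAddress,
                           Mapping
                           );
  return Status;
}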

OK, so first of all, can anyone explain to me under which circumstances
interrupt transfers are a bottleneck? I'd assume that anything
throughput-bound would use bulk endpoints.

Also, since the Map/Unmap calls are only costly when using an IOMMU,
could we simply revert to the old behavior if mIoMmu == NULL?
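
If we go that route, here is a minimal sketch of what I have in mind,
assuming the USB driver can simply look for the EDKII IOMMU protocol rather
than the host bridge's private mIoMmu pointer (the flag name below is made
up for illustration):

#include <Uefi.h>
#include <Library/UefiBootServicesTableLib.h>
#include <Protocol/IoMmu.h>

//
// Hypothetical driver-private flag: use a long-lived common buffer only
// when an IOMMU makes per-poll Map/Unmap expensive.
//
STATIC BOOLEAN  mUseCommonBuffer = FALSE;

VOID
DetectIoMmu (
  VOID
  )
{
  EFI_STATUS            Status;
  EDKII_IOMMU_PROTOCOL  *IoMmu;

  Status = gBS->LocateProtocol (
                  &gEdkiiIoMmuProtocolGuid,
                  NULL,
                  (VOID **)&IoMmu
                  );

  //
  // No IOMMU installed: keep the original directional BusMasterWrite
  // Map/Unmap behaviour, which is cheap and plays nicely with
  // non-coherent DMA.
  //
  mUseCommonBuffer = !EFI_ERROR (Status);
}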

>>>
>>> I haven't had time to look at these patches yet.
>>>
>>> I agree with Eugene's concern: the directional DMA routines are much
>>> more performant on implementations with non-coherent DMA, and so
>>> common buffers should be avoided unless we are dealing with data
>>> structures that are truly shared between the CPU and the device.
>>>
>>> Since this is obviously not the case here, could we please have some
>>> numbers about the performance improvement we are talking about here?
>>> Would it be possible to improve the IOMMU handling code instead?
>
>
> We collected the data below on a platform with a release image and Intel VTd
> enabled.
>
> The image size of EhciDxe or XhciDxe is reduced by about 120+ bytes.
>
> EHCI without the patch:
> ==[ Cumulative ]========
> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>    Name         Count     Duration    Duration    Duration    Duration
> -------------------------------------------------------------------------------
> S0000B00D1DF0        446        2150           4           2         963
>
> EHCI with the patch:
> ==[ Cumulative ]========
> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>    Name         Count     Duration    Duration    Duration    Duration
> -------------------------------------------------------------------------------
> S0000B00D1DF0        270         742           2           2          41
>
> XHCI without the patch:
> ==[ Cumulative ]========
> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>    Name         Count     Duration    Duration    Duration    Duration
> -------------------------------------------------------------------------------
> S0000B00D14F0        215         603           2           2          52
>
> XHCI with the patch:
> ==[ Cumulative ]========
> (Times in microsec.)     Cumulative   Average     Shortest    Longest
>    Name         Count     Duration    Duration    Duration    Duration
> -------------------------------------------------------------------------------
> S0000B00D14F0         95         294           3           2          52
>
> I believe the performance data really depends on:
> 1. How many AsyncInterruptTransfer handlers there are (the number of USB
> keyboards and/or USB Bluetooth keyboards?)
> 2. The data size (for flushing data from the PCI controller-specific address
> to the mapped system memory address *in the original code*)
> 3. The performance of IoMmu->SetAttribute (for example, the SetAttribute
> operation on the Intel VTd engine caused by the unmap and map used to flush
> data *in the original code*; on the Intel VTd engine, SetAttribute will
> involve FlushPageTableMemory, InvalidatePageEntry, etc.)
>
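For context, the per-poll flush in the original code that items 2 and 3
refer to is essentially an Unmap immediately followed by a re-Map of the
same buffer. Roughly (a simplified sketch modeled on EhciDxe's async
interrupt handling, not a verbatim copy):

#include <Uefi.h>
#include <Protocol/PciIo.h>

//
// Tearing down and recreating the BusMasterWrite mapping is what makes the
// device's writes visible to the CPU, and it is also what triggers
// IoMmu->SetAttribute on every poll.
//
EFI_STATUS
FlushAsyncIntMapSketch (
  IN     EFI_PCI_IO_PROTOCOL  *PciIo,
  IN     VOID                 *Data,      // CPU-side transfer buffer
  IN     UINTN                DataLen,
  IN OUT VOID                 **DataMap   // mapping handle from the last Map
  )
{
  EFI_STATUS            Status;
  UINTN                 Len;
  EFI_PHYSICAL_ADDRESS  PhyAddr;

  Status = PciIo->Unmap (PciIo, *DataMap);
  if (EFI_ERROR (Status)) {
    return Status;
  }

  Len    = DataLen;
  Status = PciIo->Map (
                    PciIo,
                    EfiPciIoOperationBusMasterWrite,
                    Data,
                    &Len,
                    &PhyAddr,
                    DataMap
                    );
  return Status;
}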

OK, so there is room for improvement here: there is no reason the
IOMMU driver couldn't cache mappings, or do some other optimizations
that would make mapping the same memory repeatedly less costly.
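
To make that concrete, something along these lines inside the IOMMU/BmDma
layer could avoid redoing the page-table work for a buffer that was mapped
a moment ago (purely illustrative, none of this exists in edk2 today, and
all names are made up):

#include <Uefi.h>

typedef struct {
  EFI_PHYSICAL_ADDRESS  HostAddress;
  UINTN                 NumberOfBytes;
  EFI_PHYSICAL_ADDRESS  DeviceAddress;
  BOOLEAN               Valid;
} DMA_MAP_CACHE_ENTRY;

#define DMA_MAP_CACHE_SIZE  8

STATIC DMA_MAP_CACHE_ENTRY  mMapCache[DMA_MAP_CACHE_SIZE];

//
// Return a previously established translation for this buffer, if any, so
// the caller can skip SetAttribute/FlushPageTableMemory/InvalidatePageEntry
// when the same memory is mapped again.
//
STATIC
BOOLEAN
LookupCachedMapping (
  IN  EFI_PHYSICAL_ADDRESS  HostAddress,
  IN  UINTN                 NumberOfBytes,
  OUT EFI_PHYSICAL_ADDRESS  *DeviceAddress
  )
{
  UINTN  Index;

  for (Index = 0; Index < DMA_MAP_CACHE_SIZE; Index++) {
    if (mMapCache[Index].Valid &&
        (mMapCache[Index].HostAddress == HostAddress) &&
        (mMapCache[Index].NumberOfBytes >= NumberOfBytes)) {
      *DeviceAddress = mMapCache[Index].DeviceAddress;
      return TRUE;
    }
  }

  return FALSE;
}

Entries would obviously need to be invalidated on FreeBuffer and at
ExitBootServices, but that is the general idea.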

>>
>> On an unrelated note to the concerns above:
>> Why has a fundamental change to the behaviour of one of the
>> industry-standard drivers been pushed at the very end of the stable cycle?
>
>
> We thought it was a simple improvement rather than a fundamental change
> before Eugene and Ard raised their concerns.
>
>
> Thanks,
> Star
>
>>
>> Regards,
>>
>> Leif
>>
>
