On 2018/11/6 22:37, Zeng, Star wrote:
On 2018/11/6 17:49, Ard Biesheuvel wrote:
On 31 October 2018 at 05:38, Zeng, Star <[email protected]> wrote:
Good feedback.

On 2018/10/30 20:50, Leif Lindholm wrote:

On Tue, Oct 30, 2018 at 09:39:24AM -0300, Ard Biesheuvel wrote:

(add back the list)


Oi! Go back on holiday!

On 30 October 2018 at 09:07, Cohen, Eugene <[email protected]> wrote:

Has this patch been tested on a system that does not have coherent DMA?

It's not clear that this change would actually be faster on a system of
that
type since using common buffers imply access to uncached memory.
Depending
on the access patterns the uncached memory access could be more time
consuming than cache maintenance operations.


The change/idea was based on the statement below.
  ///
  /// Provides both read and write access to system memory by both the processor and a
  /// bus master. The buffer is coherent from both the processor's and the bus master's
  /// point of view.
  ///
  EfiPciIoOperationBusMasterCommonBuffer,

Thanks for raising the case of uncached memory access. But after checking the
code, for the Intel VTd case
https://github.com/tianocore/edk2/blob/master/IntelSiliconPkg/Feature/VTd/IntelVTdDxe/BmDma.c#L460
(or the no-IOMMU case
https://github.com/tianocore/edk2/blob/master/MdeModulePkg/Bus/Pci/PciHostBridgeDxe/PciRootBridgeIo.c#L1567),
the common buffer is just a normal memory buffer.
If someone could help run some tests and collect data on a system where using
common buffers implies access to uncached memory, that would be great.
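To illustrate the point being made about the no-IOMMU path: Map() there essentially hands back the host address unchanged, so a common buffer is ordinary cached memory. A simplified, self-contained model (hypothetical names, not the actual PciRootBridgeIo code):

```c
#include <assert.h>
#include <stdint.h>

/* Simplified model of Map() when no IOMMU is installed: the
 * "device address" is simply the host physical address, so a
 * common buffer is just a normal, cached memory buffer.
 * (The real code only bounce-buffers when the address exceeds
 * the controller's DMA reach; modeled here as a 4 GB limit.) */
static uint64_t MapNoIommu (uint64_t HostAddress)
{
  assert (HostAddress < 0x100000000ULL);
  return HostAddress;  /* identity mapping: no extra cost */
}
```

Under this model there is nothing uncached about the common buffer on such platforms, which is why the concern only bites on systems with non-coherent DMA.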


OK, so first of all, can anyone explain to me under which
circumstances interrupt transfers are a bottleneck? I'd assume that
anything throughput-bound would use bulk endpoints.

Also, since the Map/Unmap calls are only costly when using an IOMMU,
could we simply revert to the old behavior if mIoMmu == NULL?
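One way to sketch that fallback, purely as an illustration (the names below are hypothetical, not the actual EhciDxe/XhciDxe code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the platform IOMMU protocol pointer;
 * NULL means no IOMMU is installed. */
typedef struct { int Dummy; } IOMMU_PROTOCOL;

/* Use a persistent common buffer only when an IOMMU makes
 * per-transfer Map/Unmap expensive; otherwise keep the old
 * directional-DMA behavior, which is cheap without an IOMMU
 * and safer on non-coherent systems. */
static bool
UseCommonBuffer (const IOMMU_PROTOCOL *IoMmu)
{
  return IoMmu != NULL;
}
```

The driver would branch on this once in its async-transfer setup path, keeping both code paths alive rather than replacing one with the other.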


I haven't had time to look at these patches yet.

I agree with Eugene's concern: the directional DMA routines are much
more performant on implementations with non-coherent DMA, and so
common buffers should be avoided unless we are dealing with data
structures that are truly shared between the CPU and the device.

Since this is obviously not the case here, could we please have some
numbers about the performance improvement we are talking about here?
Would it be possible to improve the IOMMU handling code instead?


We collected the data below on a platform with release image and Intel VTd
enabled.

The image size of EhciDxe and XhciDxe each shrinks by about 120 bytes.

EHCI without the patch:
==[ Cumulative ]========
(Times in microsec.)     Cumulative   Average     Shortest    Longest
    Name         Count     Duration    Duration    Duration    Duration
-------------------------------------------------------------------------------
S0000B00D1DF0        446        2150           4           2         963

EHCI with the patch:
==[ Cumulative ]========
(Times in microsec.)     Cumulative   Average     Shortest    Longest
    Name         Count     Duration    Duration    Duration    Duration
-------------------------------------------------------------------------------
S0000B00D1DF0        270         742           2           2          41

XHCI without the patch:
==[ Cumulative ]========
(Times in microsec.)     Cumulative   Average     Shortest    Longest
    Name         Count     Duration    Duration    Duration    Duration
-------------------------------------------------------------------------------
S0000B00D14F0        215         603           2           2          52

XHCI with the patch:
==[ Cumulative ]========
(Times in microsec.)     Cumulative   Average     Shortest    Longest
    Name         Count     Duration    Duration    Duration    Duration
-------------------------------------------------------------------------------
S0000B00D14F0         95         294           3           2          52

I believe the performance data really depends on:
1. How many AsyncInterruptTransfer handlers are registered (i.e., the number
of USB keyboards and/or USB Bluetooth keyboards).
2. The data size (for flushing data from the PCI-controller-specific address
to the mapped system memory address *in the original code*).
3. The performance of IoMmu->SetAttribute (for example, the SetAttribute
operations on the Intel VTd engine caused by the Unmap and Map for flushing
data *in the original code*; SetAttribute on the Intel VTd engine involves
FlushPageTableMemory, InvalidatePageEntry, etc.).


OK, so there is room for improvement here: there is no reason the
IOMMU driver couldn't cache mappings, or do some other optimizations
that would make mapping the same memory repeatedly less costly.

With an IOMMU, Unmap/Map translate into SetAttribute calls that disallow/allow DMA memory access. It is hard for the IOMMU driver to predict the sequence of Unmap/Map operations. Do you have more detail about the optimizations you have in mind?
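One possible shape for the optimization Ard suggests, purely as a sketch with hypothetical names: a small cache inside the IOMMU driver keyed by (host address, length), so a repeated Map of the same buffer reuses the existing translation instead of rewriting page tables every time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_SLOTS 8

typedef struct {
  uint64_t Host;    /* host address of the mapped buffer */
  uint64_t Length;  /* mapping length in bytes           */
  int      InUse;   /* slot currently holds a mapping    */
} MAPPING_SLOT;

static MAPPING_SLOT mCache[CACHE_SLOTS];
static int mPageTableWrites;  /* counts the expensive path */

/* Return a cached mapping if one exists; otherwise simulate the
 * costly SetAttribute/page-table update and remember the result. */
static uint64_t
CachedMap (uint64_t Host, uint64_t Length)
{
  for (int i = 0; i < CACHE_SLOTS; i++) {
    if (mCache[i].InUse && mCache[i].Host == Host && mCache[i].Length == Length) {
      return Host;  /* hit: no page-table work needed */
    }
  }
  for (int i = 0; i < CACHE_SLOTS; i++) {
    if (!mCache[i].InUse) {
      mCache[i] = (MAPPING_SLOT){ Host, Length, 1 };
      break;
    }
  }
  mPageTableWrites++;  /* miss: pay the FlushPageTableMemory-style cost once */
  return Host;
}
```

Note this trades protection for speed: the DMA window stays open between transfers instead of being closed on every Unmap, which is exactly the tension Star raises above, so it would need a deliberate policy decision rather than being a pure optimization.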

Could you try the patch on your platform for the case you and Eugene mentioned?

Anyway, I am going to revert the patches (3/4 and 4/4; 1/4 and 2/4 have no functional impact), since the timing is a little sensitive as we are near edk2-stable201811.

I have reverted patches 3/4 and 4/4 at https://github.com/tianocore/edk2/compare/1ed6498...d98fc9a, and we can continue the discussion.



Thanks,
Star



On an unrelated note to the concerns above:
Why has a fundamental change to the behaviour of one of the industry
standard drivers been pushed at the very end of the stable cycle?


We thought it was a simple improvement rather than a fundamental change,
until Eugene and Ard raised their concerns.


Thanks,
Star


Regards,

Leif



_______________________________________________
edk2-devel mailing list
[email protected]
https://lists.01.org/mailman/listinfo/edk2-devel
