On 2021/7/15 9:23, Lu Baolu wrote:
On 7/14/21 10:24 PM, Georgi Djakov wrote:
On 16.06.21 16:38, Georgi Djakov wrote:
When unmapping a buffer from an IOMMU domain, the IOMMU framework
unmaps the buffer at a granule of the largest page size that is
supported by the IOMMU hardware and fits within the buffer. For every
block that is unmapped, the IOMMU framework calls into the IOMMU
driver, and then into the io-pgtable framework, to walk the page
tables, find the entry that corresponds to the IOVA, and unmap it.
This can be suboptimal in scenarios where a buffer or a piece of a
buffer can be split into several contiguous page blocks of the same
size.
For example, consider an IOMMU that supports 4 KB, 2 MB and 1 GB page
blocks, and a 4 MB buffer being unmapped at IOVA 0. The current
call-flow results in 4 indirect calls and 2 page table walks to unmap
2 entries that sit next to each other in the page tables, when both
entries could have been unmapped in one shot by clearing both page
table entries in the same call.
The same optimization is applicable to mapping buffers as well, so
these patches add a set of callbacks, unmap_pages and map_pages, to
the io-pgtable code and the IOMMU drivers. The callbacks unmap or map
an IOVA range that consists of a number of pages of the same page
size supported by the IOMMU hardware, which allows multiple page
table entries to be manipulated with the same set of indirect calls.
The reason for introducing these callbacks is to give other IOMMU
drivers/io-pgtable formats time to switch over to them, so that the
transition to this approach can be done piecemeal.
Hi Will,
Did you get a chance to look at this patchset? Most patches are
already acked/reviewed, and everything still applies cleanly on rc1.
I also have the ops->[un]map_pages implementation for the Intel IOMMU
driver. I will post it once the iommu/core part gets applied.
I also implemented those callbacks for ARM SMMUv3 based on this
series, and used dma_map_benchmark to measure the map/unmap latency,
as shown below. I think it improves the map/unmap latency
considerably. I also plan to post the ARM SMMUv3 implementation after
this series is applied.
t = 1 (threads = 1):
                       before opt (us)   after opt (us)
g=1    (4K size)           0.1/1.3           0.1/0.8
g=2    (8K size)           0.2/1.5           0.2/0.9
g=4    (16K size)          0.3/1.9           0.1/1.1
g=8    (32K size)          0.5/2.7           0.2/1.4
g=16   (64K size)          1.0/4.5           0.2/2.0
g=32   (128K size)         1.8/7.9           0.2/3.3
g=64   (256K size)         3.7/14.8          0.4/6.1
g=128  (512K size)         7.1/14.7          0.5/10.4
g=256  (1M size)          14.0/53.9          0.8/19.3
g=512  (2M size)           0.2/0.9           0.2/0.9
g=1024 (4M size)           0.5/1.5           0.4/1.0
t = 10 (threads = 10):
                       before opt (us)   after opt (us)
g=1    (4K size)           0.3/7.0           0.1/5.8
g=2    (8K size)           0.4/6.7           0.3/6.0
g=4    (16K size)          0.5/6.3           0.3/5.6
g=8    (32K size)          0.5/8.3           0.2/6.3
g=16   (64K size)          1.0/17.3          0.3/12.4
g=32   (128K size)         1.8/36.0          0.2/24.2
g=64   (256K size)         4.3/67.2          1.2/46.4
g=128  (512K size)         7.8/93.7          1.3/94.2
g=256  (1M size)          14.7/280.8         1.8/191.5
g=512  (2M size)           3.6/3.2           1.5/2.5
g=1024 (4M size)           2.0/3.1           1.8/2.6
Best regards,
baolu
_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu