Hi Russell & All,
In many DMA streaming map/unmap use cases, lower-layer device drivers
have no idea how and when single/sg buffers are allocated and freed by
the upper-layer filesystem, network protocol, mm subsystem etc. So the
only thing device drivers can do is map the buffer right before each
DMA begins and unmap it as soon as the DMA is done.
This dramatically increases the latency of dma_map_single/sg and
dma_unmap_single/sg when these APIs are backed by an IOMMU. For each
map, the IOMMU driver needs to allocate an IOVA and create the mapping
in the IOMMU; for each unmap, it needs to free the IOVA and tear the
mapping down in the IOMMU hardware. When devices performing DMA are
super-fast, for example on 100GbE networks, the DMA streaming map/unmap
latency can become a critical system bottleneck.
In comparison to the DMA streaming APIs, the DMA consistent APIs with
an IOMMU backend can show much better performance, as the map is done
when the buffer is allocated and the unmap is done when the buffer is
freed. DMA can be done many times before the buffer is freed by
dma_free_coherent(), so there is no per-transfer map/unmap overhead as
with the streaming APIs. The typical workflow is like
dma_alloc_coherent->
doing DMA ->
doing DMA ->
doing DMA ->
.... /* DMA many times */
dma_free_coherent
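In driver code, that pattern looks roughly like the sketch below;
start_hw_dma() and the transfer count are placeholders for real driver
state, not existing APIs:

#include <linux/dma-mapping.h>

/* minimal sketch of the consistent-API pattern: map once at allocation,
 * reuse the mapping for many transfers, unmap at free */
static int coherent_dma_example(struct device *dev, int nr_transfers)
{
        dma_addr_t dma;
        void *buf;
        int i;

        buf = dma_alloc_coherent(dev, PAGE_SIZE, &dma, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;

        for (i = 0; i < nr_transfers; i++)
                start_hw_dma(dev, dma, PAGE_SIZE); /* hypothetical hw op */

        dma_free_coherent(dev, PAGE_SIZE, buf, dma);
        return 0;
}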
In contrast, the typical workflow for streaming DMA is like
dma_map_sg -> doing DMA -> dma_unmap_sg ->
dma_map_sg -> doing DMA -> dma_unmap_sg ->
dma_map_sg -> doing DMA -> dma_unmap_sg ->
.... /* map, DMA transfer, unmap many times */
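The same loop with the streaming APIs pays the map/unmap cost on every
iteration; a sketch under the same assumptions (start_hw_dma_sg() is a
placeholder, and <linux/scatterlist.h> is needed for struct scatterlist):

/* sketch of the streaming pattern: map and unmap around every transfer */
static int streaming_dma_example(struct device *dev, struct scatterlist *sgl,
                                 int nents, int nr_transfers)
{
        int i, mapped;

        for (i = 0; i < nr_transfers; i++) {
                mapped = dma_map_sg(dev, sgl, nents, DMA_TO_DEVICE);
                if (!mapped)
                        return -ENOMEM;
                start_hw_dma_sg(dev, sgl, mapped); /* hypothetical hw op */
                dma_unmap_sg(dev, sgl, nents, DMA_TO_DEVICE);
        }
        return 0;
}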
Even though upper-layer software might use the same buffers many times,
lower-layer drivers still have to map and unmap them for every single
DMA transfer, since they don't know the buffers are being reused.
A possible workflow that improves the performance of the streaming APIs
is like:
dma_map_sg ->
dma_sync_sg_for_device -> doing DMA -> dma_sync_sg_for_cpu ->
dma_sync_sg_for_device -> doing DMA -> dma_sync_sg_for_cpu ->
.... /* sync between DMA and CPU many times */
dma_unmap_sg
For every single DMA transfer, software then only needs to do sync
operations, which are much lighter than map and unmap. But this pattern
is often not applicable to device drivers, as the buffers usually come
from the upper-layer filesystem, network protocol, mm subsystem etc.
Device drivers have to work under the assumption that the buffer will
be freed immediately after the DMA is done.
However, those device drivers which are able to allocate and free the
DMA streaming buffers by themselves can benefit from reusing the same
buffers, doing DMA many times without per-transfer map/unmap overhead.
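For such drivers, the map-once/sync-many pattern would look roughly
like this (start_hw_dma_sg() is again a placeholder):

/* sketch of map-once/sync-many for drivers that own their buffers */
static int reuse_mapping_example(struct device *dev, struct scatterlist *sgl,
                                 int nents, int nr_transfers)
{
        int i, mapped;

        mapped = dma_map_sg(dev, sgl, nents, DMA_BIDIRECTIONAL);
        if (!mapped)
                return -ENOMEM;

        for (i = 0; i < nr_transfers; i++) {
                dma_sync_sg_for_device(dev, sgl, nents, DMA_BIDIRECTIONAL);
                start_hw_dma_sg(dev, sgl, mapped); /* hypothetical hw op */
                dma_sync_sg_for_cpu(dev, sgl, nents, DMA_BIDIRECTIONAL);
        }

        dma_unmap_sg(dev, sgl, nents, DMA_BIDIRECTIONAL);
        return 0;
}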
I collected some latency data for iommu_dma_map_sg and
iommu_dma_unmap_sg. In the test case, zswap calls the acomp APIs to
compress/decompress pages, and the compression/decompression is done by
a lower-level hardware ZIP driver.
root@ubuntu:/usr/share/bcc/tools# ./funclatency iommu_dma_map_sg
Tracing 1 functions for "iommu_dma_map_sg"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 2274570 |*********************** |
2048 -> 4095 : 3896310 |****************************************|
4096 -> 8191 : 74499 | |
8192 -> 16383 : 4475 | |
16384 -> 32767 : 1519 | |
32768 -> 65535 : 480 | |
65536 -> 131071 : 286 | |
131072 -> 262143 : 18 | |
262144 -> 524287 : 2 | |
root@ubuntu:/usr/share/bcc/tools# ./funclatency iommu_dma_unmap_sg
Tracing 1 functions for "iommu_dma_unmap_sg"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 56083 | |
4096 -> 8191 : 5232036 |****************************************|
8192 -> 16383 : 7723 | |
16384 -> 32767 : 1277 | |
32768 -> 65535 : 32 | |
65536 -> 131071 : 12 | |
131072 -> 262143 : 41 | |
In contrast, if we set iommu passthrough, the latency is much lower:
root@ubuntu:/usr/share/bcc/tools# ./funclatency dma_direct_map_sg
Tracing 1 functions for "dma_direct_map_sg"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 10798 | |
1024 -> 2047 : 1435035 |****************************************|
2048 -> 4095 : 13879 | |
4096 -> 8191 : 485 | |
8192 -> 16383 : 791 | |
16384 -> 32767 : 418 | |
32768 -> 65535 : 55 | |
65536 -> 131071 : 67 | |
131072 -> 262143 : 8 | |
root@ubuntu:/usr/share/bcc/tools# ./funclatency dma_direct_unmap_sg
Tracing 1 functions for "dma_direct_unmap_sg"... Hit Ctrl-C to end.
^C
nsecs : count distribution
0 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 216 | |
1024 -> 2047 : 250849 |****************************************|
2048 -> 4095 : 54341 |******** |
4096 -> 8191 : 80 | |
8192 -> 16383 : 191 | |
16384 -> 32767 : 65 | |
In summary, the comparison is as below:
(1) map
    iommu passthrough:     mainly 1-2us
    iommu non-passthrough: mainly 2-4us
(2) unmap
    iommu passthrough:     mainly 1-2us
    iommu non-passthrough: mainly 4-8us
Below is a long function trace for one dma_map_sg/dma_unmap_sg pair
while the iommu is enabled:
507.520069 | 53) | iommu_dma_map_sg() {
507.520070 | 53) 0.670 us | iommu_get_dma_domain();
507.520071 | 53) 0.610 us | iommu_dma_deferred_attach();
507.520072 | 53) | iommu_dma_alloc_iova.isra.26() {
507.520073 | 53) | alloc_iova_fast() {
507.520074 | 53) | _raw_spin_lock_irqsave() {
507.520074 | 53) 0.570 us | preempt_count_add();
507.520076 | 53) 2.060 us | }
507.520077 | 53) | _raw_spin_unlock_irqrestore() {
507.520077 | 53) 0.790 us | preempt_count_sub();
507.520079 | 53) 2.090 us | }
507.520079 | 53) 6.260 us | }
507.520080 | 53) 7.470 us | }
507.520081 | 53) | iommu_map_sg_atomic() {
507.520081 | 53) | __iommu_map_sg() {
507.520082 | 53) | __iommu_map() {
507.520082 | 53) 0.630 us | iommu_pgsize.isra.14();
507.520084 | 53) | arm_smmu_map() {
507.520084 | 53) | arm_lpae_map() {
507.520085 | 53) | __arm_lpae_map() {
507.520086 | 53) | __arm_lpae_map() {
507.520086 | 53) | __arm_lpae_map() {
507.520087 | 53) 0.930 us | __arm_lpae_map();
507.520089 | 53) 2.170 us | }
507.520089 | 53) 3.490 us | }
507.520090 | 53) 4.730 us | }
507.520090 | 53) 5.980 us | }
507.520091 | 53) 7.250 us | }
507.520092 | 53) 0.650 us | iommu_pgsize.isra.14();
507.520093 | 53) | arm_smmu_map() {
507.520093 | 53) | arm_lpae_map() {
507.520094 | 53) | __arm_lpae_map() {
507.520095 | 53) | __arm_lpae_map() {
507.520096 | 53) | __arm_lpae_map() {
507.520096 | 53) 0.630 us | __arm_lpae_map();
507.520098 | 53) 1.860 us | }
507.520098 | 53) 3.210 us | }
507.520099 | 53) 4.610 us | }
507.520099 | 53) 5.860 us | }
507.520100 | 53) 7.110 us | }
507.520101 | 53) + 18.740 us | }
507.520101 | 53) + 20.080 us | }
507.520102 | 53) + 21.320 us | }
507.520102 | 53) + 33.200 us | }
783.039976 | 48) | iommu_dma_unmap_sg() {
783.039977 | 48) | __iommu_dma_unmap() {
783.039978 | 48) 0.720 us | iommu_get_dma_domain();
783.039979 | 48) | iommu_unmap_fast() {
783.039980 | 48) | __iommu_unmap() {
783.039981 | 48) 0.740 us | iommu_pgsize.isra.14();
783.039982 | 48) | arm_smmu_unmap() {
783.039983 | 48) | arm_lpae_unmap() {
783.039984 | 48) | __arm_lpae_unmap() {
783.039985 | 48) | __arm_lpae_unmap() {
783.039985 | 48) | __arm_lpae_unmap() {
783.039986 | 48) | __arm_lpae_unmap() {
783.039988 | 48) 0.730 us | arm_smmu_tlb_inv_page_nosync();
783.039989 | 48) 3.010 us | }
783.039990 | 48) 4.490 us | }
783.039991 | 48) 5.950 us | }
783.039991 | 48) 7.460 us | }
783.039992 | 48) 8.920 us | }
783.039993 | 48) + 10.380 us | }
783.039993 | 48) + 13.350 us | }
783.039994 | 48) + 14.820 us | }
783.039995 | 48) | arm_smmu_iotlb_sync() {
783.039996 | 48) | arm_smmu_tlb_inv_range() {
783.039996 | 48) | arm_smmu_cmdq_batch_add() {
783.039997 | 48) 0.760 us | arm_smmu_cmdq_build_cmd();
783.039999 | 48) 2.220 us | }
783.039999 | 48) | arm_smmu_cmdq_issue_cmdlist() {
783.040000 | 48) 0.530 us | arm_smmu_cmdq_build_cmd();
783.040001 | 48) 0.530 us | __arm_smmu_cmdq_poll_set_valid_map.isra.40();
783.040002 | 48) 0.540 us | __arm_smmu_cmdq_poll_set_valid_map.isra.40();
783.040004 | 48) | ktime_get() {
783.040004 | 48) 0.540 us | arch_counter_read();
783.040005 | 48) 1.570 us | }
783.040006 | 48) 6.880 us | }
783.040007 | 48) 0.830 us | arm_smmu_atc_inv_domain.constprop.48();
783.040008 | 48) + 12.910 us | }
783.040009 | 48) + 14.370 us | }
783.040010 | 48) | iommu_dma_free_iova() {
783.040011 | 48) | free_iova_fast() {
783.040011 | 48) | _raw_spin_lock_irqsave() {
783.040012 | 48) 0.600 us | preempt_count_add();
783.040013 | 48) 2.000 us | }
783.040014 | 48) | _raw_spin_unlock_irqrestore() {
783.040015 | 48) 0.820 us | preempt_count_sub();
783.040016 | 48) 2.220 us | }
783.040018 | 48) 6.200 us | }
783.040019 | 48) 8.880 us | }
783.040020 | 48) + 42.540 us | }
783.040020 | 48) + 44.030 us | }
I am thinking about several possible ways to decrease or remove the
map/unmap latency for every single DMA transfer. Meanwhile, as
"non-strict" is an existing option with possible safety issues, I won't
discuss it in this mail.
1. provide coherent bounce buffers for streaming buffers
As coherent buffers stay mapped for their whole lifetime, this removes
the map and unmap overhead for each single DMA operation. However, this
solution requires a memory copy between the streaming buffers and the
bounce buffers, so it only wins if the copy is faster than map/unmap,
and it consumes much more memory bandwidth.
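A rough sketch of what I mean; all names here are placeholders, and in
a real driver the bounce buffer would be allocated once at probe time
and kept for many transfers:

/* sketch of option 1: a long-lived coherent bounce buffer replaces
 * per-transfer map/unmap with a CPU copy */
static int bounce_tx_example(struct device *dev, void *bounce,
                             dma_addr_t bounce_dma, void *stream_buf,
                             size_t len)
{
        /* bounce/bounce_dma come from a one-off dma_alloc_coherent() */
        memcpy(bounce, stream_buf, len);    /* copy replaces dma_map_sg() */
        start_hw_dma(dev, bounce_dma, len); /* hypothetical hw op */
        /* for device-to-memory, memcpy() back after the DMA completes */
        return 0;
}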
2. make upper-layer kernel components aware of the pain of iommu map/unmap
Upper-layer fs, mm and networking code could somehow let lower-layer
drivers know the end of the life cycle of sg buffers. In the zswap
case, I have seen zswap always use the same two pages as the
destination buffers to save compressed pages, but the compressor driver
still has to map and unmap those same two pages for every single
compression, since zswap and the zip driver work in two completely
different software layers.
I am thinking of something like the below, where upper-layer kernel
code calls:
sg_init_table(&sg...);
sg_mark_reusable(&sg....);
.... /* use the buffer many times */
....
sg_mark_stop_reuse(&sg);
After that, if a low-level driver sees the "reusable" flag, it knows
the buffer can be used multiple times and will not map/unmap it on
every transfer; the flag means upper-layer components will keep using
the buffer and will probably hand the same buffer to lower-layer
drivers for new DMA transfers later. When upper-layer code sets
"stop_reuse", the lower-layer driver unmaps the sg buffers, possibly
via an unmap callback provided to the upper-layer components. In the
zswap case, I have seen the same buffers re-used again and again while
the zip driver maps and unmaps them every time; shortly after a buffer
is unmapped, it is mapped again for the next transfer, almost without
any time gap between unmap and map. If zswap could set the "reusable"
flag, the zip driver would save a lot of time.
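A hypothetical driver-side check could look like the below;
sg_is_reusable() and the drv->mapped/drv->nents cache are assumptions
mirroring the proposed API, not existing kernel code:

/* hypothetical fast path for "reusable" sg buffers */
if (sg_is_reusable(sgl)) {
        if (!drv->mapped) {
                drv->nents = dma_map_sg(dev, sgl, n, DMA_BIDIRECTIONAL);
                drv->mapped = true; /* unmapped later via the callback */
        }
        dma_sync_sg_for_device(dev, sgl, n, DMA_BIDIRECTIONAL);
        start_hw_dma_sg(dev, sgl, drv->nents);  /* hypothetical hw op */
        dma_sync_sg_for_cpu(dev, sgl, n, DMA_BIDIRECTIONAL);
} else {
        /* legacy path: per-transfer map and unmap, unchanged */
        int nents = dma_map_sg(dev, sgl, n, DMA_BIDIRECTIONAL);

        start_hw_dma_sg(dev, sgl, nents);       /* hypothetical hw op */
        dma_unmap_sg(dev, sgl, n, DMA_BIDIRECTIONAL);
}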
Meanwhile, for buffer safety, lower-layer drivers need to make certain
the buffers have already been unmapped in the iommu before those
buffers go back to the buddy allocator for other users.
I don't think making upper-layer components aware of the overhead of
map and unmap is elegant, but it might be something worth doing for
performance reasons. Upper-layer software that is friendly to
lower-layer drivers could call sg_mark_reusable(&sg....). It is not
enforced: if upper-layer components don't call the API, current
lower-level drivers are not affected.
Please kindly give your comments on this proposal and suggest any other
possible ways to improve the performance of the DMA streaming APIs with
an IOMMU backend. I am glad to send a draft patch for "reusable"
buffers if you think the idea is not bad.
Best Regards
Barry