Hi Russell & All,

In many DMA streaming map/unmap use cases, lower-layer device drivers have no 
idea how and when single/sg buffers are allocated and freed by the upper-layer 
filesystem, network protocol, mm subsystem etc. So the only thing device 
drivers can do is map the buffer right before each DMA begins and unmap it as 
soon as the DMA is done.

This dramatically increases the latency of dma_map_single/sg and 
dma_unmap_single/sg when these APIs are backed by an IOMMU. For each map, the 
IOMMU driver needs to allocate an IOVA and create the mapping in the IOMMU; 
for each unmap, it needs to free the IOVA and tear the mapping down in the 
IOMMU hardware. When the devices performing DMA are very fast, for example on 
100GbE networks, this streaming map/unmap latency can become a critical system 
bottleneck.

In comparison to the streaming DMA APIs, the coherent DMA APIs with an IOMMU 
backend can show much better performance, as the map is done when the buffer 
is allocated and the unmap when the buffer is freed. DMA can be done many 
times before the buffer is freed by dma_free_coherent(), so there is no 
per-transfer map/unmap overhead as with the streaming APIs. The typical 
workflow is like
dma_alloc_coherent-> 
doing DMA -> 
doing DMA ->
doing DMA ->
.... /* DMA many times */
dma_free_coherent

However, the typical workflow for streaming DMA is like
dma_map_sg -> doing DMA -> dma_unmap_sg -> 
dma_map_sg -> doing DMA -> dma_unmap_sg ->  
dma_map_sg -> doing DMA -> dma_unmap_sg ->  
.... /* map, DMA transfer, unmap many times */

Even though upper-layer software might use the same buffers multiple times, 
the lower-layer drivers don't know that, so for each single DMA transfer they 
still have to do the map and unmap.

A possible routine to improve the performance of the streaming APIs is like:
dma_map_sg -> 
dma_sync_sg_for_device -> doing DMA -> dma_sync_sg_for_cpu -> 
dma_sync_sg_for_device -> doing DMA -> dma_sync_sg_for_cpu -> 
dma_sync_sg_for_device -> doing DMA -> dma_sync_sg_for_cpu -> 
... ->    /* sync between device and CPU many times */
dma_unmap_sg

For every single DMA transfer, software then only needs to do sync operations, 
which are much lighter than map and unmap. But this pattern is often not 
applicable, as the buffers usually come from the upper-layer filesystem, 
network protocol, mm subsystem etc., and device drivers have to work under the 
assumption that the buffer will be freed immediately after the DMA is done. 
However, those device drivers which are able to allocate and free their DMA 
streaming buffers themselves can benefit from reusing the same buffers for 
multiple DMA transfers without the map/unmap overhead.

I collected some latency data for iommu_dma_map_sg and iommu_dma_unmap_sg. In 
the test case, zswap calls the acomp APIs to compress/decompress pages, and 
the comp/decomp is done by a lower-level hardware ZIP driver.
root@ubuntu:/usr/share/bcc/tools# ./funclatency iommu_dma_map_sg
Tracing 1 functions for "iommu_dma_map_sg"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 2274570  |***********************                 |
      2048 -> 4095       : 3896310  |****************************************|
      4096 -> 8191       : 74499    |                                        |
      8192 -> 16383      : 4475     |                                        |
     16384 -> 32767      : 1519     |                                        |
     32768 -> 65535      : 480      |                                        |
     65536 -> 131071     : 286      |                                        |
    131072 -> 262143     : 18       |                                        |
    262144 -> 524287     : 2        |                                        |

root@ubuntu:/usr/share/bcc/tools# ./funclatency iommu_dma_unmap_sg
Tracing 1 functions for "iommu_dma_unmap_sg"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 56083    |                                        |
      4096 -> 8191       : 5232036  |****************************************|
      8192 -> 16383      : 7723     |                                        |
     16384 -> 32767      : 1277     |                                        |
     32768 -> 65535      : 32       |                                        |
     65536 -> 131071     : 12       |                                        |
    131072 -> 262143     : 41       |                                        |

In contrast, if we set iommu passthrough, the latency will be much better:

root@ubuntu:/usr/share/bcc/tools# ./funclatency dma_direct_map_sg
Tracing 1 functions for "dma_direct_map_sg"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 10798    |                                        |
      1024 -> 2047       : 1435035  |****************************************|
      2048 -> 4095       : 13879    |                                        |
      4096 -> 8191       : 485      |                                        |
      8192 -> 16383      : 791      |                                        |
     16384 -> 32767      : 418      |                                        |
     32768 -> 65535      : 55       |                                        |
     65536 -> 131071     : 67       |                                        |
    131072 -> 262143     : 8        |                                        |

root@ubuntu:/usr/share/bcc/tools# ./funclatency dma_direct_unmap_sg
Tracing 1 functions for "dma_direct_unmap_sg"... Hit Ctrl-C to end.
^C
     nsecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 216      |                                        |
      1024 -> 2047       : 250849   |****************************************|
      2048 -> 4095       : 54341    |********                                |
      4096 -> 8191       : 80       |                                        |
      8192 -> 16383      : 191      |                                        |
     16384 -> 32767      : 65       |                                        |

In summary, the comparison is as below:
(1) map
iommu passthrough         mainly 1-2us
iommu non-passthrough     mainly 2-4us

(2) unmap
iommu passthrough         mainly 1-2us
iommu non-passthrough     mainly 4-8us

Below is a function graph trace for a single dma_map/unmap_sg pair while the 
IOMMU is enabled:

  507.520069 |   53)               |  iommu_dma_map_sg() {
  507.520070 |   53)   0.670 us    |    iommu_get_dma_domain();
  507.520071 |   53)   0.610 us    |    iommu_dma_deferred_attach();
  507.520072 |   53)               |    iommu_dma_alloc_iova.isra.26() {
  507.520073 |   53)               |      alloc_iova_fast() {
  507.520074 |   53)               |        _raw_spin_lock_irqsave() {
  507.520074 |   53)   0.570 us    |          preempt_count_add();
  507.520076 |   53)   2.060 us    |        }
  507.520077 |   53)               |        _raw_spin_unlock_irqrestore() {
  507.520077 |   53)   0.790 us    |          preempt_count_sub();
  507.520079 |   53)   2.090 us    |        }
  507.520079 |   53)   6.260 us    |      }
  507.520080 |   53)   7.470 us    |    }
  507.520081 |   53)               |    iommu_map_sg_atomic() {
  507.520081 |   53)               |      __iommu_map_sg() {
  507.520082 |   53)               |        __iommu_map() {
  507.520082 |   53)   0.630 us    |          iommu_pgsize.isra.14();
  507.520084 |   53)               |          arm_smmu_map() {
  507.520084 |   53)               |            arm_lpae_map() {
  507.520085 |   53)               |              __arm_lpae_map() {
  507.520086 |   53)               |                __arm_lpae_map() {
  507.520086 |   53)               |                  __arm_lpae_map() {
  507.520087 |   53)   0.930 us    |                    __arm_lpae_map();
  507.520089 |   53)   2.170 us    |                  }
  507.520089 |   53)   3.490 us    |                }
  507.520090 |   53)   4.730 us    |              }
  507.520090 |   53)   5.980 us    |            }
  507.520091 |   53)   7.250 us    |          }
  507.520092 |   53)   0.650 us    |          iommu_pgsize.isra.14();
  507.520093 |   53)               |          arm_smmu_map() {
  507.520093 |   53)               |            arm_lpae_map() {
  507.520094 |   53)               |              __arm_lpae_map() {
  507.520095 |   53)               |                __arm_lpae_map() {
  507.520096 |   53)               |                  __arm_lpae_map() {
  507.520096 |   53)   0.630 us    |                    __arm_lpae_map();
  507.520098 |   53)   1.860 us    |                  }
  507.520098 |   53)   3.210 us    |                }
  507.520099 |   53)   4.610 us    |              }
  507.520099 |   53)   5.860 us    |            }
  507.520100 |   53)   7.110 us    |          }
  507.520101 |   53) + 18.740 us   |        }
  507.520101 |   53) + 20.080 us   |      }
  507.520102 |   53) + 21.320 us   |    }
  507.520102 |   53) + 33.200 us   |  }

  783.039976 |   48)               |  iommu_dma_unmap_sg() {
  783.039977 |   48)               |    __iommu_dma_unmap() {
  783.039978 |   48)   0.720 us    |      iommu_get_dma_domain();
  783.039979 |   48)               |      iommu_unmap_fast() {
  783.039980 |   48)               |        __iommu_unmap() {
  783.039981 |   48)   0.740 us    |          iommu_pgsize.isra.14();
  783.039982 |   48)               |          arm_smmu_unmap() {
  783.039983 |   48)               |            arm_lpae_unmap() {
  783.039984 |   48)               |              __arm_lpae_unmap() {
  783.039985 |   48)               |                __arm_lpae_unmap() {
  783.039985 |   48)               |                  __arm_lpae_unmap() {
  783.039986 |   48)               |                    __arm_lpae_unmap() {
  783.039988 |   48)   0.730 us    |                      arm_smmu_tlb_inv_page_nosync();
  783.039989 |   48)   3.010 us    |                    }
  783.039990 |   48)   4.490 us    |                  }
  783.039991 |   48)   5.950 us    |                }
  783.039991 |   48)   7.460 us    |              }
  783.039992 |   48)   8.920 us    |            }
  783.039993 |   48) + 10.380 us   |          }
  783.039993 |   48) + 13.350 us   |        }
  783.039994 |   48) + 14.820 us   |      }
  783.039995 |   48)               |      arm_smmu_iotlb_sync() {
  783.039996 |   48)               |        arm_smmu_tlb_inv_range() {
  783.039996 |   48)               |          arm_smmu_cmdq_batch_add() {
  783.039997 |   48)   0.760 us    |            arm_smmu_cmdq_build_cmd();
  783.039999 |   48)   2.220 us    |          }
  783.039999 |   48)               |          arm_smmu_cmdq_issue_cmdlist() {
  783.040000 |   48)   0.530 us    |            arm_smmu_cmdq_build_cmd();
  783.040001 |   48)   0.530 us    |            __arm_smmu_cmdq_poll_set_valid_map.isra.40();
  783.040002 |   48)   0.540 us    |            __arm_smmu_cmdq_poll_set_valid_map.isra.40();
  783.040004 |   48)               |            ktime_get() {
  783.040004 |   48)   0.540 us    |              arch_counter_read();
  783.040005 |   48)   1.570 us    |            }
  783.040006 |   48)   6.880 us    |          }
  783.040007 |   48)   0.830 us    |          arm_smmu_atc_inv_domain.constprop.48();
  783.040008 |   48) + 12.910 us   |        }
  783.040009 |   48) + 14.370 us   |      }
  783.040010 |   48)               |      iommu_dma_free_iova() {
  783.040011 |   48)               |        free_iova_fast() {
  783.040011 |   48)               |          _raw_spin_lock_irqsave() {
  783.040012 |   48)   0.600 us    |            preempt_count_add();
  783.040013 |   48)   2.000 us    |          }
  783.040014 |   48)               |          _raw_spin_unlock_irqrestore() {
  783.040015 |   48)   0.820 us    |            preempt_count_sub();
  783.040016 |   48)   2.220 us    |          }
  783.040018 |   48)   6.200 us    |        }
  783.040019 |   48)   8.880 us    |      }
  783.040020 |   48) + 42.540 us   |    }
  783.040020 |   48) + 44.030 us   |  }

I am thinking about several possible ways to decrease or remove the map/unmap 
latency for every single DMA transfer. Meanwhile, since "non-strict" mode is 
an existing option with possible safety issues, I won't discuss it in this 
mail.

1. provide bounce coherent buffers for streaming buffers
As the coherent buffers keep their mappings for their whole lifetime, we can 
remove the map and unmap overhead for each single DMA operation. However, this 
solution requires a memory copy between the streaming buffers and the bounce 
buffers, so it only wins if the copy is faster than the map/unmap. Meanwhile, 
it consumes much more memory bandwidth.

2. make upper-layer kernel components aware of the cost of iommu map/unmap
Upper-layer fs, mm, networking code could somehow let the lower-layer drivers 
know the end of the life cycle of sg buffers. In the zswap case, I have seen 
that zswap always uses the same 2 pages as the destination buffers for 
compressed pages, but the compressor driver still has to map and unmap those 
same two pages for every single compression, since zswap and the zip driver 
work in two completely different software layers.

I am thinking of something like the below; upper-layer kernel code could call:
sg_init_table(&sg...);
sg_mark_reusable(&sg....);
.... /* use the buffer many times */
....
sg_mark_stop_reuse(&sg);

After that, if low-level drivers see the "reusable" flag, they know the buffer 
will be used multiple times and will not map/unmap it every time: the 
upper-layer component will keep using the buffer, and the same buffer will 
probably be handed to the lower-layer driver again for a new DMA transfer 
later. When the upper-layer code sets "stop_reuse", the lower-layer driver 
unmaps the sg buffers, possibly via an unmap callback it has provided to the 
upper-layer component. In the zswap case, I have seen the same buffers being 
reused while the zip driver maps and unmaps them again and again; shortly 
after a buffer is unmapped, it is mapped again for the next transfer, with 
almost no gap between the unmap and the map. If zswap could set the "reusable" 
flag, the zip driver would save a lot of time.
Meanwhile, for safety, lower-layer drivers need to make certain that the 
buffers have been unmapped in the IOMMU before those buffers go back to the 
buddy allocator for other users.

I don't think making upper-layer components aware of the overhead of map and 
unmap is elegant, but it might be something that deserves to be done for 
performance reasons. Upper-layer software which is friendly to lower-layer 
drivers might call sg_mark_reusable(&sg....), but it is not enforced: if 
upper-layer components don't call the API, current lower-level drivers are not 
affected.

Please kindly give your comments on this proposal and your suggestions on any 
possible way to improve the performance of the streaming DMA APIs with an 
IOMMU backend. I am glad to send a draft patch for "reusable" buffers if you 
think the idea is not bad.

Best Regards
Barry

_______________________________________________
iommu mailing list
[email protected]
https://lists.linuxfoundation.org/mailman/listinfo/iommu
