> Subject: Re: Constantly map and unmap of streaming DMA buffers with
> IOMMU backend might cause serious performance problem
>
> On 2020-05-15 09:19, Song Bao Hua wrote:
> [ snip... nice analysis, but ultimately it's still "doing stuff has more
> overhead than not doing stuff" ]
>
> > I am thinking several possible ways on decreasing or removing the latency of
> > DMA map/unmap for every single DMA transfer. Meanwhile, "non-strict" as an
> > existing option with possible safety issues, I won't discuss it in this mail.
>
> But passthrough and non-strict mode *specifically exist* for the cases where
> performance is the most important concern - streaming DMA with an IOMMU
> in the middle has an unavoidable tradeoff between performance and isolation,
> so dismissing that out of hand is not a good way to start making this
> argument.

I do understand there is a tradeoff between performance and isolation.
However, users might ask for performance while still keeping isolation.
In passthrough mode, the whole of memory might be accessible by DMA. In
non-strict mode, a buffer could still be mapped in the IOMMU after users
have returned it to the buddy allocator, and even after it has been
allocated to another user.

> > 1. provide bounce coherent buffers for streaming buffers.
> > As the coherent buffers keep the status of mapping, we can remove the
> > overhead of map and unmap for each single DMA operations. However, this
> > solution requires memory copy between stream buffers and bounce buffers.
> > Thus it will work only if copy is faster than map/unmap. Meanwhile, it will
> > consume much more memory bandwidth.
>
> I'm struggling to understand how that would work, can you explain it in more
> detail?

Lower-layer drivers maintain some reusable coherent buffers. For the TX
path, drivers copy the streaming buffer to a coherent buffer, then do DMA.
For the RX path, drivers do DMA into a coherent buffer, then copy to the
streaming buffer.

> > 2.make upper-layer kernel components aware of the pain of iommu
> > map/unmap
> > upper-layer fs, mm, networks can somehow let the lower-layer
> > drivers know the end of the life cycle of sg buffers. In zswap case, I have
> > seen zswap always use the same 2 pages as the destination buffers to save
> > compressed page, but the compressor driver still has to constantly map and
> > unmap those same two pages for every single compression since zswap and zip
> > drivers are working in two completely different software layers.
> >
> > I am thinking some way as below, upper-layer kernel code can call:
> > sg_init_table(&sg...);
> > sg_mark_reusable(&sg....);
> > .... /* use the buffer many times */
> > ....
> > sg_mark_stop_reuse(&sg);
> >
> > After that, if low level drivers see "reusable" flag, it will realize the
> > buffer can be used multiple times and will not do map/unmap every time.
> > it means
> > upper-layer components will further use the buffers and the same buffers will
> > probably be given to lower-layer drivers for new DMA transfer later. When
> > upper-layer code sets "stop_reuse", lower-layer driver will unmap the sg
> > buffers, possibly by providing a unmap-callback to upper-layer components.
> > For zswap case, I have seen the same buffers are always re-used and zip driver
> > maps and unmaps it again and again. Shortly after the buffer is unmapped, it
> > will be mapped in the next transmission, almost without any time gap
> > between unmap and map. In case zswap can set the "reusable" flag, zip driver
> > will save a lot of time.
> > Meanwhile, for the safety of buffers, lower-layer drivers need to make
> > certain the buffers have already been unmapped in iommu before those
> > buffers go back to buddy for other users.
>
> That sounds like it would only have benefit in a very small set of specific
> circumstances, and would be very difficult to generalise to buffers that are
> mapped via dma_map_page() or dma_map_single().

Yes, indeed. Hopefully the small set of specific circumstances will
encourage more upper-layer consumers to reuse buffers; then the "reusable"
flag can be extended to more common cases, such as page and single-buffer
mappings.

> Furthermore, a high-level API that affects a low-level driver's
> interpretation of mid-layer API calls without the mid-layer's knowledge
> sounds like a hideous abomination of anti-design. If a mid-layer API lends
> itself to inefficiency at the lower level, it would seem a lot cleaner and
> more robust to extend *that* API for stateful buffer reuse.

Absolutely agree. I didn't say the method is elegant. For the moment, maybe
"reuse" can get started with a small case like zswap. After a while, more
users may be encouraged to optimize for buffer reuse once they understand
the cost borne by lower-layer drivers. Then those performance problems
might be solved case by case.
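
To make the intended semantics concrete, here is a minimal user-space model
of the flag. It is only a sketch under stated assumptions: every model_*
name is invented for illustration, nothing here touches real scatterlists
or the DMA API, and sg_mark_reusable()/sg_mark_stop_reuse() remain a
proposal rather than existing kernel functions. The model just counts how
many map/unmap operations a driver would perform with and without the flag.

```c
/* User-space model only: none of these are real kernel APIs. */
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for an sg buffer carrying the proposed flag. */
struct model_sg {
	bool reusable;	/* set by the upper layer (e.g. zswap) */
	bool mapped;	/* true while a (simulated) IOMMU mapping is live */
};

/* Counters standing in for the cost of IOMMU map/unmap. */
struct model_stats {
	unsigned long maps;
	unsigned long unmaps;
};

/* Upper layer: announce that the buffer will be handed down repeatedly. */
static void model_sg_mark_reusable(struct model_sg *sg)
{
	sg->reusable = true;
}

/* Driver, before a transfer: map only if no mapping is already live. */
static void model_drv_map(struct model_sg *sg, struct model_stats *st)
{
	if (sg->mapped)
		return;		/* mapping kept from a previous transfer */
	sg->mapped = true;
	st->maps++;
}

/* Driver, after a transfer: keep the mapping if the buffer is reusable. */
static void model_drv_complete(struct model_sg *sg, struct model_stats *st)
{
	if (sg->reusable)
		return;
	sg->mapped = false;
	st->unmaps++;
}

/* Upper layer: end of reuse; models the driver's unmap callback. */
static void model_sg_mark_stop_reuse(struct model_sg *sg,
				     struct model_stats *st)
{
	sg->reusable = false;
	if (sg->mapped) {
		sg->mapped = false;
		st->unmaps++;
	}
}

/* Safety rule: a buffer must be unmapped before it returns to the allocator. */
static void model_free_buffer(const struct model_sg *sg)
{
	assert(!sg->mapped);
}

/* Run 'transfers' DMA operations on one buffer and report the map traffic. */
struct model_stats model_run(unsigned long transfers, bool reuse)
{
	struct model_sg sg = { false, false };
	struct model_stats st = { 0, 0 };
	unsigned long i;

	if (reuse)
		model_sg_mark_reusable(&sg);
	for (i = 0; i < transfers; i++) {
		model_drv_map(&sg, &st);
		/* ... the DMA transfer itself would happen here ... */
		model_drv_complete(&sg, &st);
	}
	if (reuse)
		model_sg_mark_stop_reuse(&sg, &st);
	model_free_buffer(&sg);
	return st;
}
```

In this model, model_run(1000, false) performs 1000 maps and 1000 unmaps,
while model_run(1000, true) performs one of each. The assert in
model_free_buffer() stands for the requirement that the mapping is gone
before the buffer goes back to buddy.
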
On the other hand, upper-layer code is always free to indicate "reuse" or
not. If it says nothing about reuse, lower-layer drivers can simply map
and unmap.

> Failing that, it might possibly be appropriate to approach this at the driver
> level - many of the cleverer network drivers already implement buffer pools to
> recycle mapped SKBs internally, couldn't the "zip driver" simply try doing
> something like that for itself?

Are the buffer pools for the RX path? For the TX path, buffers come from
the upper layer, so network drivers can't do anything to recycle SKBs?

> Robin.

Thanks
Barry
_______________________________________________
iommu mailing list
https://lists.linuxfoundation.org/mailman/listinfo/iommu
