On Thu, Jul 6, 2017 at 11:27 PM, Arnd Bergmann <a...@arndb.de> wrote:
> On Thu, Jul 6, 2017 at 4:06 PM, Tomasz Figa <tf...@chromium.org> wrote:
>> On Thu, Jul 6, 2017 at 11:02 PM, Arnd Bergmann <a...@arndb.de> wrote:
>>> On Thu, Jul 6, 2017 at 3:49 PM, Tomasz Figa <tf...@chromium.org> wrote:
>>>> On Thu, Jul 6, 2017 at 10:31 PM, Tomasz Figa <tf...@chromium.org> wrote:
>>>
>>>>> On the other hand, if it's strictly about base/dma-mapping, we might
>>>>> not need it indeed. The driver could call iommu-dma helpers directly,
>>>>> without the need to provide its own DMA ops. One caveat, though, we
>>>>> are not able to obtain coherent (i.e. uncached) memory with this
>>>>> approach, which might have some performance effects and complicates
>>>>> the code, that would now need to flush caches even for some small
>>>>> internal buffers.
>>>>
>>>> I think I should add a bit of explanation here:
>>>> 1) the device is non-coherent with CPU caches, even on x86,
>>>> 2) it looks like x86 does not have non-coherent DMA ops, (but it
>>>> might be something that could be fixed)
>>>
>>> I don't understand what this means here. The PCI on x86 is always
>>> cache-coherent, so why is the device not?
>>>
>>> Do you mean that the device has its own caches that may need
>>> flushing to make the device cache coherent with the CPU cache,
>>> rather than flushing the CPU caches?
>>
>> Sakari might be able to explain this with more technical details, but
>> generally the device is not a standard PCI device one might find on
>> existing x86 systems.
>>
>> It is some kind of embedded subsystem that behaves mostly like a PCI
>> device, with certain exceptions, one being the lack of coherency with
>> CPU caches, at least for certain parts of the subsystem. The reference
>> vendor code disables the coherency completely, for reasons not known
>> to me, but AFAICT this is the preferred operating mode, possibly due
>> to performance effects (this is a memory-heavy image processing
>
> Ok, got it. I think something similar happens on integrated GPUs for
> a certain CPU family. The DRM code has its own ways of dealing with
> this kind of device. If you find that the hardware to be closely
> related (either the implementation, or the location on the internal
> buses) to the GPU on this machine, I'd recommend having a look
> in drivers/gpu/drm to see how it's handled there, and if that code could
> be shared.

I think it's not closely related, but it might be a very similar case.
Still, DRM is very liberal about not using common code, while V4L2
tries to make things as generic as possible. There is already the
vb2_dma_contig backend, which allocates coherent memory (for
V4L2-allocated buffers), manages caches (for userptr or DMA-buf
buffers) and so on for you.

If we can't have the DMA ops do the right thing, that code is
essentially useless and you are left with vb2_dma_sg, which uses the
page allocator and gives the driver sg tables. (It can actually do
cache management for you as well, but since dma_sync_sg_*() is
essentially a no-op on x86, the driver would have to do it on its own.)

Best regards,
Tomasz
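
P.S. To make "the driver would have to do it on its own" a bit more
concrete, here is a rough userspace model of what that manual cache
management amounts to: walk the segments of a buffer and flush every
cache line each one touches. This is only a sketch, not driver code.
struct seg, flush_segs() and count_line() are hypothetical names made
up for illustration. In a real driver the walk would presumably be
for_each_sg() over the sg table, with something like
clflush_cache_range() in place of the callback.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one sg table entry (NOT struct scatterlist). */
struct seg {
	uintptr_t addr;	/* start address of the segment */
	size_t len;	/* length of the segment in bytes */
};

/* Called once per cache line; a real driver would flush the line here. */
typedef void (*flush_line_fn)(uintptr_t line_addr, void *ctx);

/*
 * Invoke flush() for every cache line touched by each segment.
 * line_size must be a power of two. Returns the number of lines visited.
 */
static size_t flush_segs(const struct seg *segs, size_t nsegs,
			 size_t line_size, flush_line_fn flush, void *ctx)
{
	size_t lines = 0;
	size_t i;

	assert(line_size != 0 && (line_size & (line_size - 1)) == 0);

	for (i = 0; i < nsegs; i++) {
		uintptr_t p, end;

		if (segs[i].len == 0)
			continue;

		/* Round the start down to a cache line boundary. */
		p = segs[i].addr & ~(uintptr_t)(line_size - 1);
		end = segs[i].addr + segs[i].len;	/* exclusive */

		for (; p < end; p += line_size) {
			flush(p, ctx);
			lines++;
		}
	}

	return lines;
}

/* Example callback for this model: only counts lines, flushes nothing. */
static void count_line(uintptr_t line_addr, void *ctx)
{
	(void)line_addr;
	(*(size_t *)ctx)++;
}
```

Note that a segment which starts in the middle of a cache line still
costs a flush of that whole line, so unaligned small buffers are
relatively more expensive. That is part of why having coherent memory
for the small internal buffers would be nicer.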