Re: Try to address the DMA-buf coherency problem

2023-01-05 Thread Sebastian Wick
On Fri, Dec 9, 2022 at 11:28 AM Christian König wrote:
>
> On 09.12.22 at 09:26, Tomasz Figa wrote:
> > [SNIP]
> > Although I think the most common case on mainstream Linux today is
> > properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> > GPU) and hoping that other devices would accept the buffers just fine,
> > which isn't a given on most platforms (although often it's just about
> > pixel format, width/height/stride alignment, tiling, etc. rather than
> > the memory itself). That's why ChromiumOS has minigbm and Android has
> > gralloc that act as the central point of knowledge on buffer
> > allocation.
>
> Yeah, completely agree. The crux is really that we need to improve those
> user space allocators by providing more information from the kernel.
>
> >> 2. We don't properly communicate allocation requirements to userspace.
> >>
> >> E.g. even if you allocate from DMA-Heaps userspace can currently only
> >> guess if normal, CMA or even device specific memory is needed.
> > DMA-buf heaps actually make it even more difficult for the userspace,
> > because now it needs to pick the right heap. With allocation built
> > into the specific UAPI (like V4L2), it's at least possible to allocate
> > for one specific device without having any knowledge about allocation
> > constraints in the userspace.
>
> Yes, same what Daniel said as well. We need to provide some more hints
> which allocator to use from the kernel.

This information doesn't necessarily have to come from the kernel. For
KMS it makes sense for the kernel to provide it but Vulkan drivers for
example could have the information entirely in the UMD. The important
point is that user space can somehow query which heap can be used with
which constraints for a specific operation. At that point it's
entirely a user space problem.
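
To make that concrete, here is a purely hypothetical sketch (none of these
types or query functions exist today) of what such a user space negotiation
could boil down to: each participating API reports the heaps it can accept
for an operation and the allocator picks one that everyone shares:

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of "heap X is usable with these constraints". */
struct heap_choice {
        const char *heap_name;  /* e.g. "system" or a CMA heap name */
        unsigned long align;    /* required allocation alignment */
};

/* Return the first heap in list 0 that every other participant also lists. */
static const struct heap_choice *
pick_common_heap(const struct heap_choice *lists[], const size_t counts[],
                 size_t n_participants)
{
        for (size_t i = 0; i < counts[0]; i++) {
                const struct heap_choice *cand = &lists[0][i];
                bool ok = true;

                for (size_t p = 1; p < n_participants && ok; p++) {
                        ok = false;
                        for (size_t j = 0; j < counts[p]; j++)
                                if (!strcmp(lists[p][j].heap_name,
                                            cand->heap_name))
                                        ok = true;
                }
                if (ok)
                        return cand;
        }
        return NULL;    /* no common heap: fall back to copies */
}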

>
> >> So if a device driver uses cached system memory on an architecture
> >> where devices can't access it, the right approach is clearly to reject
> >> the access.
> > I'd like to accent the fact that "requires cache maintenance" != "can't 
> > access".
>  Well that depends. As said above the exporter exports the buffer as it
>  was allocated.
> 
>  If that means that the exporter provides a piece of memory which requires
>  CPU cache snooping to access correctly then the best thing we can do is
>  to prevent an importer which can't do this from attaching.
> >>> Could you elaborate more about this case? Does it exist in practice?
> >>> Do I assume correctly that it's about sharing a buffer between one DMA
> >>> engine that is cache-coherent and another that is non-coherent, where
> >>> the first one ends up having its accesses always go through some kind
> >>> of a cache (CPU cache, L2/L3/... cache, etc.)?
> >> Yes, exactly that. What happens in this particular use case is that one
> >> device driver wrote to its internal buffer with the CPU (so some cache
> >> lines were dirty) and then a device which couldn't deal with that tried
> >> to access it.
> > If so, shouldn't that driver surround its CPU accesses with
> > begin/end_cpu_access() in the first place?
>
> The problem is that the roles are reversed. The callbacks let the
> exporter know that it needs to flush the caches when the importer is
> done accessing the buffer with the CPU.
>
> But here the exporter is the one accessing the buffer with the CPU and
> the importer then accesses stale data because it doesn't snoop the caches.
>
> What we could do is to force all exporters to use begin/end_cpu_access()
> even on their own buffers and look at all the importers when the access
> is completed. But that would increase the complexity of the handling in
> the exporter.
>
> In other words we would then have code in the exporters which is only
> written for handling the constraints of the importers. This has a wide
> variety of consequences, especially that this functionality of the
> exporter can't be tested without a proper importer.
>
> I was also thinking about reversing the role of exporter and importer in
> the kernel, but came to the conclusion that doing this under the hood
> without userspace knowing about it is probably not going to work either.
>
> > The case that I was suggesting was of a hardware block that actually
> > sits behind the CPU cache and thus dirties it on writes, not the
> > driver doing that. (I haven't personally encountered such a system,
> > though.)
>
> Never heard of anything like that either, but who knows.
>
> >> We could say that all device drivers must always look at the coherency
> >> of the devices which want to access their buffers. But that would
> >> horribly complicate things for maintaining the drivers because then
> >> drivers would need to take into account requirements from other drivers
> >> while allocating their internal buffers.
> > I think it's partially why we have the allocation part of the DMA
> > mapping API, but currently it's only handling requirements of one device.

Re: [Linaro-mm-sig] Re: Try to address the DMA-buf coherency problem

2023-01-05 Thread Daniel Vetter
On Fri, Dec 09, 2022 at 12:07:49PM -0500, Alex Deucher wrote:
> On Fri, Dec 9, 2022 at 4:32 AM Pekka Paalanen  wrote:
> >
> > On Fri, 9 Dec 2022 17:26:06 +0900
> > Tomasz Figa  wrote:
> >
> > > On Mon, Dec 5, 2022 at 5:29 PM Christian König wrote:
> > > >
> > > > Hi Tomasz,
> > > >
> > > > On 05.12.22 at 07:41, Tomasz Figa wrote:
> > > > > [SNIP]
> > > > >> In other words explicit ownership transfer is not something we would
> > > > >> want as a requirement in the framework, because otherwise we break
> > > > >> tons of use cases which require concurrent access to the underlying
> > > > >> buffer.
> > > > >>
> > > > >> When a device driver needs explicit ownership transfer it's perfectly
> > > > >> possible to implement this using the dma_fence objects mentioned 
> > > > >> above.
> > > > >> E.g. drivers can already look at who is accessing a buffer currently 
> > > > >> and
> > > > >> can even grab explicit ownership of it by adding their own dma_fence
> > > > >> objects.
> > > > >>
> > > > >> The only exception is CPU based access, e.g. when something is 
> > > > >> written
> > > > >> with the CPU a cache flush might be necessary and when something is 
> > > > >> read
> > > > >> with the CPU a cache invalidation might be necessary.
> > > > >>
> > > > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > > > should be covered for the cache maintenance needs originating from CPU
> > > > > accesses already, +/- the broken cases which don't call the begin/end
> > > > > CPU access routines that I mentioned above.
> > > > >
> > > > > Similarly, for any ownership transfer between different DMA engines,
> > > > > we should be covered either by the userspace explicitly flushing the
> > > > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > > > >
> > > > > But then, what's left to be solved? :) (Besides the cases of missing
> > > > > begin/end CPU access calls.)
> > > >
> > > > Well there are multiple problems here:
> > > >
> > > > 1. A lot of userspace applications/frameworks assume that they can
> > > > allocate the buffer anywhere and it just works.
> > > >
> > > > This isn't true at all, we have tons of cases where devices can only
> > > > access their special memory for certain use cases.
> > > > Just look at scanout for displaying on dGPU, neither AMD nor NVidia
> > > > supports system memory here. Similar cases exist for audio/video codecs
> > > > where intermediate memory is only accessible by certain devices because
> > > > of content protection.
> > >
> > > Ack.
> > >
> > > Although I think the most common case on mainstream Linux today is
> > > properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> > > GPU) and hoping that other devices would accept the buffers just fine,
> > > which isn't a given on most platforms (although often it's just about
> > > pixel format, width/height/stride alignment, tiling, etc. rather than
> > > the memory itself). That's why ChromiumOS has minigbm and Android has
> > > gralloc that act as the central point of knowledge on buffer
> > > allocation.
> >
> > Hi,
> >
> > as an anecdote, when I was improving Mutter's cross-DRM-device handling
> > (for DisplayLink uses) a few years ago, I implemented several different
> > approaches of where to allocate, to try until going for the slowest but
> > guaranteed to work case of copying every update into and out of sysram.
> >
> > It seems there are two different approaches in general for allocation
> > and sharing:
> >
> > 1. Try different things until it works or you run out of options
> >
> > pro:
> > - no need for a single software component to know everything about
> >   every device in the system
> >
> > con:
> > - who bothers with fallbacks, if the first try works on my system for
> >   my use case I test with? I.e. cost of code in users.
> > - trial-and-error can be very laborious (allocate, share with all
> >   devices, populate, test)
> > - the search space might be huge
> >
> 
> Even that is fraught with difficulty.  We had a ton of bug reports
> over the years claiming amdgpu was broken when users tried to use
> displaylink devices in combination with AMD dGPUs because the
> performance was so slow.  The problem was that rather than using
> dma-buf, the compositor was just mmapping the dGPU BAR, which
> happens to be uncached write combined and across the PCI bus, and
> copying the data to the displaylink device.  Read access to a PCI BAR
> with the CPU is on the order of 10s of MB per second.

Yeah, you shouldn't read back dma-bufs directly when you get them from a
GL/VK driver, but instead use the tools these APIs provide for CPU access.

But that just puts us even more firmly into the "who even bothers"
territory because people just hack up something that happens to work on
their iGPU (where at least on x86 you nowadays tend to get fully
CPU-cached buffers).
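
If CPU readback of a dma-buf really can't be avoided, the dma-buf UAPI at
least provides a begin/end bracket for it so the exporter gets a chance to
do cache maintenance. Rough sketch, assuming the exporter supports mmap and
that dmabuf_fd came from somewhere else:

#include <linux/dma-buf.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

static int read_back_dmabuf(int dmabuf_fd, void *dst, size_t len)
{
        struct dma_buf_sync sync = { 0 };
        void *map;

        map = mmap(NULL, len, PROT_READ, MAP_SHARED, dmabuf_fd, 0);
        if (map == MAP_FAILED)
                return -1;

        sync.flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_READ;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);    /* begin_cpu_access */

        memcpy(dst, map, len);          /* still painfully slow on a BAR */

        sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_READ;
        ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);    /* end_cpu_access */

        munmap(map, len);
        return 0;
}
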
-Daniel

> 
> Alex
> 
> >
> > 2. Have a central component that knows what to do
> >
> > pro:
> > - It might work on the first attempt, so no fallbacks in users.

Re: Try to address the DMA-buf coherency problem

2023-01-05 Thread Daniel Vetter
On Fri, Dec 02, 2022 at 04:27:08PM +0100, Christian König wrote:
> On 30.11.22 at 11:30, Daniel Vetter wrote:
> > On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:
> > > On Wednesday, November 23, 2022 at 17:33 +0100, Daniel Vetter wrote:
> > > > On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
> > > > > On Tue, 22 Nov 2022 18:33:59 +0100
> > > > > Christian König  wrote:
> > > > > 
> > > > > > We should have come up with dma-heaps earlier and made it clear that
> > > > > > exporting a DMA-buf from a device gives you something device-specific
> > > > > > which might or might not work with others.
> > > > > > 
> > > > > > Apart from that I agree, DMA-buf should be capable of handling this.
> > > > > > Question left is what documentation is missing to make it clear how
> > > > > > things are supposed to work?
> > > > > Perhaps somewhat related, from Daniel Stone, something that seems to
> > > > > have been forgotten:
> > > > > https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.com/
> > > > > 
> > > > > It aimed mostly at userspace, but sounds to me like the coherency 
> > > > > stuff
> > > > > could use a section of its own there?
> > > > Hm yeah it would be great to land that and then eventually extend. 
> > > > Daniel?
> > > There are a lot of things documented in this document that have been
> > > called completely wrong user-space behaviour in this thread. But it seems
> > > to pre-date the DMA heaps. The document also assumes that DMA heaps
> > > completely solve the CMA vs. system memory issue. But it also underlines
> > > a very important aspect: that userland is not aware which one to use.
> > > What this document suggests, though, seems more realistic than what has
> > > been said here.
> > > 
> > > It's overall a great document; it's unfortunate that it only made it to
> > > the DRM mailing list.
> > The doc is more about documenting the current status quo/best practices,
> > which is very much not using dma-heaps.
> > 
> > The issue there is that currently userspace has no idea which dma-heap to
> > use for shared buffers, plus not all allocators are exposed through heaps
> > to begin with. We had this noted as a todo item (add some device->heap
> > sysfs links was the idea); until that's done all you can do is hardcode
> > the right heaps for the right usage in userspace, which is what Android
> > does. Plus Android doesn't have dGPUs, so it doesn't need the missing TTM
> > heap.
> 
> Huh? Why do you think we should have a TTM heap in the first place?

[sorry for the really late reply]

Extremely late reply, but p2p buffer sharing when you want it in vram is
only something ttm can provide you. And if the goal is to have all buffer
sharing needs served by dma-heaps, then eventually that'll be needed too.

But probably totally fine to just not have it for another few years
(decades?).

> As far as I can see we need three different ways of allocation:
> 1. Normal system memory.
> 2. CMA based.
> 3. Device specific.
> 
> When any of the participating devices needs CMA then user space must use the
> CMA heap, when any of the participating devices needs device specific memory
> then user space must use device specific memory (from that device).
> 
> It still can be that two devices can't talk with each other because both
> need the buffer to be allocated from device-specific memory for a
> particular use case, but at least the cases of non-CPU-cache-coherent ARM
> devices as well as device-specific memory for scanout should be handled
> with this.

Yeah on ARM this should be enough.
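
For reference, the first two cases are already reachable through the existing
dma-heap UAPI; a minimal sketch (the heap node names are platform assumptions,
in particular the CMA heap is named after the platform's CMA area):

#include <fcntl.h>
#include <linux/dma-heap.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns a dma-buf fd allocated from the given heap, or -1 on error. */
static int alloc_from_heap(const char *heap_path, uint64_t len)
{
        struct dma_heap_allocation_data alloc = {
                .len = len,
                .fd_flags = O_RDWR | O_CLOEXEC,
        };
        int heap_fd, ret;

        heap_fd = open(heap_path, O_RDONLY | O_CLOEXEC);
        if (heap_fd < 0)
                return -1;

        ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &alloc);
        close(heap_fd);

        return ret < 0 ? -1 : (int)alloc.fd;
}

/* e.g. alloc_from_heap("/dev/dma_heap/system", 4 << 20) for case 1, or the
 * platform's CMA heap node under /dev/dma_heap/ for case 2. */
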
-Daniel

> 
> Regards,
> Christian.
> 
> > 
> > But yeah the long-term aspiration also hasn't changed, because the
> > dma-heap todo list is also very, very old by now :-/
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: Try to address the DMA-buf coherency problem

2022-12-12 Thread Christian König

On 12.12.22 at 04:00, Tomasz Figa wrote:

[SNIP]

What we could do is to force all exporters to use begin/end_cpu_access()
even on their own buffers and look at all the importers when the access
is completed. But that would increase the complexity of the handling in
the exporter.

I feel like they should be doing so anyway, because it often depends
on the SoC integration whether the DMA can do cache snooping or not.


Yeah, but that wouldn't help without the exporter taking care of the
importers' cache flushing needs. And that's exactly what we try hard
to avoid because it creates code in every exporter which is never tested
except for the case where you combine this exporter with a non-coherent
importer.


That's a testing nightmare because then you have code everywhere which is
actually exercised only in very few combinations of exporter and importer.



Although arguably, there is a corner case today where if one uses
dma_alloc_coherent() to get a buffer with a coherent CPU mapping for
device X that is declared as cache-coherent, one also expects not to
need to call begin/end_cpu_access(), but those would be needed if the
buffer was to be imported by device Y that is not cache-coherent...

Sounds like after all it's a mess. I guess your patches make it one
step closer to something sensible, import would fail in such cases.
Although arguably we should be able to still export from driver Y and
import to driver X just fine if Y allocated the buffer as coherent -
otherwise we would break existing users for whom things worked fine.


Allocating the buffer as coherent won't help in this case because we 
actually do want CPU caching for performant access. It's just that some 
device needs a cache flush before it sees all the changes.


As far as I can see, without adding additional complexity to the exporter
this can only be achieved in two ways:


1. Switch the role of the exporter and importer. This way the device 
with the need for the cache flush is the exporter and in control of the 
operations on its buffers.


2. We use DMA-buf as neutral mediator. Since DMA-buf keeps track of who 
has mapped the buffers it inserts the appropriate 
dma_sync_*_for_device() calls.
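
To make (2) a bit more concrete, this is roughly the kind of per-attachment
maintenance meant here, written the way the system dma-heap already does it
in its own begin/end_cpu_access callbacks (under option 2 the DMA-buf core
would do the equivalent); the buffer/attachment bookkeeping below is made up
for illustration:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/scatterlist.h>

struct my_attachment {
        struct device *dev;
        struct sg_table *table;
        struct list_head list;
};

struct my_buffer {
        struct mutex lock;
        struct list_head attachments;
};

static int my_begin_cpu_access(struct dma_buf *dmabuf,
                               enum dma_data_direction dir)
{
        struct my_buffer *buf = dmabuf->priv;
        struct my_attachment *a;

        mutex_lock(&buf->lock);
        /* Give the CPU a coherent view before it touches the pages. */
        list_for_each_entry(a, &buf->attachments, list)
                dma_sync_sgtable_for_cpu(a->dev, a->table, dir);
        mutex_unlock(&buf->lock);
        return 0;
}

static int my_end_cpu_access(struct dma_buf *dmabuf,
                             enum dma_data_direction dir)
{
        struct my_buffer *buf = dmabuf->priv;
        struct my_attachment *a;

        mutex_lock(&buf->lock);
        /* Flush the CPU's writes so non-snooping devices see them. */
        list_for_each_entry(a, &buf->attachments, list)
                dma_sync_sgtable_for_device(a->dev, a->table, dir);
        mutex_unlock(&buf->lock);
        return 0;
}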



[SNIP]
The best we can do is to reject combinations which won't work in the
kernel and then userspace could react accordingly.


The question is whether userspace is able to deal with it, given the
limited amount of information it gets from the kernel. Sure, there is
always the ultimate fallback of memcpy(), but in many cases that would
be totally unusable due to severe performance impact. If we were to
remove the existing extent of implicit handling from the kernel, I
think we need to first give the userspace the information necessary to
explicitly handle the fallback to the same extent.


Good point.


We also need to think about backwards compatibility. Simply removing
the implicit fallback cases would probably break a lot of userspace,
so an opt-in behavior is likely needed initially...


Yes, I'm completely aware of that as well.

We can't hard break userspace even if the previous approach didn't
work 100% correctly.



That's essentially the reason why we have DMA-buf heaps. Those heaps
expose system memory buffers with certain properties (size, CMA, DMA bit
restrictions etc...) and also implement the callback functions for CPU
cache maintenance.


How do DMA-buf heaps change anything here? We already have CPU cache
maintenance callbacks in the DMA-buf API - the begin/end_cpu_access()
for CPU accesses and dmabuf_map/unmap_attachment() for device accesses
(which arguably shouldn't actually do CPU cache maintenance, unless
that's implied by how either of the involved DMA engines work).

DMA-buf heaps are the neutral man in the middle.

The implementation keeps track of all the attached importers and should
make sure that the allocated memory fits the needs of everyone.
Additionally, calls to the DMA-API cache management functions
are inserted whenever CPU access begins/ends.


I think in the current design, it only knows all the importers after the
buffer is already allocated, so it doesn't necessarily have a way to
handle the allocation constraints. Something would have to be done to
get all the importers attached before the allocation actually takes
place.


That's already in place. See the attach and map callbacks.

I'm just not sure if the heaps fully implement it like this.


The problem is that in this particular case the exporter provides the
buffer as is, e.g. with dirty CPU caches. And the importer can't deal
with that.

Why does the exporter leave the buffer with dirty CPU caches?

Because exporters always export the buffers as they would use them. And in
this case that means writing to them with the CPU.


Sorry for the question not being very clear. I meant: How do the CPU
caches get dirty in that case?

The exporter wrote to it. As far as I understand the exporter just
copies things from A to B with memcpy to construct the 

Re: Try to address the DMA-buf coherency problem

2022-12-11 Thread Tomasz Figa
On Fri, Dec 9, 2022 at 6:32 PM Pekka Paalanen  wrote:
>
> On Fri, 9 Dec 2022 17:26:06 +0900
> Tomasz Figa  wrote:
>
> > > On Mon, Dec 5, 2022 at 5:29 PM Christian König wrote:
> > >
> > > Hi Tomasz,
> > >
> > > > On 05.12.22 at 07:41, Tomasz Figa wrote:
> > > > [SNIP]
> > > >> In other words explicit ownership transfer is not something we would
> > > >> want as a requirement in the framework, because otherwise we break tons of
> > > >> use cases which require concurrent access to the underlying buffer.
> > > >>
> > > >> When a device driver needs explicit ownership transfer it's perfectly
> > > >> possible to implement this using the dma_fence objects mentioned above.
> > > >> E.g. drivers can already look at who is accessing a buffer currently 
> > > >> and
> > > >> can even grab explicit ownership of it by adding their own dma_fence
> > > >> objects.
> > > >>
> > > >> The only exception is CPU based access, e.g. when something is written
> > > >> with the CPU a cache flush might be necessary and when something is 
> > > >> read
> > > >> with the CPU a cache invalidation might be necessary.
> > > >>
> > > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > > should be covered for the cache maintenance needs originating from CPU
> > > > accesses already, +/- the broken cases which don't call the begin/end
> > > > CPU access routines that I mentioned above.
> > > >
> > > > Similarly, for any ownership transfer between different DMA engines,
> > > > we should be covered either by the userspace explicitly flushing the
> > > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > > >
> > > > But then, what's left to be solved? :) (Besides the cases of missing
> > > > begin/end CPU access calls.)
> > >
> > > Well there are multiple problems here:
> > >
> > > 1. A lot of userspace applications/frameworks assume that they can
> > > allocate the buffer anywhere and it just works.
> > >
> > > This isn't true at all, we have tons of cases where devices can only
> > > access their special memory for certain use cases.
> > > Just look at scanout for displaying on dGPU, neither AMD nor NVidia
> > > supports system memory here. Similar cases exist for audio/video codecs
> > > where intermediate memory is only accessible by certain devices because
> > > of content protection.
> >
> > Ack.
> >
> > Although I think the most common case on mainstream Linux today is
> > properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> > GPU) and hoping that other devices would accept the buffers just fine,
> > which isn't a given on most platforms (although often it's just about
> > pixel format, width/height/stride alignment, tiling, etc. rather than
> > the memory itself). That's why ChromiumOS has minigbm and Android has
> > gralloc that act as the central point of knowledge on buffer
> > allocation.
>
> Hi,
>
> as an anecdote, when I was improving Mutter's cross-DRM-device handling
> (for DisplayLink uses) a few years ago, I implemented several different
> approaches of where to allocate, to try until going for the slowest but
> guaranteed to work case of copying every update into and out of sysram.
>
> It seems there are two different approaches in general for allocation
> and sharing:
>
> 1. Try different things until it works or you run out of options
>
> pro:
> - no need for a single software component to know everything about
>   every device in the system
>
> con:
> - who bothers with fallbacks, if the first try works on my system for
>   my use case I test with? I.e. cost of code in users.
> - trial-and-error can be very laborious (allocate, share with all
>   devices, populate, test)
> - the search space might be huge
>
>
> 2. Have a central component that knows what to do
>
> pro:
> - It might work on the first attempt, so no fallbacks in users.
> - It might be optimal.
>
> con:
> - You need a software component that knows everything about every
>   single combination of hardware in existence, multiplied by use cases.
>
>
> Neither seems good, which brings us back to 
> https://github.com/cubanismo/allocator .
>

I need to refresh my memory on how far we went with that and what the
blockers were, but it really sounds like we need it to make things work
properly on a mainstream Linux system.

When I was still participating in the discussions, I remember the idea
was to have an API exposed by various components, like EGL, Vulkan,
etc. to describe the constraints. Maybe to simplify the problem,
instead of having this complex negotiation between different APIs, we
could have a plugin-based library and plugins would be installed
together with the various API implementations (e.g. even proprietary
Nvidia drivers could provide an open source allocation plugin to tell
the central allocator library about its requirements)?

I also recall we were stopped by lack of a generic DMA-buf allocation
functionality exposed to the userspace, but that now exists as DMA-buf
heaps.
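
Purely as a sketch of the plugin idea from above (nothing like this exists
today): each API implementation would ship a small plugin describing its
constraints and the dma-heaps it can import from, and the central allocator
library would load the plugins and intersect the answers before allocating:

#include <stdint.h>

/* Hypothetical constraint description one plugin would fill in. */
struct alloc_constraints {
        uint32_t pitch_align;           /* required stride alignment, bytes */
        uint64_t modifiers[8];          /* acceptable format modifiers */
        const char *heaps[4];           /* dma-heap names the device accepts */
};

/* Hypothetical plugin entry point, one per API implementation. */
struct allocator_plugin {
        const char *name;               /* e.g. "vendor-vulkan", "v4l2-decoder" */
        int (*get_constraints)(const char *usage,       /* "scanout", ... */
                               struct alloc_constraints *out);
};

/* The central library would dlopen() each plugin and look up this symbol. */
extern const struct allocator_plugin *allocator_plugin_entry(void);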

>
> > > 2. We don't properly communicate allocation requirements to userspace.

Re: Try to address the DMA-buf coherency problem

2022-12-11 Thread Tomasz Figa
On Fri, Dec 9, 2022 at 7:27 PM Christian König  wrote:
>
> On 09.12.22 at 09:26, Tomasz Figa wrote:
[snip]
> Yes, same what Daniel said as well. We need to provide some more hints
> which allocator to use from the kernel.
>
> >> So if a device driver uses cached system memory on an architecture
> >> where devices can't access it, the right approach is clearly to reject
> >> the access.
> > I'd like to accent the fact that "requires cache maintenance" != "can't 
> > access".
>  Well that depends. As said above the exporter exports the buffer as it
>  was allocated.
> 
>  If that means that the exporter provides a piece of memory which requires
>  CPU cache snooping to access correctly then the best thing we can do is
>  to prevent an importer which can't do this from attaching.
> >>> Could you elaborate more about this case? Does it exist in practice?
> >>> Do I assume correctly that it's about sharing a buffer between one DMA
> >>> engine that is cache-coherent and another that is non-coherent, where
> >>> the first one ends up having its accesses always go through some kind
> >>> of a cache (CPU cache, L2/L3/... cache, etc.)?
> >> Yes, exactly that. What happens in this particular use case is that one
> >> device driver wrote to its internal buffer with the CPU (so some cache
> >> lines were dirty) and then a device which couldn't deal with that tried
> >> to access it.
> > If so, shouldn't that driver surround its CPU accesses with
> > begin/end_cpu_access() in the first place?
>
> The problem is that the roles are reversed. The callbacks let the
> exporter know that it needs to flush the caches when the importer is
> done accessing the buffer with the CPU.
>
> But here the exporter is the one accessing the buffer with the CPU and
> the importer then accesses stale data because it doesn't snoop the caches.
>
> What we could do is to force all exporters to use begin/end_cpu_access()
> even on their own buffers and look at all the importers when the access
> is completed. But that would increase the complexity of the handling in
> the exporter.

I feel like they should be doing so anyway, because it often depends
on the SoC integration whether the DMA can do cache snooping or not.

Although arguably, there is a corner case today where if one uses
dma_alloc_coherent() to get a buffer with a coherent CPU mapping for
device X that is declared as cache-coherent, one also expects not to
need to call begin/end_cpu_access(), but those would be needed if the
buffer was to be imported by device Y that is not cache-coherent...

Sounds like after all it's a mess. I guess your patches make it one
step closer to something sensible, import would fail in such cases.
Although arguably we should be able to still export from driver Y and
import to driver X just fine if Y allocated the buffer as coherent -
otherwise we would break existing users for whom things worked fine.

>
> In other words we would then have code in the exporters which is only
> written for handling the constraints of the importers. This has a wide
> variety of consequences, especially that this functionality of the
> exporter can't be tested without a proper importer.
>
> I was also thinking about reversing the role of exporter and importer in
> the kernel, but came to the conclusion that doing this under the hood
> without userspace knowing about it is probably not going to work either.
>
> > The case that I was suggesting was of a hardware block that actually
> > sits behind the CPU cache and thus dirties it on writes, not the
> > driver doing that. (I haven't personally encountered such a system,
> > though.)
>
> Never heard of anything like that either, but who knows.
>
> >> We could say that all device drivers must always look at the coherency
> >> of the devices which want to access their buffers. But that would
> >> horribly complicate things for maintaining the drivers because then
> >> drivers would need to take into account requirements from other drivers
> >> while allocating their internal buffers.
> > I think it's partially why we have the allocation part of the DMA
> > mapping API, but currently it's only handling requirements of one
> > device. And we don't have any information from the userspace what
> > other devices the buffer would be used with...
>
> Exactly that, yes.
>
> > Actually, do we even have such information in the userspace today?
> > Let's say I do a video call in a web browser on a typical Linux
> > system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
> > V4L2 camera fills in buffers with video frames and both encoder and
> > display consume them. Do we have a central place which would know that
> > a buffer needs to be allocated that works with the producer and all
> > consumers?
>
> Both X11 and Wayland have protocols which can be used to display a
> certain DMA-buf handle, their feedback packages contain information how
> ideal your 

Re: Try to address the DMA-buf coherency problem

2022-12-09 Thread Alex Deucher
On Fri, Dec 9, 2022 at 4:32 AM Pekka Paalanen  wrote:
>
> On Fri, 9 Dec 2022 17:26:06 +0900
> Tomasz Figa  wrote:
>
> > On Mon, Dec 5, 2022 at 5:29 PM Christian König wrote:
> > >
> > > Hi Tomasz,
> > >
> > > On 05.12.22 at 07:41, Tomasz Figa wrote:
> > > > [SNIP]
> > > >> In other words explicit ownership transfer is not something we would
> > > >> want as a requirement in the framework, because otherwise we break tons of
> > > >> use cases which require concurrent access to the underlying buffer.
> > > >>
> > > >> When a device driver needs explicit ownership transfer it's perfectly
> > > >> possible to implement this using the dma_fence objects mentioned above.
> > > >> E.g. drivers can already look at who is accessing a buffer currently 
> > > >> and
> > > >> can even grab explicit ownership of it by adding their own dma_fence
> > > >> objects.
> > > >>
> > > >> The only exception is CPU based access, e.g. when something is written
> > > >> with the CPU a cache flush might be necessary and when something is 
> > > >> read
> > > >> with the CPU a cache invalidation might be necessary.
> > > >>
> > > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > > should be covered for the cache maintenance needs originating from CPU
> > > > accesses already, +/- the broken cases which don't call the begin/end
> > > > CPU access routines that I mentioned above.
> > > >
> > > > Similarly, for any ownership transfer between different DMA engines,
> > > > we should be covered either by the userspace explicitly flushing the
> > > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > > >
> > > > But then, what's left to be solved? :) (Besides the cases of missing
> > > > begin/end CPU access calls.)
> > >
> > > Well there are multiple problems here:
> > >
> > > 1. A lot of userspace applications/frameworks assume that they can
> > > allocate the buffer anywhere and it just works.
> > >
> > > This isn't true at all, we have tons of cases where devices can only
> > > access their special memory for certain use cases.
> > > Just look at scanout for displaying on dGPU, neither AMD nor NVidia
> > > supports system memory here. Similar cases exist for audio/video codecs
> > > where intermediate memory is only accessible by certain devices because
> > > of content protection.
> >
> > Ack.
> >
> > Although I think the most common case on mainstream Linux today is
> > properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> > GPU) and hoping that other devices would accept the buffers just fine,
> > which isn't a given on most platforms (although often it's just about
> > pixel format, width/height/stride alignment, tiling, etc. rather than
> > the memory itself). That's why ChromiumOS has minigbm and Android has
> > gralloc that act as the central point of knowledge on buffer
> > allocation.
>
> Hi,
>
> as an anecdote, when I was improving Mutter's cross-DRM-device handling
> (for DisplayLink uses) a few years ago, I implemented several different
> approaches of where to allocate, to try until going for the slowest but
> guaranteed to work case of copying every update into and out of sysram.
>
> It seems there are two different approaches in general for allocation
> and sharing:
>
> 1. Try different things until it works or you run out of options
>
> pro:
> - no need for a single software component to know everything about
>   every device in the system
>
> con:
> - who bothers with fallbacks, if the first try works on my system for
>   my use case I test with? I.e. cost of code in users.
> - trial-and-error can be very laborious (allocate, share with all
>   devices, populate, test)
> - the search space might be huge
>

Even that is fraught with difficulty.  We had a ton of bug reports
over the years claiming amdgpu was broken when users tried to use
displaylink devices in combination with AMD dGPUs because the
performance was so slow.  The problem was that rather than using
dma-buf, the compositor was just mmapping the dGPU BAR, which
happens to be uncached write combined and across the PCI bus, and
copying the data to the displaylink device.  Read access to a PCI BAR
with the CPU is on the order of 10s of MB per second.

Alex

>
> 2. Have a central component that knows what to do
>
> pro:
> - It might work on the first attempt, so no fallbacks in users.
> - It might be optimal.
>
> con:
> - You need a software component that knows everything about every
>   single combination of hardware in existence, multiplied by use cases.
>
>
> Neither seems good, which brings us back to 
> https://github.com/cubanismo/allocator .
>
>
> > > 2. We don't properly communicate allocation requirements to userspace.
> > >
> > > E.g. even if you allocate from DMA-Heaps userspace can currently only
> > > guess if normal, CMA or even device specific memory is needed.
> >
> > DMA-buf heaps actually make it even more difficult for the userspace,
> > because now it needs to pick the right heap.

Re: Try to address the DMA-buf coherency problem

2022-12-09 Thread Christian König

On 09.12.22 at 09:26, Tomasz Figa wrote:

[SNIP]
Although I think the most common case on mainstream Linux today is
properly allocating for device X (e.g. V4L2 video decoder or DRM-based
GPU) and hoping that other devices would accept the buffers just fine,
which isn't a given on most platforms (although often it's just about
pixel format, width/height/stride alignment, tiling, etc. rather than
the memory itself). That's why ChromiumOS has minigbm and Android has
gralloc that act as the central point of knowledge on buffer
allocation.


Yeah, completely agree. The crux is really that we need to improve those 
user space allocators by providing more information from the kernel.



2. We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only
guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace,
because now it needs to pick the right heap. With allocation built
into the specific UAPI (like V4L2), it's at least possible to allocate
for one specific device without having any knowledge about allocation
constraints in the userspace.


Yes, same what Daniel said as well. We need to provide some more hints 
which allocator to use from the kernel.
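
One idea that comes up elsewhere in this thread is device->heap hints in
sysfs. A purely hypothetical sketch of how user space could consume such a
hint if it existed (the attribute name below is made up, nothing like it is
exposed today):

#include <stdio.h>
#include <string.h>

/* Returns 0 and fills "heap" (e.g. "system", "cma") on success. */
static int query_preferred_heap(const char *sysfs_dev_path,
                                char *heap, size_t len)
{
        char path[256];
        FILE *f;

        snprintf(path, sizeof(path), "%s/dma_heap_hint", sysfs_dev_path);
        f = fopen(path, "r");
        if (!f)
                return -1;              /* kernel provides no hint */
        if (!fgets(heap, (int)len, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);
        heap[strcspn(heap, "\n")] = '\0';
        return 0;
}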



So if a device driver uses cached system memory on an architecture where
devices can't access it, the right approach is clearly to reject
the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it
was allocated.

If that means that the exporter provides a piece of memory which requires
CPU cache snooping to access correctly then the best thing we can do is
to prevent an importer which can't do this from attaching.

Could you elaborate more about this case? Does it exist in practice?
Do I assume correctly that it's about sharing a buffer between one DMA
engine that is cache-coherent and another that is non-coherent, where
the first one ends up having its accesses always go through some kind
of a cache (CPU cache, L2/L3/... cache, etc.)?

Yes, exactly that. What happens in this particular use case is that one
device driver wrote to its internal buffer with the CPU (so some cache
lines were dirty) and then a device which couldn't deal with that tried
to access it.

If so, shouldn't that driver surround its CPU accesses with
begin/end_cpu_access() in the first place?


The problem is that the roles are reversed. The callbacks let the 
exporter know that it needs to flush the caches when the importer is
done accessing the buffer with the CPU.


But here the exporter is the one accessing the buffer with the CPU and 
the importer then accesses stale data because it doesn't snoop the caches.


What we could do is to force all exporters to use begin/end_cpu_access() 
even on their own buffers and look at all the importers when the access
is completed. But that would increase the complexity of the handling in 
the exporter.


In other words we would then have code in the exporters which is only 
written for handling the constraints of the importers. This has a wide
variety of consequences, especially that this functionality of the 
exporter can't be tested without a proper importer.
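
For illustration, forcing an exporter to bracket even its own CPU writes
would look roughly like this; dma_buf_begin/end_cpu_access() are the existing
kernel entry points, while my_obj and the fill function are made-up driver
internals:

#include <linux/dma-buf.h>
#include <linux/dma-mapping.h>
#include <linux/string.h>
#include <linux/types.h>

struct my_obj {
        struct dma_buf *dmabuf;
        void *vaddr;
        size_t size;
};

static void fill_with_cpu(struct my_obj *obj)
{
        memset(obj->vaddr, 0, obj->size);       /* placeholder CPU write */
}

static int my_driver_update_buffer(struct my_obj *obj)
{
        int ret;

        ret = dma_buf_begin_cpu_access(obj->dmabuf, DMA_TO_DEVICE);
        if (ret)
                return ret;

        fill_with_cpu(obj);     /* dirties CPU cache lines */

        /* Gives a hook to flush before a non-snooping importer reads. */
        return dma_buf_end_cpu_access(obj->dmabuf, DMA_TO_DEVICE);
}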


I was also thinking about reversing the role of exporter and importer in 
the kernel, but came to the conclusion that doing this under the hood 
without userspace knowing about it is probably not going to work either.



The case that I was suggesting was of a hardware block that actually
sits behind the CPU cache and thus dirties it on writes, not the
driver doing that. (I haven't personally encountered such a system,
though.)


Never heard of anything like that either, but who knows.


We could say that all device drivers must always look at the coherency
of the devices which want to access their buffers. But that would
horribly complicate things for maintaining the drivers because then
drivers would need to take into account requirements from other drivers
while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA
mapping API, but currently it's only handling requirements of one
device. And we don't have any information from the userspace what
other devices the buffer would be used with...


Exactly that, yes.


Actually, do we even have such information in the userspace today?
Let's say I do a video call in a web browser on a typical Linux
system. I have a V4L2 camera, VAAPI video encoder and X11 display. The
V4L2 camera fills in buffers with video frames and both encoder and
display consume them. Do we have a central place which would know that
a buffer needs to be allocated that works with the producer and all
consumers?


Both X11 and Wayland have protocols which can be used to display a 
certain DMA-buf handle, their feedback packages contain information how 
ideal your 

Re: Try to address the DMA-buf coherency problem

2022-12-09 Thread Pekka Paalanen
On Fri, 9 Dec 2022 17:26:06 +0900
Tomasz Figa  wrote:

> On Mon, Dec 5, 2022 at 5:29 PM Christian König wrote:
> >
> > Hi Tomasz,
> >
> > On 05.12.22 at 07:41, Tomasz Figa wrote:
> > > [SNIP]  
> > >> In other words explicit ownership transfer is not something we would
> > >> want as a requirement in the framework, because otherwise we break tons of
> > >> use cases which require concurrent access to the underlying buffer.
> > >>
> > >> When a device driver needs explicit ownership transfer it's perfectly
> > >> possible to implement this using the dma_fence objects mentioned above.
> > >> E.g. drivers can already look at who is accessing a buffer currently and
> > >> can even grab explicit ownership of it by adding their own dma_fence
> > >> objects.
> > >>
> > >> The only exception is CPU based access, e.g. when something is written
> > >> with the CPU a cache flush might be necessary and when something is read
> > >> with the CPU a cache invalidation might be necessary.
> > >>  
> > > Okay, that's much clearer now, thanks for clarifying this. So we
> > > should be covered for the cache maintenance needs originating from CPU
> > > accesses already, +/- the broken cases which don't call the begin/end
> > > CPU access routines that I mentioned above.
> > >
> > > Similarly, for any ownership transfer between different DMA engines,
> > > we should be covered either by the userspace explicitly flushing the
> > > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > >
> > > But then, what's left to be solved? :) (Besides the cases of missing
> > > begin/end CPU access calls.)  
> >
> > Well there are multiple problems here:
> >
> > 1. A lot of userspace applications/frameworks assume that they can
> > allocate the buffer anywhere and it just works.
> >
> > This isn't true at all, we have tons of cases where devices can only
> > access their special memory for certain use cases.
> > Just look at scanout for displaying on dGPU, neither AMD nor NVidia
> > supports system memory here. Similar cases exist for audio/video codecs
> > where intermediate memory is only accessible by certain devices because
> > of content protection.  
> 
> Ack.
> 
> Although I think the most common case on mainstream Linux today is
> properly allocating for device X (e.g. V4L2 video decoder or DRM-based
> GPU) and hoping that other devices would accept the buffers just fine,
> which isn't a given on most platforms (although often it's just about
> pixel format, width/height/stride alignment, tiling, etc. rather than
> the memory itself). That's why ChromiumOS has minigbm and Android has
> gralloc that act as the central point of knowledge on buffer
> allocation.

Hi,

as an anecdote, when I was improving Mutter's cross-DRM-device handling
(for DisplayLink uses) a few years ago, I implemented several different
approaches of where to allocate, to try until going for the slowest but
guaranteed to work case of copying every update into and out of sysram.

It seems there are two different approaches in general for allocation
and sharing:

1. Try different things until it works or you run out of options

pro:
- no need for a single software component to know everything about
  every device in the system

con:
- who bothers with fallbacks, if the first try works on my system for
  my use case I test with? I.e. cost of code in users.
- trial-and-error can be very laborious (allocate, share with all
  devices, populate, test)
- the search space might be huge


2. Have a central component that knows what to do

pro:
- It might work on the first attempt, so no fallbacks in users.
- It might be optimal.

con:
- You need a software component that knows everything about every
  single combination of hardware in existence, multiplied by use cases.


Neither seems good, which brings us back to 
https://github.com/cubanismo/allocator .


> > 2. We don't properly communicate allocation requirements to userspace.
> >
> > E.g. even if you allocate from DMA-Heaps userspace can currently only
> > guess if normal, CMA or even device specific memory is needed.  
> 
> DMA-buf heaps actually make it even more difficult for the userspace,
> because now it needs to pick the right heap. With allocation built
> into the specific UAPI (like V4L2), it's at least possible to allocate
> for one specific device without having any knowledge about allocation
> constraints in the userspace.
> 
> >
> > 3. We seem to lack some essential parts of those restrictions in the
> > documentation.
> >  
> 
> Ack.
> 
> >  So if a device driver uses cached system memory on an architecture
> >  where devices can't access it, the right approach is clearly to reject
> >  the access.
> > >>> I'd like to accent the fact that "requires cache maintenance" != "can't 
> > >>> access".  
> > >> Well that depends. As said above the exporter exports the buffer as it
> > >> was allocated.
> > >>
> > >> If that means that the exporter provides a piece of 

Re: Try to address the DMA-buf coherency problem

2022-12-09 Thread Tomasz Figa
On Mon, Dec 5, 2022 at 5:29 PM Christian König  wrote:
>
> Hi Tomasz,
>
> On 05.12.22 at 07:41, Tomasz Figa wrote:
> > [SNIP]
> >> In other words explicit ownership transfer is not something we would
> >> want as a requirement in the framework, because otherwise we break tons of
> >> use cases which require concurrent access to the underlying buffer.
> >>
> >> When a device driver needs explicit ownership transfer it's perfectly
> >> possible to implement this using the dma_fence objects mentioned above.
> >> E.g. drivers can already look at who is accessing a buffer currently and
> >> can even grab explicit ownership of it by adding their own dma_fence
> >> objects.
> >>
> >> The only exception is CPU based access, e.g. when something is written
> >> with the CPU a cache flush might be necessary and when something is read
> >> with the CPU a cache invalidation might be necessary.
> >>
> > Okay, that's much clearer now, thanks for clarifying this. So we
> > should be covered for the cache maintenance needs originating from CPU
> > accesses already, +/- the broken cases which don't call the begin/end
> > CPU access routines that I mentioned above.
> >
> > Similarly, for any ownership transfer between different DMA engines,
> > we should be covered either by the userspace explicitly flushing the
> > hardware pipeline or attaching a DMA-buf fence to the buffer.
> >
> > But then, what's left to be solved? :) (Besides the cases of missing
> > begin/end CPU access calls.)
>
> Well there are multiple problems here:
>
> 1. A lot of userspace applications/frameworks assume that they can
> allocate the buffer anywhere and it just works.
>
> This isn't true at all, we have tons of cases where devices can only
> access their special memory for certain use cases.
> Just look at scanout for displaying on dGPU, neither AMD nor NVidia
> supports system memory here. Similar cases exist for audio/video codecs
> where intermediate memory is only accessible by certain devices because
> of content protection.

Ack.

Although I think the most common case on mainstream Linux today is
properly allocating for device X (e.g. V4L2 video decoder or DRM-based
GPU) and hoping that other devices would accept the buffers just fine,
which isn't a given on most platforms (although often it's just about
pixel format, width/height/stride alignment, tiling, etc. rather than
the memory itself). That's why ChromiumOS has minigbm and Android has
gralloc that act as the central point of knowledge on buffer
allocation.

>
> 2. We don't properly communicate allocation requirements to userspace.
>
> E.g. even if you allocate from DMA-Heaps userspace can currently only
> guess if normal, CMA or even device specific memory is needed.

DMA-buf heaps actually make it even more difficult for the userspace,
because now it needs to pick the right heap. With allocation built
into the specific UAPI (like V4L2), it's at least possible to allocate
for one specific device without having any knowledge about allocation
constraints in the userspace.

>
> 3. We seem to lack some essential parts of those restrictions in the
> documentation.
>

Ack.

>  So if a device driver uses cached system memory on an architecture
>  where devices can't access it, the right approach is clearly to reject
>  the access.
> >>> I'd like to accent the fact that "requires cache maintenance" != "can't 
> >>> access".
> >> Well that depends. As said above the exporter exports the buffer as it
> >> was allocated.
> >>
> >> If that means that the exporter provides a piece of memory which requires
> >> CPU cache snooping to access correctly then the best thing we can do is
> >> to prevent an importer which can't do this from attaching.
> > Could you elaborate more about this case? Does it exist in practice?
> > Do I assume correctly that it's about sharing a buffer between one DMA
> > engine that is cache-coherent and another that is non-coherent, where
> > the first one ends up having its accesses always go through some kind
> > of a cache (CPU cache, L2/L3/... cache, etc.)?
>
> Yes, exactly that. What happens in this particular use case is that one
> device driver wrote to its internal buffer with the CPU (so some cache
> lines were dirty) and then a device which couldn't deal with that tried
> to access it.

If so, shouldn't that driver surround its CPU accesses with
begin/end_cpu_access() in the first place?

The case that I was suggesting was of a hardware block that actually
sits behind the CPU cache and thus dirties it on writes, not the
driver doing that. (I haven't personally encountered such a system,
though.)

>
> We could say that all device drivers must always look at the coherency
> of the devices which want to access their buffers. But that would
> horribly complicate things for maintaining the drivers because then
> drivers would need to take into account requirements from other drivers
> while allocating their internal buffers.

I think it's partially why we have the allocation part of the DMA mapping
API, but currently it's only handling requirements of one device.

Re: Try to address the DMA-buf coherency problem

2022-12-06 Thread Christian König

On 06.12.22 at 19:26, Nicolas Dufresne wrote:

On Monday, December 5, 2022 at 09:28 +0100, Christian König wrote:

Hi Tomasz,

On 05.12.22 at 07:41, Tomasz Figa wrote:

[SNIP]

In other words explicit ownership transfer is not something we would
want as a requirement in the framework, because otherwise we break tons of
use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly
possible to implement this using the dma_fence objects mentioned above.
E.g. drivers can already look at who is accessing a buffer currently and
can even grab explicit ownership of it by adding their own dma_fence
objects.

The only exception is CPU based access, e.g. when something is written
with the CPU a cache flush might be necessary and when something is read
with the CPU a cache invalidation might be necessary.


Okay, that's much clearer now, thanks for clarifying this. So we
should be covered for the cache maintenance needs originating from CPU
accesses already, +/- the broken cases which don't call the begin/end
CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines,
we should be covered either by the userspace explicitly flushing the
hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing
begin/end CPU access calls.)

Well there are multiple problems here:

1. A lot of userspace applications/frameworks assume that they can
allocate the buffer anywhere and it just works.

I know you have said that about 10 times, and perhaps I'm about to believe it,
but why do you think userspace assumes this? Did you actually read code that
does this (that isn't meant to run in a controlled environment)?


Yes, absolutely.

Kodi for example used to do this and it was actually me who pointed out 
that this whole approach is flawed in the first place.


And we had tons of customer projects which ran into trouble with that. 
It became better in the last few years, but only after pushing back on 
this many many times.


Regards,
Christian.


  And can you provide
some example of broken generic userspace? The DMABuf flow is meant to be trial
and error. At least in GStreamer, yes, mostly only device allocation (when
generically usable) is implemented, but the code that has been contributed will
try to fall back as documented. It still fails sometimes, but that's exactly the
kind of kernel bug your patchset is trying to address. I don't blame anyone
here, since why would folks on GStreamer/FFMPEG or any other "generic media
framework" spend so much time implementing "per Linux device code", when
non-embedded (constrained) Linux is just a handful of users (compared to
Windows, Android, iOS users)?

To me, this shouldn't be the #1 issue. Perhaps it should simply be replaced by
userspace not supporting DMABuf heaps. Perhaps add that Linux distributions
don't always enable (or allow normal users to access) heaps (though your point
2 gets in the way)? Unlike virtual memory, I don't think there is a very good
accounting and reclaiming mechanism for that memory, hence opening these up
means any userspace could possibly impair the system's functioning. If you
can't e.g. limit their usage within containers, this is pretty difficult for
generic Linux to carry. This is a wider problem of course, which likely affects
a lot of GPU usage too, but perhaps it should be in the lower-priority part of
the todo.


This isn't true at all, we have tons of cases where devices can only
access their special memory for certain use cases.
Just look at scanout for displaying on dGPU, neither AMD nor NVidia
supports system memory here. Similar cases exist for audio/video codecs
where intermediate memory is only accessible by certain devices because
of content protection.

nit: content protection is not codec-specific, it's a platform feature, and
it's also not really a thing upstream yet from what I'm aware of. This needs
unified design and documentation imho, but also enough standardisation so that
a generic application can use it. Right now, content protection people have
been complaining that V4L2 (and most generic userspace) doesn't work with their
design, rather than trying to figure out a design that works with the existing
APIs.


2. We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only
guess if normal, CMA or even device specific memory is needed.

3. We seem to lack some essential parts of those restrictions in the
documentation.

Agreed (can't always disagree).

regards,
Nicolas


So if a device driver uses cached system memory on an architecture where
devices can't access it, the right approach is clearly to reject
the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it
was allocated.

If 

Re: Try to address the DMA-buf coherency problem

2022-12-06 Thread Nicolas Dufresne
On Monday, December 5, 2022 at 09:28 +0100, Christian König wrote:
> Hi Tomasz,
> 
> On 05.12.22 at 07:41, Tomasz Figa wrote:
> > [SNIP]
> > > In other words explicit ownership transfer is not something we would
> > > want as a requirement in the framework, because otherwise we break tons of
> > > use cases which require concurrent access to the underlying buffer.
> > > 
> > > When a device driver needs explicit ownership transfer it's perfectly
> > > possible to implement this using the dma_fence objects mentioned above.
> > > E.g. drivers can already look at who is accessing a buffer currently and
> > > can even grab explicit ownership of it by adding their own dma_fence
> > > objects.
> > > 
> > > The only exception is CPU based access, e.g. when something is written
> > > with the CPU a cache flush might be necessary and when something is read
> > > with the CPU a cache invalidation might be necessary.
> > > 
> > Okay, that's much clearer now, thanks for clarifying this. So we
> > should be covered for the cache maintenance needs originating from CPU
> > accesses already, +/- the broken cases which don't call the begin/end
> > CPU access routines that I mentioned above.
> > 
> > Similarly, for any ownership transfer between different DMA engines,
> > we should be covered either by the userspace explicitly flushing the
> > hardware pipeline or attaching a DMA-buf fence to the buffer.
> > 
> > But then, what's left to be solved? :) (Besides the cases of missing
> > begin/end CPU access calls.)
> 
> Well there are multiple problems here:
> 
> 1. A lot of userspace applications/frameworks assume that they can
> allocate the buffer anywhere and it just works.

I know you have said that about 10 times, and perhaps I'm about to believe it,
but why do you think userspace assumes this? Did you actually read code that
does this (that isn't meant to run in a controlled environment)? And can you
provide some example of broken generic userspace? The DMABuf flow is meant to
be trial and error. At least in GStreamer, yes, mostly only device allocation
(when generically usable) is implemented, but the code that has been
contributed will try to fall back as documented. It still fails sometimes, but
that's exactly the kind of kernel bug your patchset is trying to address. I
don't blame anyone here, since why would folks on GStreamer/FFMPEG or any other
"generic media framework" spend so much time implementing "per Linux device
code", when non-embedded (constrained) Linux is just a handful of users
(compared to Windows, Android, iOS users)?

To me, this shouldn't be the #1 issue. Perhaps it should simply be replaced by
userspace not supporting DMABuf Heaps. Perhaps add that Linux distributions don't
always enable (or allow normal users to access) heaps (though your point 2 gets
in the way)? Unlike virtual memory, I don't think there is a very good accounting
and reclaiming mechanism for that memory, hence opening these up means any
userspace could possibly impair the system's functioning. If you can't e.g. limit
their usage within containers, this is pretty difficult for generic Linux to
carry. This is a wider problem of course, which likely affects a lot of GPU usage
too, but perhaps it should be in the lower-priority part of the todo.

> 
> This isn't true at all, we have tons of cases where device can only 
> access their special memory for certain use cases.
> Just look at scanout for displaying on dGPU, neither AMD nor NVidia 
> supports system memory here. Similar cases exists for audio/video codecs 
> where intermediate memory is only accessible by certain devices because 
> of content protection.

nit: content protection is not CODEC specific, it's a platform feature, and it's also
not really a thing upstream yet as far as I'm aware. This needs a unified
design and documentation imho, but also enough standardisation so that a generic
application can use it. Right now, content protection people have been
complaining that V4L2 (and most generic userspace) doesn't work with their design,
rather than trying to figure out a design that works with the existing APIs.

> 
> 2. We don't properly communicate allocation requirements to userspace.
> 
> E.g. even if you allocate from DMA-Heaps userspace can currently only 
> guess if normal, CMA or even device specific memory is needed.
> 
> 3. We seem to lack some essential parts of those restrictions in the 
> documentation.

Agreed (can't always disagree).

regards,
Nicolas

> 
> > > > > So if a device driver uses cached system memory on an architecture 
> > > > > which
> > > > > devices which can't access it the right approach is clearly to reject
> > > > > the access.
> > > > I'd like to accent the fact that "requires cache maintenance" != "can't 
> > > > access".
> > > Well that depends. As said above the exporter exports the buffer as it
> > > was allocated.
> > > 
> > > If that means the the exporter provides a piece of memory which requires
> > > CPU cache snooping to access correctly then the best 

Re: Try to address the DMA-buf coherency problem

2022-12-05 Thread Christian König

Hi Tomasz,

Am 05.12.22 um 07:41 schrieb Tomasz Figa:

[SNIP]

In other words explicit ownership transfer is not something we would
want as requirement in the framework, cause otherwise we break tons of
use cases which require concurrent access to the underlying buffer.

When a device driver needs explicit ownership transfer it's perfectly
possible to implement this using the dma_fence objects mentioned above.
E.g. drivers can already look at who is accessing a buffer currently and
can even grab explicit ownership of it by adding their own dma_fence
objects.

The only exception is CPU based access, e.g. when something is written
with the CPU a cache flush might be necessary and when something is read
with the CPU a cache invalidation might be necessary.
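
Roughly, the in-kernel bracketing for that looks like the following. This is only
a sketch, with the vmap details, locking rules and error handling simplified and
the function name made up:

    #include <linux/dma-buf.h>
    #include <linux/dma-direction.h>
    #include <linux/iosys-map.h>

    /* Sketch: a driver writing to a DMA-buf with the CPU. The begin/end
     * calls let the exporter flush/invalidate caches for devices which
     * don't snoop the CPU caches. */
    static int cpu_fill_buffer(struct dma_buf *dmabuf)
    {
            struct iosys_map map;
            int ret;

            ret = dma_buf_begin_cpu_access(dmabuf, DMA_TO_DEVICE);
            if (ret)
                    return ret;

            ret = dma_buf_vmap(dmabuf, &map);
            if (!ret) {
                    iosys_map_memset(&map, 0, 0, dmabuf->size); /* the CPU write */
                    dma_buf_vunmap(dmabuf, &map);
            }

            dma_buf_end_cpu_access(dmabuf, DMA_TO_DEVICE);
            return ret;
    }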


Okay, that's much clearer now, thanks for clarifying this. So we
should be covered for the cache maintenance needs originating from CPU
accesses already, +/- the broken cases which don't call the begin/end
CPU access routines that I mentioned above.

Similarly, for any ownership transfer between different DMA engines,
we should be covered either by the userspace explicitly flushing the
hardware pipeline or attaching a DMA-buf fence to the buffer.

But then, what's left to be solved? :) (Besides the cases of missing
begin/end CPU access calls.)


Well there are multiple problems here:

1. A lot of userspace applications/frameworks assume that it can 
allocate the buffer anywhere and it just works.


This isn't true at all, we have tons of cases where device can only 
access their special memory for certain use cases.
Just look at scanout for displaying on dGPU, neither AMD nor NVidia 
supports system memory here. Similar cases exists for audio/video codecs 
where intermediate memory is only accessible by certain devices because 
of content protection.


2. We don't properly communicate allocation requirements to userspace.

E.g. even if you allocate from DMA-Heaps userspace can currently only 
guess if normal, CMA or even device specific memory is needed.


3. We seem to lack some essential parts of those restrictions in the 
documentation.



So if a device driver uses cached system memory on an architecture where
devices can't access it, the right approach is clearly to reject
the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

Well that depends. As said above the exporter exports the buffer as it
was allocated.

If that means that the exporter provides a piece of memory which requires
CPU cache snooping to access correctly, then the best thing we can do is
to prevent an importer which can't do this from attaching.
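
Just to illustrate, "prevent from attaching" can be as simple as failing the
attach callback. A sketch, assuming the exporter can tell coherency from the
importer's struct device via dev_is_dma_coherent() (which isn't available
everywhere):

    #include <linux/dma-buf.h>
    #include <linux/dma-map-ops.h>

    /* Sketch: an exporter whose buffers live in cached system memory and
     * which refuses importers that cannot snoop the CPU caches. */
    static int cached_exporter_attach(struct dma_buf *dmabuf,
                                      struct dma_buf_attachment *attach)
    {
            if (!dev_is_dma_coherent(attach->dev))
                    return -EINVAL; /* importer would read stale cache lines */

            return 0;
    }

    static const struct dma_buf_ops cached_exporter_ops = {
            .attach = cached_exporter_attach,
            /* .map_dma_buf, .unmap_dma_buf, .release, ... elided */
    };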

Could you elaborate more about this case? Does it exist in practice?
Do I assume correctly that it's about sharing a buffer between one DMA
engine that is cache-coherent and another that is non-coherent, where
the first one ends up having its accesses always go through some kind
of a cache (CPU cache, L2/L3/... cache, etc.)?


Yes, exactly that. What happens in this particular use case is that one 
device driver wrote to its internal buffer with the CPU (so some cache 
lines were dirty) and then a device which couldn't deal with that tried 
to access it.


We could say that all device drivers must always look at the coherency 
of the devices which want to access their buffers. But that would 
horribly complicate things for maintaining the drivers, because then 
drivers would need to take into account requirements from other drivers 
while allocating their internal buffers.


That's essentially the reason why we have DMA-buf heaps. Those heaps 
expose system memory buffers with certain properties (size, CMA, DMA bit 
restrictions etc...) and also implement the callback functions for CPU 
cache maintenance.



We do have the system and CMA dma-buf heaps for cases where a device
driver wants to access the buffer with caches enabled. So this is
not a limitation in functionality, it's just a matter of correctly using it.


V4L2 also has the ability to allocate buffers and map them with caches enabled.


Yeah, when that's a requirement for the V4L2 device it also makes 
perfect sense.


It's just that we shouldn't force any specific allocation behavior on a 
device driver because of the requirements of a different device.


To give an example, a V4L2 device shouldn't be forced to use 
videobuf2-dma-contig because some other device needs CMA. Instead we 
should use the common DMA-buf heaps, which implement this as a neutral 
supplier of system memory buffers.
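
For reference, allocating from such a heap in userspace is a single ioctl;
a minimal sketch (the /dev/dma_heap path is the standard one, but which heap
name to pass, "system" here, is ultimately platform policy):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/dma-heap.h>

    /* Sketch: returns a DMA-buf fd backed by the named heap, or -1. */
    static int alloc_from_heap(const char *heap_name, size_t len)
    {
            struct dma_heap_allocation_data data = {
                    .len = len,
                    .fd_flags = O_RDWR | O_CLOEXEC,
            };
            char path[64];
            int heap_fd, ret;

            snprintf(path, sizeof(path), "/dev/dma_heap/%s", heap_name);
            heap_fd = open(path, O_RDONLY | O_CLOEXEC);
            if (heap_fd < 0)
                    return -1;

            ret = ioctl(heap_fd, DMA_HEAP_IOCTL_ALLOC, &data);
            close(heap_fd);
            return ret < 0 ? -1 : (int)data.fd;
    }

The resulting fd can then be imported by V4L2, DRM and so on; deciding
between "system", the CMA heap or something device specific is exactly the
userspace policy question discussed in this thread.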



The problem is that in this particular case the exporter provides the
buffer as is, e.g. with dirty CPU caches. And the importer can't deal
with that.

Why does the exporter leave the buffer with dirty CPU caches?


Because exporters always export the buffers as they would use them. And in 
this case that means writing to it with the CPU.



Either reverting the roles of the importer or exporter or switching over
to the common DMA-heaps 

Re: Try to address the DMA-buf coherency problem

2022-12-04 Thread Tomasz Figa
Hi Christian,

On Thu, Nov 17, 2022 at 9:11 PM Christian König
 wrote:
>
> Hi Tomasz,
>
> Am 17.11.22 um 10:35 schrieb Tomasz Figa:
> > Hi Christian and everyone,
> >
> > On Thu, Nov 3, 2022 at 4:14 AM Christian König
> >  wrote:
> >> Am 02.11.22 um 18:10 schrieb Lucas Stach:
> >>> Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> >>> [SNIP]
>  It would just be doing this for the importer and exactly that
>  would be bad design because we then have handling for the display driver
>  outside of the driver.
> 
> >>> The driver would have to do those cache maintenance operations if it
> >>> directly worked with a non-coherent device. Doing it for the importer
> >>> is just doing it for another device, not the one directly managed by
> >>> the exporter.
> >>>
> >>> I really don't see the difference to the other dma-buf ops: in
> >>> dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
> >>> the address space of the importer. Why would cache maintenance be any
> >>> different?
> >> The issue here is the explicit ownership transfer.
> >>
> >> We intentionally decided against that because it breaks tons of use
> >> cases and is at least by me and a couple of others seen as generally
> >> design failure of the Linux DMA-API.
> > First of all, thanks for starting the discussion and sorry for being
> > late to the party. May I ask you to keep me on CC for any changes that
> > touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay
> > being copied on the entire series, no need to pick the specific
> > patches. Thanks in advance.
>
> Sorry for that, I've only added people involved in the previous
> discussion. Going to keep you in the loop.
>

No worries. Thanks.

Sorry, for being late with the reply, had a bit of vacation and then
some catching up last week.

> > I agree that we have some design issues in the current DMA-buf
> > framework, but I'd try to approach it a bit differently. Instead of
> > focusing on the issues in the current design, could we write down our
> > requirements and try to come up with how a correct design would look
> > like? (A lot of that has been already mentioned in this thread, but I
> > find it quite difficult to follow and it might not be a complete view
> > either.)
>
> Well, exactly that's what we disagree on.
>
> As far as I can see the current design of DMA-buf is perfectly capable
> of handling all the requirements.
>
> A brief summary of the requirements with some implementation notes:
>
> 1. Device drivers export their memory as it is. E.g. no special handling
> for importers on the exporter side.
>  If an importer can't deal with a specific format, layout, caching
> etc... of the data the correct approach is to reject the attachment.
>  Those parameters are controlled by userspace and negotiating them
> is explicitly not part of the framework.

Ack. I believe it matches the current implementation of the DMA-buf
framework, although as you mentioned, the swiotlb feature of the DMA
mapping framework kind of violates this.

>
> 2. Accesses of the CPU to a buffer are bracketed in begin_cpu_access()
> and end_cpu_access() calls.
>  Here we can insert the CPU cache invalidation/flushes as necessary.

Ack. I think a part of the problem today is that there exist userspace
and kernel code instances that don't insert them and assume that some
magic keeps the cache clean...
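
For reference, the userspace side of that bracketing is the DMA_BUF_IOCTL_SYNC
uapi from <linux/dma-buf.h>; roughly (a sketch, error handling trimmed):

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/dma-buf.h>

    /* Sketch: CPU write access to a mmap()ed DMA-buf, properly bracketed. */
    static int cpu_write(int dmabuf_fd, void *map, size_t len)
    {
            struct dma_buf_sync sync = {
                    .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_WRITE,
            };

            if (ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync) < 0)
                    return -1;

            memset(map, 0, len);    /* the actual CPU access */

            sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_WRITE;
            return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
    }

Skipping either ioctl is precisely the "magic keeps the cache clean" assumption
mentioned above.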

>
> 3. Accesses of the device HW to a buffer are represented with dma_fence
> objects.
>  It's explicitly allowed to have multiple devices access the buffer
> at the same time.
>

Ack. Again there exists kernel code that doesn't honor or provide DMA
fences (e.g. V4L2).

> 4. Access to the DMA-buf by the HW of an importer is setup by the exporter.
>  In other words the exporter provides a bunch of DMA addresses the
> importer should access.
>  The importer should not try to come up with those on its own.
>
> > That said, let me address a few aspects already mentioned, to make
> > sure that everyone is on the same page.
> >
> >> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> >> be able to directly decided where a certain operation should go. E.g. we
> >> have cases where for example a P2P write doesn't even go to memory, but
> >> rather a doorbell BAR to trigger another operation. Throwing in CPU
> >> round trips for explicit ownership transfer completely breaks that concept.
> > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > accepts two (or an array?) of devices and tells the caller whether the
> > devices need explicit ownership transfer.
>
> No, exactly that's the concept I'm pushing back on very hard here.
>
> In other words explicit ownership transfer is not something we would
> want as requirement in the framework, cause otherwise we break tons of
> use cases which require concurrent access to the underlying buffer.
>
> When a device driver needs explicit ownership transfer 

Re: Try to address the DMA-buf coherency problem

2022-12-02 Thread Christian König

Am 30.11.22 um 11:30 schrieb Daniel Vetter:

On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:

Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :

On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:

On Tue, 22 Nov 2022 18:33:59 +0100
Christian König  wrote:


We should have come up with dma-heaps earlier and make it clear that
exporting a DMA-buf from a device gives you something device specific
which might or might not work with others.

Apart from that I agree, DMA-buf should be capable of handling this.
Question left is what documentation is missing to make it clear how
things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been
forgotten:
https://lore.kernel.org/dri-devel/20210905122742.86029-1-daniels@collabora.com/

It aimed mostly at userspace, but sounds to me like the coherency stuff
could use a section of its own there?

Hm yeah it would be great to land that and then eventually extend. Daniel?

There are a lot of things documented in this document that have been said to be
completely wrong user-space behaviour in this thread. But it seems to pre-date
the DMA Heaps. The document also assumes that DMA Heaps completely solve the CMA
vs system memory issue. But it also underlines a very important aspect: that
userland is not aware which one to use. What this document suggests, though, seems
more realistic than what has been said here.

It's overall a great document; it's unfortunate that it only made it onto the DRM
mailing list.

The doc is more about documenting the current status quo/best practices,
which is very much not using dma-heaps.

The issue there is that currently userspace has no idea which dma-heap to
use for shared buffers, plus not all allocators are exposed through heaps
to begin with. We had this noted as a todo item (add some device->heap
sysfs links was the idea), until that's done all you can do is hardcode
the right heaps for the right usage in userspace, which is what android
does. Plus android doesn't have dgpu, so doesn't need the missing ttm
heap.


Hui? Why do you think we should have a TTM heap in the first place?

As far as I can see we need three different ways of allocation:
1. Normal system memory.
2. CMA based.
3. Device specific.

When any of the participating devices needs CMA then user space must use 
the CMA heap; when any of the participating devices needs device-specific 
memory then user space must use device-specific memory (from 
that device).


It can still be that two devices can't talk with each other because both 
need the buffer to be allocated from device-specific memory for a 
particular use case, but at least all the cases for both non-CPU-cache-coherent 
ARM devices as well as device-specific memory for scanout 
should be handled with this.


Regards,
Christian.



But yeah the long-term aspiration also hasn't changed, because the
dma-heap todo list is also very, very old by now :-/
-Daniel




Re: Try to address the DMA-buf coherency problem

2022-11-30 Thread Daniel Vetter
On Fri, Nov 25, 2022 at 11:40:04AM -0500, Nicolas Dufresne wrote:
> Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :
> > On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
> > > On Tue, 22 Nov 2022 18:33:59 +0100
> > > Christian König  wrote:
> > > 
> > > > We should have come up with dma-heaps earlier and make it clear that 
> > > > exporting a DMA-buf from a device gives you something device specific 
> > > > which might or might not work with others.
> > > > 
> > > > Apart from that I agree, DMA-buf should be capable of handling this. 
> > > > Question left is what documentation is missing to make it clear how 
> > > > things are supposed to work?
> > > 
> > > Perhaps somewhat related from Daniel Stone that seems to have been
> > > forgotten:
> > > https://lore.kernel.org/dri-devel/20210905122742.86029-1-dani...@collabora.com/
> > > 
> > > It aimed mostly at userspace, but sounds to me like the coherency stuff
> > > could use a section of its own there?
> > 
> > Hm yeah it would be great to land that and then eventually extend. Daniel?
> 
> There is a lot of things documented in this document that have been said to be
> completely wrong user-space behaviour in this thread. But it seems to pre-date
> the DMA Heaps. The document also assume that DMA Heaps completely solves the 
> CMA
> vs system memory issue. But it also underline a very important aspect, that
> userland is not aware which one to use. What this document suggest though 
> seems
> more realist then what has been said here.
> 
> Its overall a great document, it unfortunate that it only makes it into the 
> DRM
> mailing list.

The doc is more about documenting the current status quo/best practices,
which is very much not using dma-heaps.

The issue there is that currently userspace has no idea which dma-heap to
use for shared buffers, plus not all allocators are exposed through heaps
to begin with. We had this noted as a todo item (add some device->heap
sysfs links was the idea), until that's done all you can do is hardcode
the right heaps for the right usage in userspace, which is what android
does. Plus android doesn't have dgpu, so doesn't need the missing ttm
heap.

But yeah the long-term aspiration also hasn't changed, because the
dma-heap todo list is also very, very old by now :-/
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: Try to address the DMA-buf coherency problem

2022-11-25 Thread Nicolas Dufresne
Le mercredi 23 novembre 2022 à 17:33 +0100, Daniel Vetter a écrit :
> On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
> > On Tue, 22 Nov 2022 18:33:59 +0100
> > Christian König  wrote:
> > 
> > > We should have come up with dma-heaps earlier and make it clear that 
> > > exporting a DMA-buf from a device gives you something device specific 
> > > which might or might not work with others.
> > > 
> > > Apart from that I agree, DMA-buf should be capable of handling this. 
> > > Question left is what documentation is missing to make it clear how 
> > > things are supposed to work?
> > 
> > Perhaps somewhat related from Daniel Stone that seems to have been
> > forgotten:
> > https://lore.kernel.org/dri-devel/20210905122742.86029-1-dani...@collabora.com/
> > 
> > It aimed mostly at userspace, but sounds to me like the coherency stuff
> > could use a section of its own there?
> 
> Hm yeah it would be great to land that and then eventually extend. Daniel?

There are a lot of things documented in this document that have been said to be
completely wrong user-space behaviour in this thread. But it seems to pre-date
the DMA Heaps. The document also assumes that DMA Heaps completely solve the CMA
vs system memory issue. But it also underlines a very important aspect: that
userland is not aware which one to use. What this document suggests, though, seems
more realistic than what has been said here.

It's overall a great document; it's unfortunate that it only made it onto the DRM
mailing list.

Nicolas


Re: Try to address the DMA-buf coherency problem

2022-11-23 Thread Daniel Vetter
On Wed, Nov 23, 2022 at 10:33:38AM +0200, Pekka Paalanen wrote:
> On Tue, 22 Nov 2022 18:33:59 +0100
> Christian König  wrote:
> 
> > We should have come up with dma-heaps earlier and make it clear that 
> > exporting a DMA-buf from a device gives you something device specific 
> > which might or might not work with others.
> > 
> > Apart from that I agree, DMA-buf should be capable of handling this. 
> > Question left is what documentation is missing to make it clear how 
> > things are supposed to work?
> 
> Perhaps somewhat related from Daniel Stone that seems to have been
> forgotten:
> https://lore.kernel.org/dri-devel/20210905122742.86029-1-dani...@collabora.com/
> 
> It aimed mostly at userspace, but sounds to me like the coherency stuff
> could use a section of its own there?

Hm yeah it would be great to land that and then eventually extend. Daniel?
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: Try to address the DMA-buf coherency problem

2022-11-23 Thread Pekka Paalanen
On Tue, 22 Nov 2022 18:33:59 +0100
Christian König  wrote:

> We should have come up with dma-heaps earlier and make it clear that 
> exporting a DMA-buf from a device gives you something device specific 
> which might or might not work with others.
> 
> Apart from that I agree, DMA-buf should be capable of handling this. 
> Question left is what documentation is missing to make it clear how 
> things are supposed to work?

Perhaps somewhat related from Daniel Stone that seems to have been
forgotten:
https://lore.kernel.org/dri-devel/20210905122742.86029-1-dani...@collabora.com/

It aimed mostly at userspace, but sounds to me like the coherency stuff
could use a section of its own there?


Thanks,
pq


pgp6RaDw237gs.pgp
Description: OpenPGP digital signature


Re: Try to address the DMA-buf coherency problem

2022-11-22 Thread Daniel Vetter
On Tue, 22 Nov 2022 at 18:34, Christian König  wrote:
> Am 22.11.22 um 15:36 schrieb Daniel Vetter:
> > On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:
> >> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne  
> >> wrote:
> >>> Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> >> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> >> be able to directly decided where a certain operation should go. E.g. 
> >> we
> >> have cases where for example a P2P write doesn't even go to memory, but
> >> rather a doorbell BAR to trigger another operation. Throwing in CPU
> >> round trips for explicit ownership transfer completely breaks that
> >> concept.
> > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > accepts two (or an array?) of devices and tells the caller whether the
> > devices need explicit ownership transfer.
>  No, exactly that's the concept I'm pushing back on very hard here.
> 
>  In other words explicit ownership transfer is not something we would
>  want as requirement in the framework, cause otherwise we break tons of
>  use cases which require concurrent access to the underlying buffer.
> >>> I'm not pushing for this solution, but really felt the need to correct 
> >>> you here.
> >>> I have quite some experience with ownership transfer mechanism, as this 
> >>> is how
> >>> GStreamer framework works since 2000. Concurrent access is a really 
> >>> common use
> >>> cases and it is quite well defined in that context. The bracketing system 
> >>> (in
> >>> this case called map() unmap(), with flag stating the usage intention 
> >>> like reads
> >>> and write) is combined the the refcount. The basic rules are simple:
> >> This is all CPU oriented, I think Christian is talking about the case
> >> where ownership transfer happens without CPU involvement, such as via
> >> GPU waiting on a fence
> > Yeah for pure device2device handover the rule pretty much has to be that
> > any coherency management that needs to be done must be done from the
> > device side (flushing device side caches and stuff like that) only. But
> > under the assumption that _all_ cpu side management has been done already
> > before the first device access started.
> >
> > And then the map/unmap respectively begin/end_cpu_access can be used what
> > it was meant for with cpu side invalidation/flushing and stuff like that,
> > while having pretty clear handover/ownership rules and hopefully not doing
> > no unecessary flushes. And all that while allowing device acces to be
> > pipelined. Worst case the exporter has to insert some pipelined cache
> > flushes as a dma_fence pipelined work of its own between the device access
> > when moving from one device to the other. That last part sucks a bit right
> > now, because we don't have any dma_buf_attachment method which does this
> > syncing without recreating the mapping, but in reality this is solved by
> > caching mappings in the exporter (well dma-buf layer) nowadays.
> >
> > True concurrent access like vk/compute expects is still a model that
> > dma-buf needs to support on top, but that's a special case and pretty much
> > needs hw that supports such concurrent access without explicit handover
> > and fencing.
> >
> > Aside from some historical accidents and still a few warts, I do think
> > dma-buf does support both of these models.
>
> We should have come up with dma-heaps earlier and make it clear that
> exporting a DMA-buf from a device gives you something device specific
> which might or might not work with others.

Yeah, but engineering practicalities were pretty clear that no one
would rewrite the entire Xorg stack and all the drivers just to make
that happen for prime.

> Apart from that I agree, DMA-buf should be capable of handling this.
> Question left is what documentation is missing to make it clear how
> things are supposed to work?

Given the historical baggage of existing use-cases, I think the only
way out is that we look at concrete examples from real world users
that break, and figure out how to fix them. Without breaking any of
the existing mess.

One idea might be that we have a per-platform
dma_buf_legacy_coherency_mode(), which tells you what mode (cpu cache
snooping or uncached memory) you need to use to make sure that all
devices agree. On x86 the rule might be that it's cpu cache snooping
by default, but if you have an integrated gpu then everyone needs to
use uncached. That sucks, but at least we could keep the existing mess
going and clean it up. Everyone else would be uncached, except maybe
arm64 servers with pcie connectors. Essentially least common
denominator to make this work. Note that uncached actually means
device access doesn't snoop, the cpu side you can handle with either
uc/wc mappings or explicit flushing.
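
Purely as a sketch of that idea (no such helper exists, and the rules below are
only the rough heuristics from this mail):

    /* Hypothetical, nothing like this is in the kernel today. */
    enum dma_buf_coherency_mode {
            DMA_BUF_COHERENCY_SNOOPED,      /* device access snoops the CPU caches */
            DMA_BUF_COHERENCY_UNCACHED,     /* device access does not snoop */
    };

    static enum dma_buf_coherency_mode dma_buf_legacy_coherency_mode(void)
    {
    #if defined(CONFIG_X86)
            /* Rough x86 rule of thumb: snooped by default; an integrated GPU
             * in the mix would flip this to uncached (not modelled here). */
            return DMA_BUF_COHERENCY_SNOOPED;
    #else
            /* Least common denominator everywhere else. */
            return DMA_BUF_COHERENCY_UNCACHED;
    #endif
    }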

Then once we have that we could implement the coherency negotiation
protocol on top as an explicit opt-in, 

Re: Try to address the DMA-buf coherency problem

2022-11-22 Thread Christian König

Am 22.11.22 um 15:36 schrieb Daniel Vetter:

On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:

On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne  wrote:

Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :

DMA-Buf let's the exporter setup the DMA addresses the importer uses to
be able to directly decided where a certain operation should go. E.g. we
have cases where for example a P2P write doesn't even go to memory, but
rather a doorbell BAR to trigger another operation. Throwing in CPU
round trips for explicit ownership transfer completely breaks that
concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which
accepts two (or an array?) of devices and tells the caller whether the
devices need explicit ownership transfer.

No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would
want as requirement in the framework, cause otherwise we break tons of
use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here.
I have quite some experience with ownership transfer mechanisms, as this is how
the GStreamer framework has worked since 2000. Concurrent access is a really common use
case and it is quite well defined in that context. The bracketing system (in
this case called map()/unmap(), with a flag stating the usage intention like read
and write) is combined with the refcount. The basic rules are simple:

This is all CPU oriented, I think Christian is talking about the case
where ownership transfer happens without CPU involvement, such as via
GPU waiting on a fence

Yeah for pure device2device handover the rule pretty much has to be that
any coherency management that needs to be done must be done from the
device side (flushing device side caches and stuff like that) only. But
under the assumption that _all_ cpu side management has been done already
before the first device access started.

And then map/unmap respectively begin/end_cpu_access can be used for what
they were meant for, with cpu side invalidation/flushing and stuff like that,
while having pretty clear handover/ownership rules and hopefully not doing
unnecessary flushes. And all that while allowing device access to be
pipelined. Worst case the exporter has to insert some pipelined cache
flushes as a dma_fence pipelined work of its own between the device access
when moving from one device to the other. That last part sucks a bit right
now, because we don't have any dma_buf_attachment method which does this
syncing without recreating the mapping, but in reality this is solved by
caching mappings in the exporter (well dma-buf layer) nowadays.

True concurrent access like vk/compute expects is still a model that
dma-buf needs to support on top, but that's a special case and pretty much
needs hw that supports such concurrent access without explicit handover
and fencing.

Aside from some historical accidents and still a few warts, I do think
dma-buf does support both of these models.


We should have come up with dma-heaps earlier and make it clear that 
exporting a DMA-buf from a device gives you something device specific 
which might or might not work with others.


Apart from that I agree, DMA-buf should be capable of handling this. 
Question left is what documentation is missing to make it clear how 
things are supposed to work?


Regards,
Christian.


  Of course in the case of
gpu/drm drivers, userspace must know what's possible and act accordingly,
otherwise you just get to keep all the pieces.
-Daniel




Re: Try to address the DMA-buf coherency problem

2022-11-22 Thread Daniel Vetter
On Fri, Nov 18, 2022 at 11:32:19AM -0800, Rob Clark wrote:
> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne  wrote:
> >
> > Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> > > > > DMA-Buf let's the exporter setup the DMA addresses the importer uses 
> > > > > to
> > > > > be able to directly decided where a certain operation should go. E.g. 
> > > > > we
> > > > > have cases where for example a P2P write doesn't even go to memory, 
> > > > > but
> > > > > rather a doorbell BAR to trigger another operation. Throwing in CPU
> > > > > round trips for explicit ownership transfer completely breaks that
> > > > > concept.
> > > > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > > > accepts two (or an array?) of devices and tells the caller whether the
> > > > devices need explicit ownership transfer.
> > >
> > > No, exactly that's the concept I'm pushing back on very hard here.
> > >
> > > In other words explicit ownership transfer is not something we would
> > > want as requirement in the framework, cause otherwise we break tons of
> > > use cases which require concurrent access to the underlying buffer.
> >
> > I'm not pushing for this solution, but really felt the need to correct you 
> > here.
> > I have quite some experience with ownership transfer mechanism, as this is 
> > how
> > GStreamer framework works since 2000. Concurrent access is a really common 
> > use
> > cases and it is quite well defined in that context. The bracketing system 
> > (in
> > this case called map() unmap(), with flag stating the usage intention like 
> > reads
> > and write) is combined the the refcount. The basic rules are simple:
> 
> This is all CPU oriented, I think Christian is talking about the case
> where ownership transfer happens without CPU involvement, such as via
> GPU waiting on a fence

Yeah for pure device2device handover the rule pretty much has to be that
any coherency management that needs to be done must be done from the
device side (flushing device side caches and stuff like that) only. But
under the assumption that _all_ cpu side management has been done already
before the first device access started.

And then map/unmap respectively begin/end_cpu_access can be used for what
they were meant for, with cpu side invalidation/flushing and stuff like that,
while having pretty clear handover/ownership rules and hopefully not doing
unnecessary flushes. And all that while allowing device access to be
pipelined. Worst case the exporter has to insert some pipelined cache
flushes as a dma_fence pipelined work of its own between the device access
when moving from one device to the other. That last part sucks a bit right
now, because we don't have any dma_buf_attachment method which does this
syncing without recreating the mapping, but in reality this is solved by
caching mappings in the exporter (well dma-buf layer) nowadays.

True concurrent access like vk/compute expects is still a model that
dma-buf needs to support on top, but that's a special case and pretty much
needs hw that supports such concurrent access without explicit handover
and fencing.

Aside from some historical accidents and still a few warts, I do think
dma-buf does support both of these models. Of course in the case of
gpu/drm drivers, userspace must know what's possible and act accordingly,
otherwise you just get to keep all the pieces.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: Try to address the DMA-buf coherency problem

2022-11-19 Thread Nicolas Dufresne
Le vendredi 18 novembre 2022 à 11:32 -0800, Rob Clark a écrit :
> On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne  wrote:
> > 
> > Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> > > > > DMA-Buf let's the exporter setup the DMA addresses the importer uses 
> > > > > to
> > > > > be able to directly decided where a certain operation should go. E.g. 
> > > > > we
> > > > > have cases where for example a P2P write doesn't even go to memory, 
> > > > > but
> > > > > rather a doorbell BAR to trigger another operation. Throwing in CPU
> > > > > round trips for explicit ownership transfer completely breaks that
> > > > > concept.
> > > > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > > > accepts two (or an array?) of devices and tells the caller whether the
> > > > devices need explicit ownership transfer.
> > > 
> > > No, exactly that's the concept I'm pushing back on very hard here.
> > > 
> > > In other words explicit ownership transfer is not something we would
> > > want as requirement in the framework, cause otherwise we break tons of
> > > use cases which require concurrent access to the underlying buffer.
> > 
> > I'm not pushing for this solution, but really felt the need to correct you 
> > here.
> > I have quite some experience with ownership transfer mechanism, as this is 
> > how
> > GStreamer framework works since 2000. Concurrent access is a really common 
> > use
> > cases and it is quite well defined in that context. The bracketing system 
> > (in
> > this case called map() unmap(), with flag stating the usage intention like 
> > reads
> > and write) is combined the the refcount. The basic rules are simple:
> 
> This is all CPU oriented, I think Christian is talking about the case
> where ownership transfer happens without CPU involvement, such as via
> GPU waiting on a fence

HW fences and proper ownership aren't incompatible at all. Even if you have no
software involved during the usage, software still needs to share the dmabuf (at
least once), and sharing modifies the ownership, and that can be made explicit.

p.s. I will agree if someone raises that this is totally off topic

Nicolas
> BR,
> -R



Re: Try to address the DMA-buf coherency problem

2022-11-18 Thread Rob Clark
On Thu, Nov 17, 2022 at 7:38 AM Nicolas Dufresne  wrote:
>
> Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> > > > DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> > > > be able to directly decided where a certain operation should go. E.g. we
> > > > have cases where for example a P2P write doesn't even go to memory, but
> > > > rather a doorbell BAR to trigger another operation. Throwing in CPU
> > > > round trips for explicit ownership transfer completely breaks that
> > > > concept.
> > > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > > accepts two (or an array?) of devices and tells the caller whether the
> > > devices need explicit ownership transfer.
> >
> > No, exactly that's the concept I'm pushing back on very hard here.
> >
> > In other words explicit ownership transfer is not something we would
> > want as requirement in the framework, cause otherwise we break tons of
> > use cases which require concurrent access to the underlying buffer.
>
> I'm not pushing for this solution, but really felt the need to correct you 
> here.
> I have quite some experience with ownership transfer mechanism, as this is how
> GStreamer framework works since 2000. Concurrent access is a really common use
> cases and it is quite well defined in that context. The bracketing system (in
> this case called map() unmap(), with flag stating the usage intention like 
> reads
> and write) is combined the the refcount. The basic rules are simple:

This is all CPU oriented, I think Christian is talking about the case
where ownership transfer happens without CPU involvement, such as via
GPU waiting on a fence

BR,
-R


Re: Try to address the DMA-buf coherency problem

2022-11-17 Thread Nicolas Dufresne
Le jeudi 17 novembre 2022 à 13:10 +0100, Christian König a écrit :
> > > DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> > > be able to directly decided where a certain operation should go. E.g. we
> > > have cases where for example a P2P write doesn't even go to memory, but
> > > rather a doorbell BAR to trigger another operation. Throwing in CPU
> > > round trips for explicit ownership transfer completely breaks that
> > > concept.
> > It sounds like we should have a dma_dev_is_coherent_with_dev() which
> > accepts two (or an array?) of devices and tells the caller whether the
> > devices need explicit ownership transfer.
> 
> No, exactly that's the concept I'm pushing back on very hard here.
> 
> In other words explicit ownership transfer is not something we would 
> want as requirement in the framework, cause otherwise we break tons of 
> use cases which require concurrent access to the underlying buffer.

I'm not pushing for this solution, but really felt the need to correct you here.
I have quite some experience with ownership transfer mechanisms, as this is how
the GStreamer framework has worked since 2000. Concurrent access is a really common use
case and it is quite well defined in that context. The bracketing system (in
this case called map()/unmap(), with a flag stating the usage intention like read
and write) is combined with the refcount. The basic rules are simple:

- An object with a refcount higher than 2 is shared, hence read-only
- An object with a refcount of one, mapped for writes, becomes exclusive
- Non-exclusive writes can be done, but that has to be explicit (intentional);
we didn't go as far as Rust in that domain
- Wrappers around these objects can use mechanisms like "copy-on-write" and can
also maintain the state of shadow buffers (e.g. GL upload slow cases) even with
concurrent access.

Just hope that clarifies it. The Rust language works, yet it's all based on explicit
ownership transfers. It's not limiting, but it requires a different way of
thinking about how data is to be accessed.
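
To make that bracketing concrete, a minimal GStreamer-flavoured sketch of those
rules (error handling trimmed):

    #include <string.h>
    #include <gst/gst.h>

    /* Sketch: map() declares the intention (write here) and, together
     * with the refcount, decides whether exclusive access is possible. */
    static void fill_black(GstBuffer *buf)
    {
            GstMapInfo info;

            if (!gst_buffer_is_writable(buf))   /* shared, hence read-only */
                    return;

            if (!gst_buffer_map(buf, &info, GST_MAP_WRITE))
                    return;

            memset(info.data, 0, info.size);

            gst_buffer_unmap(buf, &info);       /* hand the buffer back */
    }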

Nicolas



Re: Try to address the DMA-buf coherency problem

2022-11-17 Thread Christian König

Hi Tomasz,

Am 17.11.22 um 10:35 schrieb Tomasz Figa:

Hi Christian and everyone,

On Thu, Nov 3, 2022 at 4:14 AM Christian König
 wrote:

Am 02.11.22 um 18:10 schrieb Lucas Stach:

Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
[SNIP]

It would just be doing this for the importer and exactly that
would be bad design because we then have handling for the display driver
outside of the driver.


The driver would have to do those cache maintenance operations if it
directly worked with a non-coherent device. Doing it for the importer
is just doing it for another device, not the one directly managed by
the exporter.

I really don't see the difference to the other dma-buf ops: in
dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
the address space of the importer. Why would cache maintenance be any
different?

The issue here is the explicit ownership transfer.

We intentionally decided against that because it breaks tons of use
cases and is at least by me and a couple of others seen as generally
design failure of the Linux DMA-API.

First of all, thanks for starting the discussion and sorry for being
late to the party. May I ask you to keep me on CC for any changes that
touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay
being copied on the entire series, no need to pick the specific
patches. Thanks in advance.


Sorry for that, I've only added people involved in the previous 
discussion. Going to keep you in the loop.



I agree that we have some design issues in the current DMA-buf
framework, but I'd try to approach it a bit differently. Instead of
focusing on the issues in the current design, could we write down our
requirements and try to come up with how a correct design would look
like? (A lot of that has been already mentioned in this thread, but I
find it quite difficult to follow and it might not be a complete view
either.)


Well, exactly that's what we disagree on.

As far as I can see the current design of DMA-buf is perfectly capable 
of handling all the requirements.


A brief summary of the requirements with some implementation notes:

1. Device drivers export their memory as it is. E.g. no special handling 
for importers on the exporter side.
    If an importer can't deal with a specific format, layout, caching 
etc... of the data the correct approach is to reject the attachment.
    Those parameters are controlled by userspace and negotiating them 
is explicitly not part of the framework.


2. Accesses of the CPU to a buffer are bracketed in begin_cpu_access() 
and end_cpu_access() calls.

    Here we can insert the CPU cache invalidation/flushes as necessary.

3. Accesses of the device HW to a buffer are represented with dma_fence 
objects.
    It's explicitly allowed to have multiple devices access the buffer 
at the same time.


4. Access to the DMA-buf by the HW of an importer is setup by the exporter.
    In other words the exporter provides a bunch of DMA addresses the 
importer should access.

    The importer should not try to come up with those on its own.
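
A minimal sketch of the importer's side of that contract (error handling and
the newer locking rules around map_attachment are glossed over):

    #include <linux/dma-buf.h>
    #include <linux/dma-direction.h>
    #include <linux/scatterlist.h>

    /* Sketch: the importer only consumes the addresses the exporter hands
     * out via the sg_table; it never builds them itself. */
    static struct sg_table *import_buffer(struct device *dev,
                                          struct dma_buf *dmabuf)
    {
            struct dma_buf_attachment *attach;
            struct sg_table *sgt;

            attach = dma_buf_attach(dmabuf, dev);
            if (IS_ERR(attach))
                    return ERR_CAST(attach);

            /* These addresses may point at system memory, VRAM or even a
             * doorbell BAR; that is entirely the exporter's decision. */
            sgt = dma_buf_map_attachment(attach, DMA_BIDIRECTIONAL);
            if (IS_ERR(sgt))
                    dma_buf_detach(dmabuf, attach);

            return sgt;
    }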


That said, let me address a few aspects already mentioned, to make
sure that everyone is on the same page.


DMA-Buf let's the exporter setup the DMA addresses the importer uses to
be able to directly decided where a certain operation should go. E.g. we
have cases where for example a P2P write doesn't even go to memory, but
rather a doorbell BAR to trigger another operation. Throwing in CPU
round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which
accepts two (or an array?) of devices and tells the caller whether the
devices need explicit ownership transfer.


No, exactly that's the concept I'm pushing back on very hard here.

In other words explicit ownership transfer is not something we would 
want as requirement in the framework, cause otherwise we break tons of 
use cases which require concurrent access to the underlying buffer.


When a device driver needs explicit ownership transfer it's perfectly 
possible to implement this using the dma_fence objects mentioned above. 
E.g. drivers can already look at who is accessing a buffer currently and 
can even grab explicit ownership of it by adding their own dma_fence 
objects.
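
A rough sketch of how a driver could do that on top of the buffer's reservation
object (my_fence stands in for whatever dma_fence the driver signals when it is
done; the usage flags assume a recent kernel, and locking/error handling are
simplified):

    #include <linux/dma-buf.h>
    #include <linux/dma-resv.h>
    #include <linux/sched.h>

    /* Sketch: wait for everyone currently using the DMA-buf, then publish
     * our own fence so later users wait for us in turn. */
    static int grab_buffer_ownership(struct dma_buf *dmabuf,
                                     struct dma_fence *my_fence)
    {
            struct dma_resv *resv = dmabuf->resv;
            long wait;
            int ret;

            ret = dma_resv_lock(resv, NULL);
            if (ret)
                    return ret;

            /* Who is accessing the buffer right now? Wait for all of them. */
            wait = dma_resv_wait_timeout(resv, DMA_RESV_USAGE_BOOKKEEP,
                                         true, MAX_SCHEDULE_TIMEOUT);
            if (wait < 0) {
                    ret = wait;
                    goto out_unlock;
            }

            ret = dma_resv_reserve_fences(resv, 1);
            if (!ret)
                    dma_resv_add_fence(resv, my_fence, DMA_RESV_USAGE_WRITE);

    out_unlock:
            dma_resv_unlock(resv);
            return ret;
    }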


The only exception is CPU based access, e.g. when something is written 
with the CPU a cache flush might be necessary and when something is read 
with the CPU a cache invalidation might be necessary.



So if a device driver uses cached system memory on an architecture where
devices can't access it, the right approach is clearly to reject
the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".


Well that depends. As said above the exporter exports the buffer as it 
was allocated.


If that means that the exporter provides a piece of memory which requires 
CPU cache snooping to access correctly then the best thing we can do 

Re: Try to address the DMA-buf coherency problem

2022-11-17 Thread Tomasz Figa
Hi Christian and everyone,

On Thu, Nov 3, 2022 at 4:14 AM Christian König
 wrote:
>
> Am 02.11.22 um 18:10 schrieb Lucas Stach:
> > Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> > [SNIP]
> >> It would just be doing this for the importer and exactly that
> >> would be bad design because we then have handling for the display driver
> >> outside of the driver.
> >>
> > The driver would have to do those cache maintenance operations if it
> > directly worked with a non-coherent device. Doing it for the importer
> > is just doing it for another device, not the one directly managed by
> > the exporter.
> >
> > I really don't see the difference to the other dma-buf ops: in
> > dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
> > the address space of the importer. Why would cache maintenance be any
> > different?
>
> The issue here is the explicit ownership transfer.
>
> We intentionally decided against that because it breaks tons of use
> cases and is at least by me and a couple of others seen as generally
> design failure of the Linux DMA-API.

First of all, thanks for starting the discussion and sorry for being
late to the party. May I ask you to keep me on CC for any changes that
touch the V4L2 videobuf2 framework, as a maintainer of it? I'm okay
being copied on the entire series, no need to pick the specific
patches. Thanks in advance.

I agree that we have some design issues in the current DMA-buf
framework, but I'd try to approach it a bit differently. Instead of
focusing on the issues in the current design, could we write down our
requirements and try to come up with how a correct design would look
like? (A lot of that has been already mentioned in this thread, but I
find it quite difficult to follow and it might not be a complete view
either.)

That said, let me address a few aspects already mentioned, to make
sure that everyone is on the same page.

>
> DMA-Buf let's the exporter setup the DMA addresses the importer uses to
> be able to directly decided where a certain operation should go. E.g. we
> have cases where for example a P2P write doesn't even go to memory, but
> rather a doorbell BAR to trigger another operation. Throwing in CPU
> round trips for explicit ownership transfer completely breaks that concept.

It sounds like we should have a dma_dev_is_coherent_with_dev() which
accepts two (or an array?) of devices and tells the caller whether the
devices need explicit ownership transfer. Based on that, your drivers
would install the DMA completion (presumably IRQ) handlers or not.
It's necessary since it's not uncommon that devices A and B could be
in the same coherency domain, while C could be in a different one, but
you may still want them to exchange data through DMA-bufs. Even if it
means the need for some extra round trips it would likely be more
efficient than a full memory copy (might not be true 100% of the
time).
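
Something like the following, purely as a sketch of the proposed helper (it does
not exist, and equating coherency with dev_is_dma_coherent() on both sides is a
crude approximation of real topologies):

    #include <linux/device.h>
    #include <linux/dma-map-ops.h>

    /* Hypothetical: two devices that agree on whether they snoop the CPU
     * caches could, to a first approximation, skip explicit ownership
     * transfers between each other. */
    static bool dma_dev_is_coherent_with_dev(struct device *a, struct device *b)
    {
            return dev_is_dma_coherent(a) == dev_is_dma_coherent(b);
    }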

>
> Additional to that a very basic concept of DMA-buf is that the exporter
> provides the buffer as it is and just double checks if the importer can
> access it. For example we have XGMI links which makes memory accessible
> to other devices on the same bus, but not to PCIe device and not even to
> the CPU. Otherwise you wouldn't be able to implement things like secure
> decoding where the data isn't even accessible outside the device to
> device link.

Fully agreed.

>
> So if a device driver uses cached system memory on an architecture which
> devices which can't access it the right approach is clearly to reject
> the access.

I'd like to accent the fact that "requires cache maintenance" != "can't access".

>
> What we can do is to reverse the role of the exporter and importer and
> let the device which needs uncached memory take control. This way this
> device can insert operations as needed, e.g. flush read caches or
> invalidate write caches.
>

(Putting aside the cases when the access is really impossible at all.)
Correct me if I'm wrong, but isn't that because we don't have a proper
hook for the importer to tell the DMA-buf framework to prepare the
buffer for its access?

> This is what we have already done in DMA-buf and what already works
> perfectly fine with use cases which are even more complicated than a
> simple write cache invalidation.
>
>  This is just a software solution which works because of coincident and
>  not because of engineering.
> >>> By mandating a software fallback for the cases where you would need
> >>> bracketed access to the dma-buf, you simply shift the problem into
> >>> userspace. Userspace then creates the bracket by falling back to some
> >>> other import option that mostly do a copy and then the appropriate
> >>> cache maintenance.
> >>>
> >>> While I understand your sentiment about the DMA-API design being
> >>> inconvenient when things are just coherent by system design, the DMA-
> >>> API design wasn't done this way due to bad engineering, but due to the
> >>> fact that performant 

Re: Try to address the DMA-buf coherency problem

2022-11-04 Thread Christian König

Am 04.11.22 um 15:58 schrieb Rob Clark:

On Wed, Nov 2, 2022 at 5:21 AM Christian König
 wrote:

Hi Lucas,

Am 02.11.22 um 12:39 schrieb Lucas Stach:

Hi Christian,

going to reply in more detail when I have some more time, so just some
quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

[SNIP]

As far as I can see it you guys just allocate a buffer from a V4L2
device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys haven't pushed
back on this practice already.

This only works because the Wayland as well as X display pipeline is
smart enough to insert an extra copy when it finds that an imported
buffer can't be used as a framebuffer directly.


With bracketed access you could even make this case work, as the dGPU
would be able to slurp a copy of the dma-buf into LMEM for scanout.

Well, this copy is what we are trying to avoid here. The codec should
pump the data into LMEM in the first place.


For the dGPU VRAM case, shouldn't this be amdgpu re-importing its own
dmabuf, which more or less bypasses dma-buf to get back the backing
GEM obj?


When the codec and scanout are on the same device, then yes.

But we already had cases where the codec was on the dGPU and scanout 
happened on the integrated engine.


Regards,
Christian.



BR,
-R




Re: Try to address the DMA-buf coherency problem

2022-11-04 Thread Rob Clark
On Wed, Nov 2, 2022 at 5:21 AM Christian König
 wrote:
>
> Hi Lucas,
>
> Am 02.11.22 um 12:39 schrieb Lucas Stach:
> > Hi Christian,
> >
> > going to reply in more detail when I have some more time, so just some
> > quick thoughts for now.
> >
> > Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
> >> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> >>> [SNIP]
> >> As far as I can see it you guys just allocate a buffer from a V4L2
> >> device, fill it with data and send it to Wayland for displaying.
> >>
> >> To be honest I'm really surprised that the Wayland guys hasn't pushed
> >> back on this practice already.
> >>
> >> This only works because the Wayland as well as X display pipeline is
> >> smart enough to insert an extra copy when it find that an imported
> >> buffer can't be used as a framebuffer directly.
> >>
> > With bracketed access you could even make this case work, as the dGPU
> > would be able to slurp a copy of the dma-buf into LMEM for scanout.
>
> Well, this copy is what we are trying to avoid here. The codec should
> pump the data into LMEM in the first place.
>

For the dGPU VRAM case, shouldn't this be amdgpu re-importing its own
dmabuf, which more or less bypasses dma-buf to get back the backing
GEM obj?

BR,
-R


Re: Try to address the DMA-buf coherency problem

2022-11-04 Thread Christian König

Am 04.11.22 um 14:38 schrieb Nicolas Dufresne:

Le vendredi 04 novembre 2022 à 10:03 +0100, Christian König a écrit :

Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:

[SNIP]

Were there APIs suggested to actually make it manageable by userland to allocate
from the GPU? Yes, that's what the Linux Device Allocator idea is for. Is that API
ready? No.

Well, that stuff is absolutely ready:
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_heap.c#L175
What do you think I'm talking about all the time?

I'm aware of DMA Heaps; they still have a few gaps, but this is unrelated to coherency (we
can discuss offline, with Daniel S.). As for DMABuf Heaps, they are used in many forks by
vendors in production. There is an upstream proposal for GStreamer, but review
comments were never addressed; in short, it's stalled and waiting for a
volunteer. It might also be based on a very old implementation of DMABuf Heaps, and
needs to be verified in depth for sure, as time has passed.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1391


Well this is actually quite related to coherency.

As a general purpose framework it's mandatory for DMA Heaps to support the 
coherency callbacks. And as far as I can see they are indeed correctly 
implemented.


So if you allocate from a DMA Heap it should be guaranteed that sharing 
with all devices works correctly, no matter if they write with the CPU or 
don't snoop the CPU cache. That's essentially what those heaps are good for.


I just don't know how you should select between the contiguous and the 
system heap since I'm not deep enough in the userspace side of this.


We should have come up with heaps earlier and not implemented so many 
different system memory allocators in the drivers in the first place.
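
For what it's worth, userspace can at least enumerate what the kernel exposes;
knowing which entry to pick is the part that's missing (heap names are
platform-specific, "system" being the common one):

    #include <dirent.h>
    #include <stdio.h>

    /* Sketch: list the heaps this kernel exposes under /dev/dma_heap. */
    static void list_dma_heaps(void)
    {
            DIR *dir = opendir("/dev/dma_heap");
            struct dirent *ent;

            if (!dir)
                    return;

            while ((ent = readdir(dir)) != NULL) {
                    if (ent->d_name[0] == '.')
                            continue;
                    /* e.g. "system", or the name of a CMA region */
                    printf("heap: %s\n", ent->d_name);
            }

            closedir(dir);
    }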



DMA-buf has a lengthy section about CPU access to buffers and clearly
documents how all of that is supposed to work:
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L1160
This includes braketing of CPU access with dma_buf_begin_cpu_access()
and dma_buf_end_cpu_access(), as well as transaction management between
devices and the CPU and even implicit synchronization.

This specification is then implemented by the different drivers
including V4L2:
https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf2/videobuf2-dma-sg.c#L473

As well as the different DRM drivers:
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L117
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L234

I know, I implemented the userspace bracketing for this in GStreamer [1] before
DMA-buf Heaps were merged and was one of the reporters of the missing bracketing in
VB2. It was tested against the i915 driver. Note that this is just a fallback and the
performance is terrible: memory exported by (at least my old) i915 HW is not
CPU cacheable. Though, between a corrupted image and bad performance, or just
bad performance, we decided the latter was better. When the DMA-buf is backed by
CPU cacheable memory, performance is great and the CPU fallback works. Work is in
progress to handle these two cases better generically. For now the app sometimes
needs to get involved, but that only happens in embedded/controlled kinds of use
cases. What matters is that applications that need this can do it.

[1] 
https://gitlab.freedesktop.org/gstreamer/gstreamer/-/blob/main/subprojects/gst-plugins-base/gst-libs/gst/allocators/gstdmabuf.c


Yeah, I've heard complaints about that before, with people asking if we
couldn't map buffers exported by the GPU as cacheable and then flush them
on demand.


For i915 that's more or less possible because it just uses stolen system 
memory, but pretty much any other dGPU uses local memory and mapping a 
PCIe BAR cacheable is really a no-go.



This design was then used by us with various media players on different
customer projects, including QNAP https://www.qnap.com/en/product/ts-877
as well as the newest Tesla
https://www.amd.com/en/products/embedded-automotive-solutions

I won't go into the details here, but we are using exactly the approach
I've outlined to let userspace control the DMA between the different
devices in question. I'm one of the main designers of that, and our
multimedia and Mesa teams have upstreamed quite a number of changes for
this project.

I'm not that familiar with the different ARM-based solutions because we are
only just getting results showing that this starts to work with AMD GPUs, but
I'm pretty sure that the design should be able to handle those as well.

So we have clearly proven that this design works, even with special
requirements which are way more complex than what we are discussing
here. We had cases where we used GStreamer to feed DMA-buf handles into
multiple devices with different format requirements, and that seems to
work fine.

Sounds like you have a love/hate relationship with GStreamer. Glad the framework
is working for you too. 

Re: Try to address the DMA-buf coherency problem

2022-11-04 Thread Nicolas Dufresne
Le vendredi 04 novembre 2022 à 10:03 +0100, Christian König a écrit :
> Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:
> > [SNIP]
> > 
> > Were there APIs suggested to actually make it manageable for userland to
> > allocate from the GPU? Yes, that's what the Linux Device Allocator idea is
> > for. Is that API ready? No.
> 
> Well, that stuff is absolutely ready: 
> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_heap.c#L175
>  
> What do you think I'm talking about all the time?

I'm aware of DMA Heaps; they still have a few gaps, but that is unrelated to coherency
(we can discuss offline, with Daniel S.). DMA-buf Heaps are used in many forks by
vendors in production. There is an upstream proposal for GStreamer, but the review
comments were never addressed; in short, it's stalled and waiting for a volunteer.
It might also be based on a very old implementation of DMA-buf Heaps and needs to
be verified in depth, since quite some time has passed.

https://gitlab.freedesktop.org/gstreamer/gstreamer/-/merge_requests/1391

> 
> DMA-buf has a lengthy section about CPU access to buffers and clearly 
> documents how all of that is supposed to work: 
> https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L1160
>  
> This includes bracketing of CPU access with dma_buf_begin_cpu_access()
> and dma_buf_end_cpu_access(), as well as transaction management between 
> devices and the CPU and even implicit synchronization.
> 
> This specification is then implemented by the different drivers 
> including V4L2: 
> https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf2/videobuf2-dma-sg.c#L473
> 
> As well as the different DRM drivers: 
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L117
>  
> https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L234

I know, I implemented the userspace bracketing for this in GStreamer [1] before
DMA-buf Heaps were merged and was one of the reporters of the missing bracketing in
VB2. It was tested against the i915 driver. Note that this is just a fallback and the
performance is terrible: memory exported by (at least my old) i915 HW is not
CPU cacheable. Though, between a corrupted image and bad performance, or just
bad performance, we decided the latter was better. When the DMA-buf is backed by
CPU cacheable memory, performance is great and the CPU fallback works. Work is in
progress to handle these two cases better generically. For now the app sometimes
needs to get involved, but that only happens in embedded/controlled kinds of use
cases. What matters is that applications that need this can do it.

[1] 
https://gitlab.freedesktop.org/gstreamer/gstreamer/-/blob/main/subprojects/gst-plugins-base/gst-libs/gst/allocators/gstdmabuf.c

> 
> This design was then used by us with various media players on different 
> customer projects, including QNAP https://www.qnap.com/en/product/ts-877 
> as well as the newest Tesla 
> https://www.amd.com/en/products/embedded-automotive-solutions
> 
> I won't go into the details here, but we are using exactly the approach 
> I've outlined to let userspace control the DMA between the different 
> devices in question. I'm one of the main designers of that, and our 
> multimedia and Mesa teams have upstreamed quite a number of changes for 
> this project.
> 
> I'm not that familiar with the different ARM-based solutions because we are 
> only just getting results showing that this starts to work with AMD GPUs, but 
> I'm pretty sure that the design should be able to handle those as well.
> 
> So we have clearly proven that this design works, even with special 
> requirements which are way more complex than what we are discussing 
> here. We had cases where we used GStreamer to feed DMA-buf handles into 
> multiple devices with different format requirements, and that seems to 
> work fine.

Sounds like you have a love/hate relationship with GStreamer. Glad the framework
is working for you too. The framework has had bidirectional memory allocation
for over a decade, and it also has context sharing for stacks like
D3D11/D3D12/GL/Vulkan/CUDA etc. I honestly didn't understand what you were
complaining about. As a vendor, you can solve all of this in your BSP. Though
translating BSP patches into generic upstreamable features is not as simple.
The solution that works for a vendor is usually the most cost effective one. I'm
sure Tesla or AMD Automotive are no exceptions.

> 
> -
> What is clearly a bug in the kernel is that we don't reject things which 
> won't work correctly, and this is what this patch addresses. What we 
> could talk about is backward compatibility for this patch, because it 
> might look like it breaks things which previously used to work at least 
> partially.

I did read your code review (I don't class this discussion as a code review).
You were asked to address several issues; clearly a v2 is needed.

1. Rob Clark stated the coherency is 

Re: Try to address the DMA-buf coherency problem

2022-11-04 Thread Christian König

Am 03.11.22 um 23:16 schrieb Nicolas Dufresne:

[SNIP]

We already had numerous projects where we reported this practice as bugs
to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.

Links? Remember that I do read every single bug and email around the GStreamer
project. I maintain both the older and newer V4L2 support in there. I also
contributed a lot to the mechanism GStreamer has in place to reverse the
allocation. In fact, it's implemented; the problem is that on generic Linux
the receiver elements, like the GL element and the display sink, don't have any
API they can rely on to allocate memory. Thus, they don't implement what we call
the allocation offer in GStreamer terms. Very often though, on other modern OSes
or APIs like VA, the memory offer is replaced by a context. So the allocation is
done from a "context" which is neither an importer nor an exporter. This is
mostly found on macOS and Windows.
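
For readers not familiar with GStreamer, the "allocation offer" mentioned here
is the downstream side of the ALLOCATION query. A rough sketch of what a sink
element would have to implement might look like the following; the element, the
plain gst_buffer_pool_new() pool and the fixed pool numbers are hypothetical
placeholders, not an existing upstream element:

    #include <gst/gst.h>
    #include <gst/base/gstbasesink.h>

    /* Wired up via GST_BASE_SINK_CLASS(klass)->propose_allocation in class_init.
     * Answers the ALLOCATION query by proposing a pool the sink can display from. */
    static gboolean my_sink_propose_allocation(GstBaseSink *sink, GstQuery *query)
    {
            GstCaps *caps;
            gboolean need_pool;
            GstBufferPool *pool;

            gst_query_parse_allocation(query, &caps, &need_pool);
            if (!need_pool || caps == NULL)
                    return TRUE;

            /* In a real element this pool would hand out device-specific
             * (e.g. dmabuf) memory and the buffer size would be derived from
             * the caps; the generic pool and numbers here are placeholders. */
            pool = gst_buffer_pool_new();
            gst_query_add_allocation_pool(query, pool, 0, 2, 0);
            gst_object_unref(pool);

            return TRUE;
    }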

Were there APIs suggested to actually make it manageable for userland to allocate
from the GPU? Yes, that's what the Linux Device Allocator idea is for. Is that API
ready? No.


Well, that stuff is absolutely ready: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/heaps/system_heap.c#L175 
What do you think I'm talking about all the time?


DMA-buf has a lengthy section about CPU access to buffers and clearly 
documents how all of that is supposed to work: 
https://elixir.bootlin.com/linux/latest/source/drivers/dma-buf/dma-buf.c#L1160 
This includes bracketing of CPU access with dma_buf_begin_cpu_access()
and dma_buf_end_cpu_access(), as well as transaction management between 
devices and the CPU and even implicit synchronization.


This specification is then implemented by the different drivers 
including V4L2: 
https://elixir.bootlin.com/linux/latest/source/drivers/media/common/videobuf2/videobuf2-dma-sg.c#L473


As well as the different DRM drivers: 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/i915/gem/i915_gem_dmabuf.c#L117 
https://elixir.bootlin.com/linux/latest/source/drivers/gpu/drm/amd/amdgpu/amdgpu_dma_buf.c#L234


This design was then used by us with various media players on different 
customer projects, including QNAP https://www.qnap.com/en/product/ts-877 
as well as the newest Tesla 
https://www.amd.com/en/products/embedded-automotive-solutions


I won't go into the details here, but we are using exactly the approach 
I've outlined to let userspace control the DMA between the different 
devices in question. I'm one of the main designers of that, and our 
multimedia and Mesa teams have upstreamed quite a number of changes for 
this project.


I'm not that familiar with the different ARM-based solutions because we are 
only just getting results showing that this starts to work with AMD GPUs, but 
I'm pretty sure that the design should be able to handle those as well.


So we have clearly proven that this design works, even with special 
requirements which are way more complex than what we are discussing 
here. We had cases where we used GStreamer to feed DMA-buf handles into 
multiple devices with different format requirements, and that seems to 
work fine.


-

But enough of this rant. As I wrote to Lucas as well, this doesn't help us 
any further in the technical discussion.


The only technical argument I have is that if some userspace 
applications fail to use the provided UAPI while others use it correctly 
then this is clearly not a good reason to change the UAPI, but rather an 
argument to change the applications.


If the application should be kept simple and device independent, then 
allocating the buffer from the device independent DMA heaps would be 
enough as well, because that provider implements the necessary handling 
for dma_buf_begin_cpu_access() and dma_buf_end_cpu_access().


I'm a bit surprised that we are arguing about stuff like this because we 
spent a lot of effort trying to document it. Daniel gave me the job to 
fix this documentation, but after reading through it multiple times now 
I can't seem to find where the design and the desired behavior are unclear.


What is clearly a bug in the kernel is that we don't reject things which 
won't work correctly, and this is what this patch addresses. What we 
could talk about is backward compatibility for this patch, because it 
might look like it breaks things which previously used to work at least 
partially.


Regards,
Christian.


Re: Try to address the DMA-buf coherency problem

2022-11-03 Thread Nicolas Dufresne
Le mercredi 02 novembre 2022 à 12:18 +0100, Christian König a écrit :
> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > [SNIP]
> > > > But the client is just a video player. It doesn't understand how to
> > > > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > > > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > > > response to the client.
> > > Well exactly that's the point I'm raising: The client *must* understand
> > > that!
> > > 
> > > See we need to be able to handle all restrictions here, coherency of the
> > > data is just one of them.
> > > 
> > > For example the much more important question is the location of the data
> > > and for this allocating from the V4L2 device is in most cases just not
> > > going to fly.
> > It feels like this is a generic statement and there is no reason it could 
> > not be
> > the other way around.
> 
> And exactly that's my point. You always need to look at both ways to 
> share the buffer and can't assume that one will always work.
> 
> As far as I can see, you guys just allocate a buffer from a V4L2 
> device, fill it with data and send it to Wayland for displaying.

That paragraph is a bit sloppy. By "you guys" you mean what exactly? Normal
users will let the V4L2 device allocate and write into its own memory (the device
fills it, not "you guys"). This is done like this simply because it is
guaranteed to work with the V4L2 device. Most V4L2 devices produce pixel formats
and layouts known to userspace, for which userspace knows for sure it can
implement a GPU shader or a software fallback. I have yet to see one of these
formats that cannot be efficiently imported into a modern GPU and converted using
shaders. I'm not entirely sure what/which GPU a dGPU is compared to a GPU, btw.

In many cases, camera-like V4L2 devices will have one producer for many
consumers. Consider a photo application: the stream will likely be captured
and displayed while being encoded by one or more codecs and streamed to
a machine learning model for analysis. The software complexity of communicating
back the list of receiver devices and implementing all their non-standard ways to
allocate memory, with all the resulting combinations of trial and error, is just
ridiculously high. Remember that each GPU has its own allocation methods and corner
cases; this is simply not manageable by "you guys", which I pretty much assume is
everyone writing software for generic Linux these days (non-Android/ChromeOS).

> 
> To be honest I'm really surprised that the Wayland guys haven't pushed 
> back on this practice already.
> 
> This only works because both the Wayland and X display pipelines are 
> smart enough to insert an extra copy when they find that an imported 
> buffer can't be used as a framebuffer directly.

This is a bit inaccurate. The compositors I've worked with (GNOME and Weston)
will only memcpy SHM. For DMA-buf, they will fail the import if it's not usable
either by the display or by the GPU. Especially on the GPU side though (which is the
ultimate compositor fallback), there exist efficient HW copy mechanisms that may
be used, and this is fine, since unlike your scanout example it won't be
uploading over and over, but will later re-display from a remote copy (or a
transformed copy). Or if you prefer, it's cached at the cost of higher memory
usage.

I think it would be preferable to speak about device-to-device sharing, since
V4L2 vs GPU is not really representative of the problem. I think "V4L2 vs GPU" and
"you guys" simply contribute to the never-ending and needless friction around
the difficulty that exists with the current support for memory sharing in Linux.

> 
> >   I have a colleague who integrated a PCIe CODEC (Blaize Xplorer
> > X1600P PCIe Accelerator) hosting its own RAM. There were a large number
> > of ways to use it. Of course, in the current state of DMA-buf you have to
> > be an exporter to do anything fancy, but it did not have to be like this;
> > it's a design choice. I'm not sure what method was used in the end, the
> > driver isn't yet upstream, so maybe that is not even final. What I know is
> > that there are various conditions you may use the CODEC for, for which the
> > optimal location will vary. As an example, using the post processor or
> > not; see my next comment for more details.
> 
> Yeah, and stuff like this was already discussed multiple times. Local 
> memory of devices can only be made available by the exporter, not the 
> importer.
> 
> So in the case of separated camera and encoder you run into exactly the 
> same limitation that some device needs the allocation to happen on the 
> camera while others need it on the encoder.
> 
> > > The more common case is that you need to allocate from the GPU and then
> > > import that into the V4L2 device. The background is that all dGPUs I
> > > know of need the data inside local memory (VRAM) to be able to scan out
> > > from it.
> > The reality is that what is 

Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Lucas Stach
Am Mittwoch, dem 02.11.2022 um 20:13 +0100 schrieb Christian König:
> Am 02.11.22 um 18:10 schrieb Lucas Stach:
> > Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> > [SNIP]
> > > It would just be doing this for the importer and exactly that
> > > would be bad design because we then have handling for the display driver
> > > outside of the driver.
> > > 
> > The driver would have to do those cache maintenance operations if it
> > directly worked with a non-coherent device. Doing it for the importer
> > is just doing it for another device, not the one directly managed by
> > the exporter.
> > 
> > I really don't see the difference to the other dma-buf ops: in
> > dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
> > the address space of the importer. Why would cache maintenance be any
> > different?
> 
> The issue here is the explicit ownership transfer.
> 
> We intentionally decided against that because it breaks tons of use 
> cases and is at least by me and a couple of others seen as generally 
> design failure of the Linux DMA-API.
> 
> DMA-buf lets the exporter set up the DMA addresses the importer uses and 
> thereby directly decide where a certain operation should go. E.g. we 
> have cases where for example a P2P write doesn't even go to memory, but 
> rather to a doorbell BAR to trigger another operation. Throwing in CPU 
> round trips for explicit ownership transfer completely breaks that concept.

I'm not sure where you see the CPU in that scenario. An explicit
ownership transfer could also trigger things like a cache invalidate on
the receiving end of a P2P transfer. A CPU cache invalidate/flush is
just one possible use-case of the ownership transfer.

And sure, some uses will just not work when an explicit ownership
transfer is required, as the usage model and/or API doesn't allow for
it. But lots of uses will work with explicit ownership transfer; in fact
there is no other way to properly deal with systems where the
sharing masters may be non-coherent, which is the norm rather than the
exception in lots of embedded uses, regardless of the inconvenience
this might cause for some other areas.

> 
> In addition to that, a very basic concept of DMA-buf is that the exporter 
> provides the buffer as it is and just double checks whether the importer can 
> access it. For example we have XGMI links which make memory accessible 
> to other devices on the same bus, but not to PCIe devices and not even to 
> the CPU. Otherwise you wouldn't be able to implement things like secure 
> decoding where the data isn't even accessible outside the device to 
> device link.
> 
> So if a device driver uses cached system memory on an architecture with 
> devices which can't access it, the right approach is clearly to reject 
> the access.
> 
> What we can do is to reverse the role of the exporter and importer and 
> let the device which needs uncached memory take control. This way this 
> device can insert operations as needed, e.g. flush read caches or 
> invalidate write caches.
> 
And that will only solve one aspect of the allocation restriction maze,
namely the cacheable attribute, but it may still break things like
contiguity requirements.

> This is what we have already done in DMA-buf and what already works 
> perfectly fine with use cases which are even more complicated than a 
> simple write cache invalidation.
> 
> > > > > This is just a software solution which works because of coincidence and
> > > > > not because of engineering.
> > > > By mandating a software fallback for the cases where you would need
> > > > bracketed access to the dma-buf, you simply shift the problem into
> > > > userspace. Userspace then creates the bracket by falling back to some
> > > > other import option that mostly do a copy and then the appropriate
> > > > cache maintenance.
> > > > 
> > > > While I understand your sentiment about the DMA-API design being
> > > > inconvenient when things are just coherent by system design, the DMA-
> > > > API design wasn't done this way due to bad engineering, but due to the
> > > > fact that performant DMA access on some systems just require this kind
> > > > of bracketing.
> > > Well, this is exactly what I'm criticizing about the DMA-API. Instead of
> > > giving you a proper error code when something won't work in a specific
> > > way, it just tries to hide the requirements inside the DMA layer.
> > > 
> > > For example when your device can only access 32 bits the DMA-API
> > > transparently inserts bounce buffers instead of giving you a proper error
> > > code saying that the memory in question can't be accessed.
> > > 
> > > This just tries to hide the underlying problem instead of pushing it
> > > into the upper layer where it can be handled much more gracefully.
> > How would you expect the DMA API to behave on a system where the device
> > driver is operating on cacheable memory, but the device is non-
> > coherent? Telling the driver that this just doesn't 

Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Christian König

Am 02.11.22 um 18:10 schrieb Lucas Stach:

Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
[SNIP]

It would just be doing this for the importer and exactly that
would be bad design because we then have handling for the display driver
outside of the driver.


The driver would have to do those cache maintenance operations if it
directly worked with a non-coherent device. Doing it for the importer
is just doing it for another device, not the one directly managed by
the exporter.

I really don't see the difference to the other dma-buf ops: in
dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
the address space of the importer. Why would cache maintenance be any
different?


The issue here is the explicit ownership transfer.

We intentionally decided against that because it breaks tons of use 
cases and is at least by me and a couple of others seen as generally 
design failure of the Linux DMA-API.


DMA-buf lets the exporter set up the DMA addresses the importer uses and 
thereby directly decide where a certain operation should go. E.g. we 
have cases where for example a P2P write doesn't even go to memory, but 
rather to a doorbell BAR to trigger another operation. Throwing in CPU 
round trips for explicit ownership transfer completely breaks that concept.


In addition to that, a very basic concept of DMA-buf is that the exporter 
provides the buffer as it is and just double checks whether the importer can 
access it. For example we have XGMI links which make memory accessible 
to other devices on the same bus, but not to PCIe devices and not even to 
the CPU. Otherwise you wouldn't be able to implement things like secure 
decoding where the data isn't even accessible outside the device to 
device link.


So if a device driver uses cached system memory on an architecture with 
devices which can't access it, the right approach is clearly to reject 
the access.


What we can do is to reverse the role of the exporter and importer and 
let the device which needs uncached memory take control. This way this 
device can insert operations as needed, e.g. flush read caches or 
invalidate write caches.


This is what we have already done in DMA-buf and what already works 
perfectly fine with use cases which are even more complicated than a 
simple write cache invalidation.
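
To make the "reversed roles" idea concrete: a non-coherent exporter that wants
to stay in control would implement the CPU-access callbacks itself. The
following is only a minimal sketch under the assumption that the exporter keeps
the device and the mapped sg-table in its private data; it is not a complete
dma-buf exporter:

    #include <linux/dma-buf.h>
    #include <linux/dma-mapping.h>

    struct my_exporter_buf {
            struct device *dev;     /* device the sg table was mapped for */
            struct sg_table *sgt;
    };

    static int my_begin_cpu_access(struct dma_buf *dmabuf,
                                   enum dma_data_direction dir)
    {
            struct my_exporter_buf *buf = dmabuf->priv;

            /* Hand ownership to the CPU: invalidate stale cache lines. */
            dma_sync_sgtable_for_cpu(buf->dev, buf->sgt, dir);
            return 0;
    }

    static int my_end_cpu_access(struct dma_buf *dmabuf,
                                 enum dma_data_direction dir)
    {
            struct my_exporter_buf *buf = dmabuf->priv;

            /* Hand ownership back to the device: write back dirty cache lines. */
            dma_sync_sgtable_for_device(buf->dev, buf->sgt, dir);
            return 0;
    }

    static const struct dma_buf_ops my_dmabuf_ops = {
            /* .attach, .map_dma_buf, .unmap_dma_buf, .release, .mmap omitted */
            .begin_cpu_access = my_begin_cpu_access,
            .end_cpu_access   = my_end_cpu_access,
    };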



This is just a software solution which works because of coincidence and
not because of engineering.

By mandating a software fallback for the cases where you would need
bracketed access to the dma-buf, you simply shift the problem into
userspace. Userspace then creates the bracket by falling back to some
other import option that mostly do a copy and then the appropriate
cache maintenance.

While I understand your sentiment about the DMA-API design being
inconvenient when things are just coherent by system design, the DMA-
API design wasn't done this way due to bad engineering, but due to the
fact that performant DMA access on some systems just require this kind
of bracketing.

Well, this is exactly what I'm criticizing about the DMA-API. Instead of
giving you a proper error code when something won't work in a specific
way, it just tries to hide the requirements inside the DMA layer.

For example when your device can only access 32 bits the DMA-API
transparently inserts bounce buffers instead of giving you a proper error
code saying that the memory in question can't be accessed.

This just tries to hide the underlying problem instead of pushing it
into the upper layer where it can be handled much more gracefully.
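
For context, the closest thing a driver has today is declaring its addressing
limits up front; a small sketch with a hypothetical probe-time helper is below.
Note that even then the streaming DMA API may still fall back to SWIOTLB bounce
buffering for out-of-range pages rather than returning an error, which is
exactly the behaviour being criticized here:

    #include <linux/dma-mapping.h>

    /* Hypothetical probe-time helper: declare that the device can only
     * address 32 bits and bubble up an error if that cannot be satisfied. */
    static int my_probe_dma_limits(struct device *dev)
    {
            return dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
    }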

How would you expect the DMA API to behave on a system where the device
driver is operating on cacheable memory, but the device is non-
coherent? Telling the driver that this just doesn't work?


Yes, exactly that.

It's the job of the higher level to prepare a buffer the device can work 
with, not the job of the lower level.


In other words in a proper design the higher level would prepare the 
memory in a way the device driver can work with it, not the other way 
around.


When a device driver gets memory it can't work with, the correct response 
is to throw an error and bubble that up into a layer where it can be 
handled gracefully.


For example, instead of the DMA layer using bounce buffers under the hood, 
the MM should make sure that when you call read() with O_DIRECT the 
pages in question are accessible by the device.



It's a use-case that is working fine today with many devices (e.g. network
adapters) in the ARM world, exactly because the architecture specific
implementation of the DMA API inserts the cache maintenance operations
on buffer ownership transfer.


Yeah, I'm perfectly aware of that. The problem is that exactly that 
design totally breaks GPUs on Xen DOM0 for example.


And Xen is just one example, I can certainly say from experience that 
this design was a really really bad idea because it favors just one use 
case while making other use cases practically impossible if not really 
hard to 

Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Lucas Stach
Am Mittwoch, dem 02.11.2022 um 13:21 +0100 schrieb Christian König:
> Hi Lucas,
> 
> Am 02.11.22 um 12:39 schrieb Lucas Stach:
> > Hi Christian,
> > 
> > going to reply in more detail when I have some more time, so just some
> > quick thoughts for now.
> > 
> > Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
> > > Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > > > [SNIP]
> > > As far as I can see, you guys just allocate a buffer from a V4L2
> > > device, fill it with data and send it to Wayland for displaying.
> > > 
> > > To be honest I'm really surprised that the Wayland guys haven't pushed
> > > back on this practice already.
> > > 
> > > This only works because both the Wayland and X display pipelines are
> > > smart enough to insert an extra copy when they find that an imported
> > > buffer can't be used as a framebuffer directly.
> > > 
> > With bracketed access you could even make this case work, as the dGPU
> > would be able to slurp a copy of the dma-buf into LMEM for scanout.
> 
> Well, this copy is what we are trying to avoid here. The codec should 
> pump the data into LMEM in the first place.
> 
That's very use-case specific. If the codec used those frames as
reference frames for further coding operations, allocating them in LMEM
of a remote device doesn't sound like a clever idea. But then I think
this is a whole different discussion on its own.

> > > > The only case that commonly fails is whenever we try to display UVC
> > > > created dmabufs,
> > > Well, exactly that's not correct! The whole x86 use cases of direct
> > > display for dGPUs are broken because media players think they can do the
> > > simple thing and offload all the problematic cases to the display server.
> > > 
> > > This is absolutely *not* the common use case you describe here, but
> > > rather something completely special to ARM.
> > It's the normal case for a lot of ARM SoCs.
> 
> Yeah, but it's not the normal case for everybody.
> 
> We had numerous projects where customers wanted to pump video data 
> directly from a decoder into a GPU frame or from a GPU frame into an 
> encoder.
> 
> The fact that media frameworks don't support that out of the box is 
> simply a bug.
> 
> > That world is certainly not
> > any less big than the x86 dGPU world. A huge number of devices are ARM
> > based set-top boxes and other video players. Just because it is a
> > special case for you doesn't mean it's a global special case.
> 
> Ok, let's stop with that. This isn't helpful in the technical discussion.
> 
Agreed. I think we should try to understand the fact that there are
very different world-views and use-cases for dma-buf right now and that
breaking one or the other is simply no option. What may be an obscure
special case on x86 may be very common on ARM and vice versa.

> > 
> > > That we haven't heard anybody screaming that x86 doesn't work is just
> > > because we handle the case that a buffer isn't directly displayable in
> > > X/Wayland anyway, but this is absolutely not the optimal solution.
> > > 
> > > The argument that you want to keep the allocation on the codec side is
> > > completely false as far as I can see.
> > > 
> > > We already had numerous projects where we reported this practice as bugs
> > > to the GStreamer and FFMPEG project because it won't work on x86 with 
> > > dGPUs.
> > > 
> > And on a lot of ARM SoCs it's exactly the right thing to do.
> 
> Yeah and that's fine, it just doesn't seem to work in all cases.
> 
> For both x86 as well as the case here, where the CPU cache might be dirty, 
> the exporter needs to be the device with the requirements.
> 
> For x86 dGPUs the requirement is that the backing store is some local 
> memory. For the non-coherent ARM devices it's that the CPU cache is not dirty.
> 
> For a device driver which solely works with cached system memory, 
> inserting cache flush operations is something it would never do for 
> itself. 
> 
It's exactly what a device driver working with cacheable memory buffers
on an architecture with non-coherent DMA masters would do. In fact it's
the main reason why the bracketing with explicit ownership transfer
exists in the DMA API: to allow the DMA API implementation to do the
necessary cache maintenance operations on architectures that need them.
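
As an illustration of the ownership-transfer bracketing described here, a
minimal streaming-DMA sketch for a hypothetical driver with a cacheable buffer;
on a non-coherent system the sync calls are where the architecture code does
the cache maintenance:

    #include <linux/dma-mapping.h>

    static void example_receive(struct device *dev, void *cpu_buf, size_t len)
    {
            dma_addr_t dma = dma_map_single(dev, cpu_buf, len, DMA_FROM_DEVICE);

            if (dma_mapping_error(dev, dma))
                    return;

            /* ... device owns the buffer, DMA into it happens here ... */

            /* Transfer ownership to the CPU: the arch code invalidates the
             * CPU cache lines covering the buffer on non-coherent systems. */
            dma_sync_single_for_cpu(dev, dma, len, DMA_FROM_DEVICE);

            /* ... CPU may now safely read cpu_buf ... */

            /* Hand the buffer back to the device for the next transfer. */
            dma_sync_single_for_device(dev, dma, len, DMA_FROM_DEVICE);

            dma_unmap_single(dev, dma, len, DMA_FROM_DEVICE);
    }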

> It would just be doing this for the importer and exactly that 
> would be bad design because we then have handling for the display driver 
> outside of the driver.
> 
The driver would have to do those cache maintenance operations if it
directly worked with a non-coherent device. Doing it for the importer
is just doing it for another device, not the one directly managed by
the exporter.

I really don't see the difference to the other dma-buf ops: in
dma_buf_map_attachment the exporter maps the dma-buf on behalf and into
the address space of the importer. Why would cache maintenance be any
different?

> > > This is just a software solution which works because of coincident and
> > > not because of 

Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Christian König

Am 02.11.22 um 13:50 schrieb Pekka Paalanen:

On Wed, 2 Nov 2022 13:27:18 +0100
Christian König  wrote:


Am 02.11.22 um 13:19 schrieb Pekka Paalanen:

On Wed, 2 Nov 2022 12:18:01 +0100
Christian König  wrote:
  

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

[SNIP]

But the client is just a video player. It doesn't understand how to
allocate BOs for Panfrost or AMD or etnaviv. So without a universal
allocator (again ...), 'just allocate on the GPU' isn't a useful
response to the client.

Well exactly that's the point I'm raising: The client *must* understand
that!

See we need to be able to handle all restrictions here, coherency of the
data is just one of them.

For example the much more important question is the location of the data
and for this allocating from the V4L2 device is in most cases just not
going to fly.

It feels like this is a generic statement and there is no reason it could not be
the other way around.

And exactly that's my point. You always need to look at both ways to
share the buffer and can't assume that one will always work.

As far as I can see, you guys just allocate a buffer from a V4L2
device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys haven't pushed
back on this practice already.

What should we Wayland people be pushing back on exactly? And where is
our authority and opportunity to do so?

The Wayland protocol dmabuf extension allows a graceful failure if the
Wayland compositor cannot use the given dmabuf at all, giving the
client an opportunity to try something else.

That's exactly what I meant by pushing back :)

I wasn't aware that this handling was already implemented.

Well... it's "optional". The Wayland dmabuf protocol has two cases for
a client/app:

a) you do the right thing and wait for the compositor to ack that the
dmabuf works (the reply should come pretty much immediately, not
needing the compositor to actually composite), or

b) you just send the dmabuf and continue as if it always worked. If it
doesn't, you might get a protocol error later and be disconnected.

Guess which one Mesa uses?

I've been told Mesa has no way to handle a failure there, so it
doesn't. I would not be surprised if others just copy that behaviour.

I recall personally being against adding option b) to begin with, but
there it is, added for Mesa IIRC.


Well, I'm not sure if those projects actually used X or Wayland, but we 
did have cases where this was really used.


IIRC in the case of Prime offloading we now allocate a buffer from the 
displaying device in Mesa and do the copy from the rendering GPU to the 
displaying GPU on the client side.


The background is that this gave us much better control over which engine and 
parameters were used for the copy, resulting in a nice performance 
improvement.


Regards,
Christian.


Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Pekka Paalanen
On Wed, 2 Nov 2022 13:27:18 +0100
Christian König  wrote:

> Am 02.11.22 um 13:19 schrieb Pekka Paalanen:
> > On Wed, 2 Nov 2022 12:18:01 +0100
> > Christian König  wrote:
> >  
> >> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:  
> >>> [SNIP]  
> > But the client is just a video player. It doesn't understand how to
> > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > response to the client.  
>  Well exactly that's the point I'm raising: The client *must* understand
>  that!
> 
>  See we need to be able to handle all restrictions here, coherency of the
>  data is just one of them.
> 
>  For example the much more important question is the location of the data
>  and for this allocating from the V4L2 device is in most cases just not
>  going to fly.  
> >>> It feels like this is a generic statement and there is no reason it could 
> >>> not be
> >>> the other way around.  
> >> And exactly that's my point. You always need to look at both ways to
> >> share the buffer and can't assume that one will always work.
> >>
> >> As far as I can see, you guys just allocate a buffer from a V4L2
> >> device, fill it with data and send it to Wayland for displaying.
> >>
> >> To be honest I'm really surprised that the Wayland guys haven't pushed
> >> back on this practice already.
> > What should we Wayland people be pushing back on exactly? And where is
> > our authority and opportunity to do so?
> >
> > The Wayland protocol dmabuf extension allows a graceful failure if the
> > Wayland compositor cannot use the given dmabuf at all, giving the
> > client an opportunity to try something else.  
> 
> That's exactly what I meant with pushing back :)
> 
> I wasn't aware that this handling is already implemented.

Well... it's "optional". The Wayland dmabuf protocol has two cases for
a client/app:

a) you do the right thing and wait for the compositor to ack that the
   dmabuf works (the reply should come pretty much immediately, not
   needing the compositor to actually composite), or

b) you just send the dmabuf and continue as if it always worked. If it
   doesn't, you might get a protocol error later and be disconnected.

Guess which one Mesa uses?

I've been told Mesa has no way to handle a failure there, so it
doesn't. I would not be surprised if others just copy that behaviour.

I recall personally being against adding option b) to begin with, but
there it is, added for Mesa IIRC.
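
For illustration, here is a rough sketch of the two client paths using the
generated linux-dmabuf-unstable-v1 client API; the surrounding objects and the
single-plane add() parameters are assumptions for the sketch, not code from any
particular client:

    #include <stdint.h>
    #include "linux-dmabuf-unstable-v1-client-protocol.h"

    /* a) the careful path: wait for created/failed before using the buffer */
    static void params_created(void *data,
                               struct zwp_linux_buffer_params_v1 *params,
                               struct wl_buffer *buffer)
    {
            /* dmabuf accepted: attach/commit the wl_buffer */
    }

    static void params_failed(void *data,
                              struct zwp_linux_buffer_params_v1 *params)
    {
            /* dmabuf rejected: fall back to another allocation/import path */
    }

    static const struct zwp_linux_buffer_params_v1_listener params_listener = {
            .created = params_created,
            .failed  = params_failed,
    };

    static void try_import(struct zwp_linux_dmabuf_v1 *dmabuf, int fd,
                           uint32_t width, uint32_t height,
                           uint32_t format, uint32_t stride)
    {
            struct zwp_linux_buffer_params_v1 *params =
                    zwp_linux_dmabuf_v1_create_params(dmabuf);

            zwp_linux_buffer_params_v1_add(params, fd, 0, 0, stride, 0, 0);
            zwp_linux_buffer_params_v1_add_listener(params, &params_listener, NULL);
            zwp_linux_buffer_params_v1_create(params, width, height, format, 0);

            /* b) the "hope for the best" path instead calls
             * zwp_linux_buffer_params_v1_create_immed() and uses the returned
             * wl_buffer right away, risking a protocol error later. */
    }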


> > The Wayland protocol also
> > tells clients which DRM rendering device at minimum the dmabuf needs to
> > be compatible with. It even tells which KMS device the dmabuf could be
> > put on direct scanout if the dmabuf was suitable for that and direct
> > scanout is otherwise possible.  
> 
> Yeah, perfect. Exactly that's what's needed here.
> 
> > What the client (application) does with all that information is up to
> > the client. That code is not part of Wayland.
> >
> > I'm sure we would be happy to add protocol for anything that
> > https://github.com/cubanismo/allocator needs to become the universal
> > optimal buffer allocator library.  
> 
>  From what you wrote it's already perfectly covered.

Cool. Now if clients would just use it...


Thanks,
pq

> >> This only works because both the Wayland and X display pipelines are
> >> smart enough to insert an extra copy when they find that an imported
> >> buffer can't be used as a framebuffer directly.
> > The only fallback Wayland compositors tend to do is use OpenGL/Vulkan
> > rendering for composition if direct scanout on a KMS plane does not
> > work. There are many reasons why direct scanout may not be possible in
> > addition to hardware and drivers not agreeing to do it with the given
> > set of buffers.
> >
> > A general purpose (read: desktop) Wayland compositor simply cannot live
> > without being able to fall back from KMS composition to software/GPU
> > composition.
> >
> > But yes, there are use cases where that fallback is as good as failing
> > completely. Those are not desktops but more like set-top-boxes and TVs.  
> 
> Completely agree to this approach.
> 
> The only problem is that media players tend to not implement a way to 
> allow direct scanout because of those fallback paths.
> 
> But as you said that's their decision.
> 
> Thanks,
> Christian.
> 
> >
> >
> > Thanks,
> > pq  
> 





Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Christian König

Am 02.11.22 um 13:19 schrieb Pekka Paalanen:

On Wed, 2 Nov 2022 12:18:01 +0100
Christian König  wrote:


Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

[SNIP]

But the client is just a video player. It doesn't understand how to
allocate BOs for Panfrost or AMD or etnaviv. So without a universal
allocator (again ...), 'just allocate on the GPU' isn't a useful
response to the client.

Well exactly that's the point I'm raising: The client *must* understand
that!

See we need to be able to handle all restrictions here, coherency of the
data is just one of them.

For example the much more important question is the location of the data
and for this allocating from the V4L2 device is in most cases just not
going to fly.

It feels like this is a generic statement and there is no reason it could not be
the other way around.

And exactly that's my point. You always need to look at both ways to
share the buffer and can't assume that one will always work.

As far as I can see, you guys just allocate a buffer from a V4L2
device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys haven't pushed
back on this practice already.

What should we Wayland people be pushing back on exactly? And where is
our authority and opportunity to do so?

The Wayland protocol dmabuf extension allows a graceful failure if the
Wayland compositor cannot use the given dmabuf at all, giving the
client an opportunity to try something else.


That's exactly what I meant by pushing back :)

I wasn't aware that this handling was already implemented.


The Wayland protocol also
tells clients which DRM rendering device at minimum the dmabuf needs to
be compatible with. It even tells which KMS device the dmabuf could be
put on direct scanout if the dmabuf was suitable for that and direct
scanout is otherwise possible.


Yeah, perfect. Exactly that's what's needed here.


What the client (application) does with all that information is up to
the client. That code is not part of Wayland.

I'm sure we would be happy to add protocol for anything that
https://github.com/cubanismo/allocator needs to become the universal
optimal buffer allocator library.


From what you wrote it's already perfectly covered.


This only works because both the Wayland and X display pipelines are
smart enough to insert an extra copy when they find that an imported
buffer can't be used as a framebuffer directly.

The only fallback Wayland compositors tend to do is use OpenGL/Vulkan
rendering for composition if direct scanout on a KMS plane does not
work. There are many reasons why direct scanout may not be possible in
addition to hardware and drivers not agreeing to do it with the given
set of buffers.

A general purpose (read: desktop) Wayland compositor simply cannot live
without being able to fall back from KMS composition to software/GPU
composition.

But yes, there are use cases where that fallback is as good as failing
completely. Those are not desktops but more like set-top-boxes and TVs.


Completely agree to this approach.

The only problem is that media players tend to not implement a way to 
allow direct scanout because of those fallback paths.


But as you said that's their decision.

Thanks,
Christian.




Thanks,
pq




Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Christian König

Hi Lucas,

Am 02.11.22 um 12:39 schrieb Lucas Stach:

Hi Christian,

going to reply in more detail when I have some more time, so just some
quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

[SNIP]

As far as I can see, you guys just allocate a buffer from a V4L2
device, fill it with data and send it to Wayland for displaying.

To be honest I'm really surprised that the Wayland guys haven't pushed
back on this practice already.

This only works because both the Wayland and X display pipelines are
smart enough to insert an extra copy when they find that an imported
buffer can't be used as a framebuffer directly.


With bracketed access you could even make this case work, as the dGPU
would be able to slurp a copy of the dma-buf into LMEM for scanout.


Well, this copy is what we are trying to avoid here. The codec should 
pump the data into LMEM in the first place.



The only case that commonly fails is whenever we try to display UVC
created dmabufs,

Well, exactly that's not correct! The whole x86 use cases of direct
display for dGPUs are broken because media players think they can do the
simple thing and offload all the problematic cases to the display server.

This is absolutely *not* the common use case you describe here, but
rather something completely special to ARM.

It's the normal case for a lot of ARM SoCs.


Yeah, but it's not the normal case for everybody.

We had numerous projects where customers wanted to pump video data 
directly from a decoder into a GPU frame or from a GPU frame into an 
encoder.


The fact that media frameworks don't support that out of the box is 
simply a bug.



That world is certainly not
any less big than the x86 dGPU world. A huge number of devices are ARM
based set-top boxes and other video players. Just because it is a
special case for you doesn't mean it's a global special case.


Ok, let's stop with that. This isn't helpful in the technical discussion.




That we haven't heard anybody screaming that x86 doesn't work is just
because we handle the case that a buffer isn't directly displayable in
X/Wayland anyway, but this is absolutely not the optimal solution.

The argument that you want to keep the allocation on the codec side is
completely false as far as I can see.

We already had numerous projects where we reported this practice as bugs
to the GStreamer and FFMPEG project because it won't work on x86 with dGPUs.


And on a lot of ARM SoCs it's exactly the right thing to do.


Yeah and that's fine, it just doesn't seem to work in all cases.

For both x86 as well as the case here, where the CPU cache might be dirty, 
the exporter needs to be the device with the requirements.


For x86 dGPUs the requirement is that the backing store is some local memory. 
For the non-coherent ARM devices it's that the CPU cache is not dirty.


For a device driver which solely works with cached system memory, 
inserting cache flush operations is something it would never do for 
itself. It would just be doing it for the importer, and exactly that 
would be bad design because we would then have handling for the display 
driver outside of that driver.



This is just a software solution which works because of coincidence and
not because of engineering.

By mandating a software fallback for the cases where you would need
bracketed access to the dma-buf, you simply shift the problem into
userspace. Userspace then creates the bracket by falling back to some
other import option that mostly do a copy and then the appropriate
cache maintenance.

While I understand your sentiment about the DMA-API design being
inconvenient when things are just coherent by system design, the DMA-
API design wasn't done this way due to bad engineering, but due to the
fact that performant DMA access on some systems just require this kind
of bracketing.


Well, this is exactly what I'm criticizing about the DMA-API. Instead of 
giving you a proper error code when something won't work in a specific 
way, it just tries to hide the requirements inside the DMA layer.


For example when your device can only access 32 bits the DMA-API 
transparently inserts bounce buffers instead of giving you a proper error 
code saying that the memory in question can't be accessed.


This just tries to hide the underlying problem instead of pushing it 
into the upper layer where it can be handled much more gracefully.


Regards,
Christian.



Regards,
Lucas





Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Pekka Paalanen
On Wed, 2 Nov 2022 12:18:01 +0100
Christian König  wrote:

> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > [SNIP]  
> >>> But the client is just a video player. It doesn't understand how to
> >>> allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> >>> allocator (again ...), 'just allocate on the GPU' isn't a useful
> >>> response to the client.  
> >> Well exactly that's the point I'm raising: The client *must* understand
> >> that!
> >>
> >> See we need to be able to handle all restrictions here, coherency of the
> >> data is just one of them.
> >>
> >> For example the much more important question is the location of the data
> >> and for this allocating from the V4L2 device is in most cases just not
> >> going to fly.  
> > It feels like this is a generic statement and there is no reason it could 
> > not be
> > the other way around.  
> 
> And exactly that's my point. You always need to look at both ways to 
> share the buffer and can't assume that one will always work.
> 
> As far as I can see, you guys just allocate a buffer from a V4L2 
> device, fill it with data and send it to Wayland for displaying.
> 
> To be honest I'm really surprised that the Wayland guys haven't pushed 
> back on this practice already.

What should we Wayland people be pushing back on exactly? And where is
our authority and opportunity to do so?

The Wayland protocol dmabuf extension allows a graceful failure if the
Wayland compositor cannot use the given dmabuf at all, giving the
client an opportunity to try something else. The Wayland protocol also
tells clients which DRM rendering device at minimum the dmabuf needs to
be compatible with. It even tells which KMS device the dmabuf could be
put on direct scanout if the dmabuf was suitable for that and direct
scanout is otherwise possible.

What the client (application) does with all that information is up to
the client. That code is not part of Wayland.

I'm sure we would be happy to add protocol for anything that
https://github.com/cubanismo/allocator needs to become the universal
optimal buffer allocator library.


> This only works because both the Wayland and X display pipelines are 
> smart enough to insert an extra copy when they find that an imported 
> buffer can't be used as a framebuffer directly.

The only fallback Wayland compositors tend to do is use OpenGL/Vulkan
rendering for composition if direct scanout on a KMS plane does not
work. There are many reasons why direct scanout may not be possible in
addition to hardware and drivers not agreeing to do it with the given
set of buffers.

A general purpose (read: desktop) Wayland compositor simply cannot live
without being able to fall back from KMS composition to software/GPU
composition.

But yes, there are use cases where that fallback is as good as failing
completely. Those are not desktops but more like set-top-boxes and TVs.


Thanks,
pq




Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Lucas Stach
Hi Christian,

going to reply in more detail when I have some more time, so just some
quick thoughts for now.

Am Mittwoch, dem 02.11.2022 um 12:18 +0100 schrieb Christian König:
> Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:
> > [SNIP]
> > > > But the client is just a video player. It doesn't understand how to
> > > > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > > > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > > > response to the client.
> > > Well exactly that's the point I'm raising: The client *must* understand
> > > that!
> > > 
> > > See we need to be able to handle all restrictions here, coherency of the
> > > data is just one of them.
> > > 
> > > For example the much more important question is the location of the data
> > > and for this allocating from the V4L2 device is in most cases just not
> > > going to fly.
> > It feels like this is a generic statement and there is no reason it could 
> > not be
> > the other way around.
> 
> And exactly that's my point. You always need to look at both ways to 
> share the buffer and can't assume that one will always work.
> 
> As far as I can see, you guys just allocate a buffer from a V4L2 
> device, fill it with data and send it to Wayland for displaying.
> 
> To be honest I'm really surprised that the Wayland guys haven't pushed 
> back on this practice already.
> 
> This only works because both the Wayland and X display pipelines are 
> smart enough to insert an extra copy when they find that an imported 
> buffer can't be used as a framebuffer directly.
> 
With bracketed access you could even make this case work, as the dGPU
would be able to slurp a copy of the dma-buf into LMEM for scanout.
 
> >   I have a colleague who integrated a PCIe CODEC (Blaize Xplorer
> > X1600P PCIe Accelerator) hosting its own RAM. There were a large number
> > of ways to use it. Of course, in the current state of DMA-buf you have to
> > be an exporter to do anything fancy, but it did not have to be like this;
> > it's a design choice. I'm not sure what method was used in the end, the
> > driver isn't yet upstream, so maybe that is not even final. What I know is
> > that there are various conditions you may use the CODEC for, for which the
> > optimal location will vary. As an example, using the post processor or
> > not; see my next comment for more details.
> 
> Yeah, and stuff like this was already discussed multiple times. Local 
> memory of devices can only be made available by the exporter, not the 
> importer.
> 
> So in the case of separated camera and encoder you run into exactly the 
> same limitation that some device needs the allocation to happen on the 
> camera while others need it on the encoder.
> 
> > > The more common case is that you need to allocate from the GPU and then
> > > import that into the V4L2 device. The background is that all dGPUs I
> > > know of need the data inside local memory (VRAM) to be able to scan out
> > > from it.
> > The reality is that what is common to you might not be to others. In my
> > work, most ARM SoCs have displays that just handle direct scanout from
> > cameras and codecs.
> 
> > The only case that commonly fails is whenever we try to display UVC
> > created dmabufs,
> 
> Well, exactly that's not correct! The whole x86 use cases of direct 
> display for dGPUs are broken because media players think they can do the 
> simple thing and offload all the problematic cases to the display server.
> 
> This is absolutely *not* the common use case you describe here, but 
> rather something completely special to ARM.

It's the normal case for a lot of ARM SoCs. That world is certainly not
any less big than the x86 dGPU world. A huge number of devices are ARM
based set-top boxes and other video players. Just because it is a
special case for you doesn't mean it's a global special case.

> 
> >   which have a dirty CPU write cache, and this is the type of thing
> > we'd like to see solved. I think this series was addressing it in
> > principle, but by failing the import, and the point raised is that this
> > wasn't the optimal way.
> > 
> > There is a community project called LibreELEC; if you aren't aware, they
> > run Kodi with direct scanout of video streams on a wide variety of SoCs
> > and they use the CODEC as exporter all the time. They simply don't have
> > cases where the opposite is needed (or any kind of remote RAM to deal
> > with). In fact, FFMPEG does not really offer you any API to reverse the
> > allocation.
> 
> Ok, let me try to explain it once more. It sounds like I wasn't able to 
> get my point through.
> 
> That we haven't heard anybody screaming that x86 doesn't work is just 
> because we handle the case that a buffer isn't directly displayable in 
> X/Wayland anyway, but this is absolutely not the optimal solution.
> 
> The argument that you want to keep the allocation on the codec side is 
> completely false as far as I can see.
> 
> We already 

Re: Try to address the DMA-buf coherency problem

2022-11-02 Thread Christian König

Am 01.11.22 um 22:09 schrieb Nicolas Dufresne:

[SNIP]

But the client is just a video player. It doesn't understand how to
allocate BOs for Panfrost or AMD or etnaviv. So without a universal
allocator (again ...), 'just allocate on the GPU' isn't a useful
response to the client.

Well exactly that's the point I'm raising: The client *must* understand
that!

See we need to be able to handle all restrictions here, coherency of the
data is just one of them.

For example the much more important question is the location of the data
and for this allocating from the V4L2 device is in most cases just not
going to fly.

It feels like this is a generic statement and there is no reason it could not be
the other way around.


And exactly that's my point. You always need to look at both ways to 
share the buffer and can't assume that one will always work.


As far as I can see it you guys just allocate a buffer from a V4L2 
device, fill it with data and send it to Wayland for displaying.


To be honest I'm really surprised that the Wayland guys haven't pushed
back on this practice already.


This only works because the Wayland as well as X display pipeline is
smart enough to insert an extra copy when it finds that an imported
buffer can't be used as a framebuffer directly.
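
To make that fallback concrete, a rough sketch of the import-and-try logic
on the KMS side (simplified and untested; real compositors also do an atomic
test commit rather than relying on AddFB alone):

#include <stdbool.h>
#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

/* Try to turn an imported dmabuf into a KMS framebuffer for direct
 * scanout. If this fails, the compositor falls back to importing the
 * buffer into the GPU and compositing it, i.e. the extra copy. */
static bool try_direct_scanout(int kms_fd, int dmabuf_fd,
                               uint32_t width, uint32_t height,
                               uint32_t fourcc, uint32_t stride,
                               uint64_t modifier, uint32_t *fb_id)
{
    uint32_t handle;

    if (drmPrimeFDToHandle(kms_fd, dmabuf_fd, &handle))
        return false;

    uint32_t handles[4]   = { handle };
    uint32_t pitches[4]   = { stride };
    uint32_t offsets[4]   = { 0 };
    uint64_t modifiers[4] = { modifier };

    return drmModeAddFB2WithModifiers(kms_fd, width, height, fourcc,
                                      handles, pitches, offsets,
                                      modifiers, fb_id,
                                      DRM_MODE_FB_MODIFIERS) == 0;
}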



  I have a colleague who integrated a PCIe CODEC (Blaize Xplorer
X1600P PCIe Accelerator) hosting its own RAM. There were a large number of ways
to use it. Of course, in the current state of DMA-buf you have to be an exporter
to do anything fancy, but it did not have to be like this; it's a design choice.
I'm not sure what method was used in the end, the driver isn't yet upstream, so
maybe that is not even final. What I know is that there are various conditions
under which you may use the CODEC, and the optimal location will vary. As an
example, using the post processor or not; see my next comment for more details.


Yeah, and stuff like this was already discussed multiple times. Local 
memory of devices can only be made available by the exporter, not the 
importer.


So in the case of separated camera and encoder you run into exactly the 
same limitation that some device needs the allocation to happen on the 
camera while others need it on the encoder.



The more common case is that you need to allocate from the GPU and then
import that into the V4L2 device. The background is that all dGPUs I
know of need the data inside local memory (VRAM) to be able to scan out
from it.

The reality is that what is common to you might not be to others. In my work,
most ARM SoCs have displays that just handle direct scanout from cameras and
codecs.



The only case that commonly fails is whenever we try to display UVC-created
dmabufs,


Well, exactly that's not correct! The whole x86 use cases of direct 
display for dGPUs are broken because media players think they can do the 
simple thing and offload all the problematic cases to the display server.


This is absolutely *not* the common use case you describe here, but 
rather something completely special to ARM.



  which have dirty CPU write caches, and this is the type of thing
we'd like to see solved. I think this series was addressing it in principle, but
by failing the import, and the point raised is that this wasn't the optimal way.

There is a community project called LibreELEC, in case you aren't aware; they run
Kodi with direct scanout of the video stream on a wide variety of SoCs, and they
use the CODEC as the exporter all the time. They simply don't have cases where the
opposite is needed (or any kind of remote RAM to deal with). In fact, FFmpeg
does not really offer you any API to reverse the allocation.
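
(For reference, the codec-as-exporter flow referred to here is essentially
V4L2's VIDIOC_EXPBUF path; a rough, untested sketch with error handling
omitted:)

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

/* Export a previously requested V4L2 capture buffer as a dmabuf fd,
 * which the application then hands to the display/GPU as importer. */
static int export_capture_buffer(int video_fd, unsigned int index)
{
    struct v4l2_exportbuffer expbuf;

    memset(&expbuf, 0, sizeof(expbuf));
    expbuf.type  = V4L2_BUF_TYPE_VIDEO_CAPTURE_MPLANE;
    expbuf.index = index;                /* set up via VIDIOC_REQBUFS */
    expbuf.plane = 0;
    expbuf.flags = O_RDWR | O_CLOEXEC;

    if (ioctl(video_fd, VIDIOC_EXPBUF, &expbuf))
        return -1;

    return expbuf.fd;                    /* caller owns the dmabuf fd */
}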


Ok, let me try to explain it once more. It sounds like I wasn't able to 
get my point through.


That we haven't heard anybody screaming that x86 doesn't work is just 
because we handle the case that a buffer isn't directly displayable in 
X/Wayland anyway, but this is absolutely not the optimal solution.


The argument that you want to keep the allocation on the codec side is 
completely false as far as I can see.


We already had numerous projects where we reported this practice as a bug
to the GStreamer and FFmpeg projects because it won't work on x86 with dGPUs.


This is just a software solution which works because of coincidence and
not because of engineering.


Regards,
Christian.




Re: Try to address the DMA-buf coherency problem

2022-11-01 Thread Nicolas Dufresne
Le mardi 01 novembre 2022 à 18:40 +0100, Christian König a écrit :
> Am 28.10.22 um 20:47 schrieb Daniel Stone:
> > Hi Christian,
> > 
> > On Fri, 28 Oct 2022 at 18:50, Christian König
> >  wrote:
> > > Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:
> > > > Though, its not generically possible to reverse these roles. If you 
> > > > want to do
> > > > so, you endup having to do like Android (gralloc) and ChromeOS 
> > > > (minigbm),
> > > > because you will have to allocate DRM buffers that knows about importer 
> > > > specific
> > > > requirements. See link [1] for what it looks like for RK3399, with 
> > > > Motion Vector
> > > > size calculation copied from the kernel driver into a userspace lib 
> > > > (arguably
> > > > that was available from V4L2 sizeimage, but this is technically 
> > > > difficult to
> > > > communicate within the software layers). If you could let the decoder 
> > > > export
> > > > (with proper cache management) the non-generic code would not be needed.
> > > Yeah, but I can also reverse the argument:
> > > 
> > > Getting the parameters for V4L right so that we can share the image is
> > > tricky, but getting the parameters so that the stuff is actually
> > > directly displayable by GPUs is even trickier.
> > > 
> > > Essentially you need to look at both sides and interference to get to a
> > > common ground, e.g. alignment, pitch, width/height, padding, etc.
> > > 
> > > Deciding from which side to allocate from is just one step in this
> > > process. For example most dGPUs can't display directly from system
> > > memory altogether, but it is possible to allocate the DMA-buf through
> > > the GPU driver and then write into device memory with P2P PCI transfers.
> > > 
> > > So as far as I can see switching importer and exporter roles and even
> > > having performant extra fallbacks should be a standard feature of 
> > > userspace.
> > > 
> > > > Another case where reversing the role is difficult is for case where 
> > > > you need to
> > > > multiplex the streams (let's use a camera to illustrate) and share that 
> > > > with
> > > > multiple processes. In these uses case, the DRM importers are volatile, 
> > > > which
> > > > one do you abuse to do allocation from ? In multimedia server like 
> > > > PipeWire, you
> > > > are not really aware if the camera will be used by DRM or not, and if 
> > > > something
> > > > "special" is needed in term of role inversion. It is relatively easy to 
> > > > deal
> > > > with matching modifiers, but using downstream (display/gpu) as an 
> > > > exporter is
> > > > always difficult (and require some level of abuse and guessing).
> > > Oh, very good point! Yeah we do have use cases for this where an input
> > > buffer is both displayed as well as encoded.
> > This is the main issue, yeah.
> > 
> > For a standard media player, they would try to allocate through V4L2
> > and decode through that into locally-allocated buffers. All they know
> > is that there's a Wayland server at the other end of a socket
> > somewhere which will want to import the FD. The server does give you
> > some hints along the way: it will tell you that importing into a
> > particular GPU target device is necessary as the ultimate fallback,
> > and importing into a particular KMS device is preferable as the
> > optimal path to hit an overlay.
> > 
> > So let's say that the V4L2 client does what you're proposing: it
> > allocates a buffer chain, schedules a decode into that buffer, and
> > passes it along to the server to import. The server fails to import
> > the buffer into the GPU, and tells the client this. The client then
> > ... well, it doesn't know that it needs to allocate within the GPU
> > instead, but it knows that doing so might be one thing which would
> > make the request succeed.
> > 
> > But the client is just a video player. It doesn't understand how to
> > allocate BOs for Panfrost or AMD or etnaviv. So without a universal
> > allocator (again ...), 'just allocate on the GPU' isn't a useful
> > response to the client.
> 
> Well exactly that's the point I'm raising: The client *must* understand 
> that!
> 
> See we need to be able to handle all restrictions here, coherency of the 
> data is just one of them.
> 
> For example the much more important question is the location of the data 
> and for this allocating from the V4L2 device is in most cases just not 
> going to fly.

It feels like this is a generic statement and there is no reason it could not be
the other way around. I have a colleague who integrated a PCIe CODEC (Blaize
Xplorer X1600P PCIe Accelerator) hosting its own RAM. There were a large number
of ways to use it. Of course, in the current state of DMA-buf you have to be an
exporter to do anything fancy, but it did not have to be like this; it's a design
choice. I'm not sure what method was used in the end, the driver isn't yet
upstream, so maybe that is not even final. What I know is that there are various
conditions under which you may use the 

Re: Try to address the DMA-buf coherency problem

2022-11-01 Thread Christian König

Am 28.10.22 um 20:47 schrieb Daniel Stone:

Hi Christian,

On Fri, 28 Oct 2022 at 18:50, Christian König
 wrote:

Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:

Though, it's not generically possible to reverse these roles. If you want to do
so, you end up having to do like Android (gralloc) and ChromeOS (minigbm),
because you will have to allocate DRM buffers that know about importer-specific
requirements. See link [1] for what it looks like for RK3399, with the Motion
Vector size calculation copied from the kernel driver into a userspace lib
(arguably that was available from V4L2 sizeimage, but this is technically
difficult to communicate within the software layers). If you could let the
decoder export (with proper cache management), the non-generic code would not
be needed.

Yeah, but I can also reverse the argument:

Getting the parameters for V4L right so that we can share the image is
tricky, but getting the parameters so that the stuff is actually
directly displayable by GPUs is even trickier.

Essentially you need to look at both sides and work towards a common
ground, e.g. alignment, pitch, width/height, padding, etc.

Deciding from which side to allocate from is just one step in this
process. For example most dGPUs can't display directly from system
memory altogether, but it is possible to allocate the DMA-buf through
the GPU driver and then write into device memory with P2P PCI transfers.

So as far as I can see switching importer and exporter roles and even
having performant extra fallbacks should be a standard feature of userspace.


Another case where reversing the role is difficult is the case where you need to
multiplex the streams (let's use a camera to illustrate) and share that with
multiple processes. In these use cases, the DRM importers are volatile, so which
one do you abuse to do the allocation from? In a multimedia server like PipeWire,
you are not really aware if the camera will be used by DRM or not, and whether
something "special" is needed in terms of role inversion. It is relatively easy
to deal with matching modifiers, but using a downstream component (display/GPU)
as an exporter is always difficult (and requires some level of abuse and guessing).

Oh, very good point! Yeah we do have use cases for this where an input
buffer is both displayed as well as encoded.

This is the main issue, yeah.

For a standard media player, they would try to allocate through V4L2
and decode through that into locally-allocated buffers. All they know
is that there's a Wayland server at the other end of a socket
somewhere which will want to import the FD. The server does give you
some hints along the way: it will tell you that importing into a
particular GPU target device is necessary as the ultimate fallback,
and importing into a particular KMS device is preferable as the
optimal path to hit an overlay.

So let's say that the V4L2 client does what you're proposing: it
allocates a buffer chain, schedules a decode into that buffer, and
passes it along to the server to import. The server fails to import
the buffer into the GPU, and tells the client this. The client then
... well, it doesn't know that it needs to allocate within the GPU
instead, but it knows that doing so might be one thing which would
make the request succeed.

But the client is just a video player. It doesn't understand how to
allocate BOs for Panfrost or AMD or etnaviv. So without a universal
allocator (again ...), 'just allocate on the GPU' isn't a useful
response to the client.


Well exactly that's the point I'm raising: The client *must* understand 
that!


See we need to be able to handle all restrictions here, coherency of the 
data is just one of them.


For example the much more important question is the location of the data 
and for this allocating from the V4L2 device is in most cases just not 
going to fly.


The more common case is that you need to allocate from the GPU and then 
import that into the V4L2 device. The background is that all dGPUs I 
know of need the data inside local memory (VRAM) to be able to scan out 
from it.
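
For illustration, that direction (GPU allocation first, then import into the
V4L2 device) would look roughly like this in userspace (untested sketch,
single-planar, error handling omitted):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <gbm.h>
#include <linux/videodev2.h>

/* Allocate a scanout-capable buffer on the GPU side with GBM and queue
 * it to a V4L2 device via V4L2_MEMORY_DMABUF, so the device writes
 * straight into memory the display can use. */
static int queue_gpu_buffer(int render_fd, int video_fd,
                            unsigned int width, unsigned int height)
{
    struct gbm_device *gbm = gbm_create_device(render_fd);
    struct gbm_bo *bo = gbm_bo_create(gbm, width, height,
                                      GBM_FORMAT_XRGB8888,
                                      GBM_BO_USE_SCANOUT | GBM_BO_USE_LINEAR);
    int dmabuf_fd = gbm_bo_get_fd(bo);

    struct v4l2_requestbuffers req = {
        .count  = 1,
        .type   = V4L2_BUF_TYPE_VIDEO_CAPTURE,
        .memory = V4L2_MEMORY_DMABUF,
    };
    ioctl(video_fd, VIDIOC_REQBUFS, &req);

    struct v4l2_buffer buf;
    memset(&buf, 0, sizeof(buf));
    buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    buf.memory = V4L2_MEMORY_DMABUF;
    buf.index  = 0;
    buf.m.fd   = dmabuf_fd;   /* the V4L2 device imports the GPU allocation */

    return ioctl(video_fd, VIDIOC_QBUF, &buf);
}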



I fully understand your point about APIs like Vulkan not sensibly
allowing bracketing, and that's fine. On the other hand, a lot of
extant use cases (camera/codec -> GPU/display, GPU -> codec, etc) on
Arm just cannot fulfill complete coherency. On a lot of these
platforms, despite what you might think about the CPU/GPU
capabilities, the bottleneck is _always_ memory bandwidth, so
mandating extra copies is an absolute non-starter, and would instantly
cripple billions of devices. Lucas has been pretty gentle, but to be
more clear, this is not an option and won't be for at least the next
decade.


Well x86 pretty much has the same restrictions.

For example the scanout buffer is almost always in local memory because
you often scan out at up to 120Hz while your recording is only 30fps and
most of the time at a lower resolution.


Pumping all that data 120 times a second over the PCIe bus would just not
be doable in 
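
(For a rough sense of scale, with illustrative numbers only: a 3840x2160
XRGB8888 frame is 3840 x 2160 x 4 bytes, roughly 33 MB; refreshing that
120 times per second is on the order of 4 GB/s of sustained transfer for
scanout alone, before any decode or composition traffic.)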

Re: Try to address the DMA-buf coherency problem

2022-10-28 Thread Daniel Stone
Hi Christian,

On Fri, 28 Oct 2022 at 18:50, Christian König
 wrote:
> Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:
> > Though, its not generically possible to reverse these roles. If you want to 
> > do
> > so, you endup having to do like Android (gralloc) and ChromeOS (minigbm),
> > because you will have to allocate DRM buffers that knows about importer 
> > specific
> > requirements. See link [1] for what it looks like for RK3399, with Motion 
> > Vector
> > size calculation copied from the kernel driver into a userspace lib 
> > (arguably
> > that was available from V4L2 sizeimage, but this is technically difficult to
> > communicate within the software layers). If you could let the decoder export
> > (with proper cache management) the non-generic code would not be needed.
>
> Yeah, but I can also reverse the argument:
>
> Getting the parameters for V4L right so that we can share the image is
> tricky, but getting the parameters so that the stuff is actually
> directly displayable by GPUs is even trickier.
>
> Essentially you need to look at both sides and interference to get to a
> common ground, e.g. alignment, pitch, width/height, padding, etc.
>
> Deciding from which side to allocate from is just one step in this
> process. For example most dGPUs can't display directly from system
> memory altogether, but it is possible to allocate the DMA-buf through
> the GPU driver and then write into device memory with P2P PCI transfers.
>
> So as far as I can see switching importer and exporter roles and even
> having performant extra fallbacks should be a standard feature of userspace.
>
> > Another case where reversing the role is difficult is for case where you 
> > need to
> > multiplex the streams (let's use a camera to illustrate) and share that with
> > multiple processes. In these uses case, the DRM importers are volatile, 
> > which
> > one do you abuse to do allocation from ? In multimedia server like 
> > PipeWire, you
> > are not really aware if the camera will be used by DRM or not, and if 
> > something
> > "special" is needed in term of role inversion. It is relatively easy to deal
> > with matching modifiers, but using downstream (display/gpu) as an exporter 
> > is
> > always difficult (and require some level of abuse and guessing).
>
> Oh, very good point! Yeah we do have use cases for this where an input
> buffer is both displayed as well as encoded.

This is the main issue, yeah.

For a standard media player, they would try to allocate through V4L2
and decode through that into locally-allocated buffers. All they know
is that there's a Wayland server at the other end of a socket
somewhere which will want to import the FD. The server does give you
some hints along the way: it will tell you that importing into a
particular GPU target device is necessary as the ultimate fallback,
and importing into a particular KMS device is preferable as the
optimal path to hit an overlay.

So let's say that the V4L2 client does what you're proposing: it
allocates a buffer chain, schedules a decode into that buffer, and
passes it along to the server to import. The server fails to import
the buffer into the GPU, and tells the client this. The client then
... well, it doesn't know that it needs to allocate within the GPU
instead, but it knows that doing so might be one thing which would
make the request succeed.

But the client is just a video player. It doesn't understand how to
allocate BOs for Panfrost or AMD or etnaviv. So without a universal
allocator (again ...), 'just allocate on the GPU' isn't a useful
response to the client.

I fully understand your point about APIs like Vulkan not sensibly
allowing bracketing, and that's fine. On the other hand, a lot of
extant use cases (camera/codec -> GPU/display, GPU -> codec, etc) on
Arm just cannot fulfill complete coherency. On a lot of these
platforms, despite what you might think about the CPU/GPU
capabilities, the bottleneck is _always_ memory bandwidth, so
mandating extra copies is an absolute non-starter, and would instantly
cripple billions of devices. Lucas has been pretty gentle, but to be
more clear, this is not an option and won't be for at least the next
decade.

So we obviously need a third way at this point, because 'all devices
must always be coherent' vs. 'cache must be an unknown' can't work.
How about this as a suggestion: we have some unused flags in the PRIME
ioctls. Can we add a flag for 'import must be coherent'?

That flag wouldn't be set for the existing ecosystem
Lucas/Nicolas/myself are talking about, where we have explicit
handover points and users are fully able to perform cache maintenance.
For newer APIs where it's not possible to properly express that
bracketing, they would always set that flag (unless we add an API
carve-out where the client promises to do whatever is required to
maintain that).
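
In userspace terms the idea would be something like the following (purely
illustrative; DRM_PRIME_FLAG_IMPORT_COHERENT is an invented name, no such
flag exists today):

#include <stdint.h>
#include <sys/ioctl.h>
#include <xf86drm.h>

#define DRM_PRIME_FLAG_IMPORT_COHERENT (1 << 2)   /* hypothetical flag */

/* Sketch: an importer that cannot bracket its accesses asks for a
 * coherent import; the kernel would fail the import instead of handing
 * out a buffer that needs manual cache maintenance. */
static int import_requiring_coherency(int drm_fd, int dmabuf_fd,
                                      uint32_t *bo_handle)
{
    struct drm_prime_handle args = {
        .fd    = dmabuf_fd,
        .flags = DRM_PRIME_FLAG_IMPORT_COHERENT,
    };

    if (ioctl(drm_fd, DRM_IOCTL_PRIME_FD_TO_HANDLE, &args))
        return -1;   /* not satisfiable: fall back or re-allocate */

    *bo_handle = args.handle;
    return 0;
}

The existing explicit-handover users simply wouldn't set the flag and would
keep today's behaviour.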

Would that be viable?

Cheers,
Daniel


Re: Try to address the DMA-buf coherency problem

2022-10-28 Thread Christian König

Hi Nicolas,

Am 28.10.22 um 17:46 schrieb Nicolas Dufresne:

Hi,

just dropping some real-life use cases; sorry, I'm not really proposing solutions,
I believe you are much more knowledgeable in this regard.


Well, I think each of us has a lot of specialized knowledge. For example
I don't know too much about gralloc/minigbm. So this input is very valuable.



Le vendredi 28 octobre 2022 à 16:26 +0200, Christian König a écrit :

Am 28.10.22 um 13:42 schrieb Lucas Stach:

Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:

But essentially the right thing to do. The only alternative I can see is
to reverse the role of exporter and importer.


I don't think that would work generally either, as buffer exporter and
importer isn't always a 1:1 thing. As soon as any attached importer has
a different coherency behavior than the others, things fall apart.

I've just mentioned it because somebody noted that when you reverse the
roles of exporter and importer with the V4L driver and i915 then the use
case suddenly starts working.

Though, it's not generically possible to reverse these roles. If you want to do
so, you end up having to do like Android (gralloc) and ChromeOS (minigbm),
because you will have to allocate DRM buffers that know about importer-specific
requirements. See link [1] for what it looks like for RK3399, with the Motion
Vector size calculation copied from the kernel driver into a userspace lib
(arguably that was available from V4L2 sizeimage, but this is technically
difficult to communicate within the software layers). If you could let the
decoder export (with proper cache management), the non-generic code would not
be needed.


Yeah, but I can also reverse the argument:

Getting the parameters for V4L right so that we can share the image is 
tricky, but getting the parameters so that the stuff is actually 
directly displayable by GPUs is even trickier.


Essentially you need to look at both sides and work towards a common
ground, e.g. alignment, pitch, width/height, padding, etc.


Deciding from which side to allocate from is just one step in this 
process. For example most dGPUs can't display directly from system 
memory altogether, but it is possible to allocate the DMA-buf through 
the GPU driver and then write into device memory with P2P PCI transfers.


So as far as I can see switching importer and exporter roles and even 
having performant extra fallbacks should be a standard feature of userspace.



Another case where reversing the role is difficult is the case where you need to
multiplex the streams (let's use a camera to illustrate) and share that with
multiple processes. In these use cases, the DRM importers are volatile, so which
one do you abuse to do the allocation from? In a multimedia server like PipeWire,
you are not really aware if the camera will be used by DRM or not, and whether
something "special" is needed in terms of role inversion. It is relatively easy
to deal with matching modifiers, but using a downstream component (display/GPU)
as an exporter is always difficult (and requires some level of abuse and guessing).


Oh, very good point! Yeah we do have use cases for this where an input 
buffer is both displayed as well as encoded.





Well, no. What I mean with coherency is that the devices don't need to
insert special operations to access each other's data.

This can be achieved by multiple approaches, e.g. by the PCI coherency
requirements, device internal connections (XGMI, NVLink, CXL etc...) as
well as using uncached system memory.

The key point is that what we certainly don't want are special operations
which say: OK, now device A can access the data, now device B,
because this breaks tons of use cases.

I'm coming back again with the multiplexing case. We keep having mixed use cases
with multiple receivers. In some cases, data may end up on the CPU while being
encoded in HW. The current approach of disabling the cache does work, but CPU
algorithms truly suffer in performance. Doing a full memcpy to a cached buffer
helps, but remains slower than if the cache had been snooped by the importer
(encoder here) driver.


Yeah, that was another reason why we ended up rather having an extra 
copy than working with uncached buffers for display as well.


Regards,
Christian.


Re: Try to address the DMA-buf coherency problem

2022-10-28 Thread Nicolas Dufresne
Hi,

just dropping some real-life use cases; sorry, I'm not really proposing solutions,
I believe you are much more knowledgeable in this regard.

Le vendredi 28 octobre 2022 à 16:26 +0200, Christian König a écrit :
> Am 28.10.22 um 13:42 schrieb Lucas Stach:
> > Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:
> > > But essentially the right thing to do. The only alternative I can see is
> > > to reverse the role of exporter and importer.
> > > 
> > I don't think that would work generally either, as buffer exporter and
> > importer isn't always a 1:1 thing. As soon as any attached importer has
> > a different coherency behavior than the others, things fall apart.
> 
> I've just mentioned it because somebody noted that when you reverse the 
> roles of exporter and importer with the V4L driver and i915 then the use 
> case suddenly starts working.

Though, it's not generically possible to reverse these roles. If you want to do
so, you end up having to do like Android (gralloc) and ChromeOS (minigbm),
because you will have to allocate DRM buffers that know about importer-specific
requirements. See link [1] for what it looks like for RK3399, with the Motion
Vector size calculation copied from the kernel driver into a userspace lib
(arguably that was available from V4L2 sizeimage, but this is technically
difficult to communicate within the software layers). If you could let the
decoder export (with proper cache management), the non-generic code would not
be needed.

Another case where reversing the role is difficult is the case where you need to
multiplex the streams (let's use a camera to illustrate) and share that with
multiple processes. In these use cases, the DRM importers are volatile, so which
one do you abuse to do the allocation from? In a multimedia server like PipeWire,
you are not really aware if the camera will be used by DRM or not, and whether
something "special" is needed in terms of role inversion. It is relatively easy
to deal with matching modifiers, but using a downstream component (display/GPU)
as an exporter is always difficult (and requires some level of abuse and guessing).

[1]
https://android.googlesource.com/platform/external/minigbm/+/refs/heads/master/rockchip.c#140

> 
> > > > > For DRM and most V4L2 devices I then fill in the dma_coherent flag 
> > > > > based on the
> > > > > return value of dev_is_dma_coherent(). Exporting drivers are allowed 
> > > > > to clear
> > > > > the flag for their buffers if special handling like the USWC flag in 
> > > > > amdgpu or
> > > > > the uncached allocations for radeon/nouveau are in use.
> > > > > 
> > > > I don't think the V4L2 part works for most ARM systems. The default
> > > > there is for devices to be noncoherent unless explicitly marked
> > > > otherwise. I don't think any of the "devices" writing the video buffers
> > > > in cached memory with the CPU do this. While we could probably mark
> > > > them as coherent, I don't think this is moving in the right direction.
> > > Well why not? Those devices are coherent in the sense of the DMA API
> > > that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.
> > > 
> > > We could come up with a better name for coherency, e.g. snooping for
> > > example. But that is just an documentation detail.
> > > 
> > I agree that those devices copying data into a CPU cacheable buffer
> > should be marked as coherent, just not sure right now if other things
> > like DMA mappings are done on that device, which would require the
> > cache maintenance.
> 
> Yeah, good point.
> 
> > > And this the exact wrong approach as far as I can see. As Daniel noted
> > > as well we absolutely need some kind of coherency between exporter and
> > > importer.
> > > 
> > I think it's important that we are very specific about the thing we are
> > talking about here: I guess when you say coherency you mean hardware
> > enforced coherency on cacheable memory, which is the default on
> > x86/PCI.
> 
> Well, no. What I mean with coherency is that the devices don't need 
> insert special operation to access each others data.
> 
> This can be archived by multiple approaches, e.g. by the PCI coherency 
> requirements, device internal connections (XGMI, NVLink, CXL etc...) as 
> well as using uncached system memory.
> 
> The key point is what we certainly don't want is special operations 
> which say: Ok, now device A can access the data, now device B. 
> because this breaks tons of use cases.

I'm coming back again with the multiplexing case. We keep having mixed use cases
with multiple receivers. In some cases, data may end up on the CPU while being
encoded in HW. The current approach of disabling the cache does work, but CPU
algorithms truly suffer in performance. Doing a full memcpy to a cached buffer
helps, but remains slower than if the cache had been snooped by the importer
(encoder here) driver.
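
(For completeness, the bracketed CPU access that would let the cache stay
enabled already has a uapi shape for mmap'ed dma-bufs, DMA_BUF_IOCTL_SYNC;
a rough sketch, with run_cpu_algorithm() as a made-up placeholder:)

#include <stddef.h>
#include <sys/ioctl.h>
#include <linux/dma-buf.h>

void run_cpu_algorithm(void *data, size_t size);   /* placeholder */

/* Bracket CPU access to an mmap'ed dma-buf so the kernel can insert
 * whatever cache maintenance the exporting device needs (a no-op on
 * fully coherent systems). */
static void cpu_process(int dmabuf_fd, void *mapped, size_t size)
{
    struct dma_buf_sync sync = {
        .flags = DMA_BUF_SYNC_START | DMA_BUF_SYNC_RW,
    };

    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);   /* begin CPU access */

    run_cpu_algorithm(mapped, size);

    sync.flags = DMA_BUF_SYNC_END | DMA_BUF_SYNC_RW;
    ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);   /* end CPU access */
}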

> 
> > The other way to enforce coherency is to either insert cache
> > maintenance operations, or make sure that the buffer is not cacheable

Re: Try to address the DMA-buf coherency problem

2022-10-28 Thread Christian König

Am 28.10.22 um 13:42 schrieb Lucas Stach:

Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:

But essentially the right thing to do. The only alternative I can see is
to reverse the role of exporter and importer.


I don't think that would work generally either, as buffer exporter and
importer isn't always a 1:1 thing. As soon as any attached importer has
a different coherency behavior than the others, things fall apart.


I've just mentioned it because somebody noted that when you reverse the 
roles of exporter and importer with the V4L driver and i915 then the use 
case suddenly starts working.



For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
the flag for their buffers if special handling like the USWC flag in amdgpu or
the uncached allocations for radeon/nouveau are in use.


I don't think the V4L2 part works for most ARM systems. The default
there is for devices to be noncoherent unless explicitly marked
otherwise. I don't think any of the "devices" writing the video buffers
in cached memory with the CPU do this. While we could probably mark
them as coherent, I don't think this is moving in the right direction.

Well why not? Those devices are coherent in the sense of the DMA API
that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.

We could come up with a better name for coherency, e.g. snooping for
example. But that is just a documentation detail.


I agree that those devices copying data into a CPU cacheable buffer
should be marked as coherent, just not sure right now if other things
like DMA mappings are done on that device, which would require the
cache maintenance.


Yeah, good point.


And this is the exact wrong approach as far as I can see. As Daniel noted
as well, we absolutely need some kind of coherency between exporter and
importer.


I think it's important that we are very specific about the thing we are
talking about here: I guess when you say coherency you mean hardware
enforced coherency on cacheable memory, which is the default on
x86/PCI.


Well, no. What I mean with coherency is that the devices don't need to
insert special operations to access each other's data.


This can be achieved by multiple approaches, e.g. by the PCI coherency
requirements, device internal connections (XGMI, NVLink, CXL etc...) as
well as using uncached system memory.


The key point is that what we certainly don't want are special operations
which say: OK, now device A can access the data, now device B,
because this breaks tons of use cases.



The other way to enforce coherency is to either insert cache
maintenance operations, or make sure that the buffer is not cacheable
by any device taking part in the sharing, including the CPU.


Yes and no. When we want the devices to interact with each other without 
the CPU then we need some negotiated coherency between the two.



This can either be provided by the PCI specification, which makes it
mandatory for devices to snoop the caches, or by platform devices agreeing
that they only work on uncached memory.

What you disregard here is the fact that there are many systems out
there with mixed topologies, where some masters are part of the
coherency domain and some are not.

We have two options here: either mandate that coherency for dma-bufs
needs to be established by hardware, which is the option that you
strongly prefer, which means forcing all buffers to be uncacheable in a
system with masters that are not coherent, or allowing some form of
bracketed DMA access with cache maintenance ops.


Well I don't prefer that option, it's just the only one I can see. One
of the main goals of DMA-buf is to allow devices to share data without
the need for CPU interactions.


In other words we negotiate the high-level properties and then the
devices can talk to each other without explicitly noting who is accessing
what.


And this concept is completely incompatible with maintenance ops. We 
made that mistake with SWIOTLB for the DMA API and I don't really want 
to repeat that stunt.



Explicit cache flush operations are simply not part of the design of
DMA-buf because they are not part of the design of the higher-level APIs
like Vulkan and OpenGL.

I'm aware that some graphics APIs have been living in a universe
blissfully unaware of systems without hardware enforced coherency. But
that isn't the only use for dma-bufs.


Yeah, but the other use cases are extremely limited. As far as I can see



I totally agree that some higher-level API primitives aren't possible
without coherency at the hardware level, and for those uses we should
require either HW-enforced coherency or uncacheable memory. But I don't
think we should deliberately make things slow on systems that allow
optimizing things with the help of bracketed access.

If I understood things right your amdgpu use-case even falls into this
category: normally you would want to use 

Re: Try to address the DMA-buf coherency problem

2022-10-28 Thread Lucas Stach
Am Freitag, dem 28.10.2022 um 10:40 +0200 schrieb Christian König:
> Hi Lucas,
> 
> Am 28.10.22 um 10:09 schrieb Lucas Stach:
> > Hi Christian,
> > 
> > Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:
> > > Hi guys,
> > > 
> > > after finding that we essentially have two separate worlds for coherent 
> > > sharing
> > > of buffer through DMA-buf I thought I will tackle that problem a bit and 
> > > at
> > > least allow the framework to reject attachments which won't work.
> > > 
> > > So those patches here add a new dma_coherent flag to each DMA-buf object
> > > telling the framework that dev_is_dma_coherent() needs to return true for 
> > > an
> > > importing device to be able to attach. Since we should always have a 
> > > fallback
> > > path this should give userspace the chance to still keep the use case 
> > > working,
> > > either by doing a CPU copy instead or reversing the roles of exporter and
> > > importer.
> > > 
> > The fallback would likely be a CPU copy with the appropriate cache
> > flushes done by the device driver, which would be a massiv performance
> > issue.
> 
> But essentially the right thing to do. The only alternative I can see is 
> to reverse the role of exporter and importer.
> 
I don't think that would work generally either, as buffer exporter and
importer isn't always a 1:1 thing. As soon as any attached importer has
a different coherency behavior than the others, things fall apart.

> > > For DRM and most V4L2 devices I then fill in the dma_coherent flag based 
> > > on the
> > > return value of dev_is_dma_coherent(). Exporting drivers are allowed to 
> > > clear
> > > the flag for their buffers if special handling like the USWC flag in 
> > > amdgpu or
> > > the uncached allocations for radeon/nouveau are in use.
> > > 
> > I don't think the V4L2 part works for most ARM systems. The default
> > there is for devices to be noncoherent unless explicitly marked
> > otherwise. I don't think any of the "devices" writing the video buffers
> > in cached memory with the CPU do this. While we could probably mark
> > them as coherent, I don't think this is moving in the right direction.
> 
> Well why not? Those devices are coherent in the sense of the DMA API 
> that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.
> 
> We could come up with a better name for coherency, e.g. snooping for 
> example. But that is just an documentation detail.
> 
I agree that those devices copying data into a CPU cacheable buffer
should be marked as coherent, just not sure right now if other things
like DMA mappings are done on that device, which would require the
cache maintenance.

> > > Additional to that importers can also check the flag if they have some
> > > non-snooping operations like the special scanout case for amdgpu for 
> > > example.
> > > 
> > > The patches are only smoke tested and the solution isn't ideal, but as 
> > > far as
> > > I can see should at least keep things working.
> > > 
> > I would like to see this solved properly. Where I think properly means
> > we make things work on systems with mixed coherent/noncoherent masters
> > in the same way the DMA API does on such systems: by inserting the
> > proper cache maintenance operations.
> 
> And this the exact wrong approach as far as I can see. As Daniel noted 
> as well we absolutely need some kind of coherency between exporter and 
> importer.
> 
I think it's important that we are very specific about the thing we are
talking about here: I guess when you say coherency you mean hardware
enforced coherency on cacheable memory, which is the default on
x86/PCI.

The other way to enforce coherency is to either insert cache
maintenance operations, or make sure that the buffer is not cacheable
by any device taking part in the sharing, including the CPU.

> This can either be provided by the PCI specification which makes it 
> mandatory for device to snoop the caches or by platform devices agreeing 
> that they only work on uncached memory.

What you disregard here is the fact that there are many systems out
there with mixed topologies, where some masters are part of the
coherency domain and some are not.

We have two options here: either mandate that coherency for dma-bufs
needs to be established by hardware, which is the option that you
strongly prefer, which means forcing all buffers to be uncacheable in a
system with masters that are not coherent, or allowing some form of
bracketed DMA access with cache maintenance ops.

> 
> Explicit cache flush operations are simple not part of the design of 
> DMA-buf because they are not part of the design of the higher level APIs 
> like Vulkan and OpenGL.

I'm aware that some graphics APIs have been living in a universe
blissfully unaware of systems without hardware enforced coherency. But
that isn't the only use for dma-bufs.

I totally agree that some higher level API primitives aren't possible
without coherency at the hardware level and for those uses we 

Re: Try to address the DMA-buf coherency problem

2022-10-28 Thread Christian König

Hi Lucas,

Am 28.10.22 um 10:09 schrieb Lucas Stach:

Hi Christian,

Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:

Hi guys,

after finding that we essentially have two separate worlds for coherent sharing
of buffers through DMA-buf, I thought I would tackle that problem a bit and at
least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object
telling the framework that dev_is_dma_coherent() needs to return true for an
importing device to be able to attach. Since we should always have a fallback
path this should give userspace the chance to still keep the use case working,
either by doing a CPU copy instead or reversing the roles of exporter and
importer.


The fallback would likely be a CPU copy with the appropriate cache
flushes done by the device driver, which would be a massive performance
issue.


But essentially the right thing to do. The only alternative I can see is 
to reverse the role of exporter and importer.



For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
the flag for their buffers if special handling like the USWC flag in amdgpu or
the uncached allocations for radeon/nouveau are in use.


I don't think the V4L2 part works for most ARM systems. The default
there is for devices to be noncoherent unless explicitly marked
otherwise. I don't think any of the "devices" writing the video buffers
in cached memory with the CPU do this. While we could probably mark
them as coherent, I don't think this is moving in the right direction.


Well why not? Those devices are coherent in the sense of the DMA API 
that they don't need an extra CPU copy on sync_to_cpu/sync_to_device.


We could come up with a better name for coherency, e.g. snooping for
example. But that is just a documentation detail.



In addition to that, importers can also check the flag if they have some
non-snooping operations, like the special scanout case for amdgpu for example.

The patches are only smoke tested and the solution isn't ideal, but as far as
I can see it should at least keep things working.


I would like to see this solved properly. Where I think properly means
we make things work on systems with mixed coherent/noncoherent masters
in the same way the DMA API does on such systems: by inserting the
proper cache maintenance operations.


And this is the exact wrong approach as far as I can see. As Daniel noted
as well, we absolutely need some kind of coherency between exporter and
importer.


This can either be provided by the PCI specification, which makes it
mandatory for devices to snoop the caches, or by platform devices agreeing
that they only work on uncached memory.


Explicit cache flush operations are simply not part of the design of
DMA-buf because they are not part of the design of the higher-level APIs
like Vulkan and OpenGL.


Adding this to DMA-buf for the rather special use case would completely
break that and make life much more complicated for everybody.



I also think that we should keep in mind that the world is moving into
a direction where DMA masters may not only snoop the CPU caches (what
is the definition of cache coherent on x86), but actually take part in
the system coherence and are able to have writeback caches for shared
data on their own. I can only speculate, as I haven't seen the amdgpu
side yet, but I think this proposal is moving in the other direction by
assuming a central system cache, where the importer has some magic way
to clean this central cache.


What you mean is CXL: https://en.wikipedia.org/wiki/Compute_Express_Link

And yes we support that in a bunch of configurations and have also
worked with that and amdgpu with DMA-buf based sharing.


This should not be a problem with this approach.


Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
dma-buf sharing work on ARM systems and seem to hold some strong
opinions on how this should work, I guess I need to make some time
available to type it up, so we can discuss over code rather than
abstract ideas. If I come up with something that works for my use cases,
would you be up for taking a shot at an amdgpu implementation?


Well, not really. As I said, I see what you have in mind here as a
completely wrong approach which we will certainly not support in any GPU driver.


What we can do is to reject the use cases which won't work, and this is
exactly what the patch here does.


Regards,
Christian.



Regards,
Lucas





Re: Try to address the DMA-buf coherency problem

2022-10-28 Thread Lucas Stach
Hi Christian,

Am Donnerstag, dem 20.10.2022 um 14:13 +0200 schrieb Christian König:
> Hi guys,
> 
> after finding that we essentially have two separate worlds for coherent 
> sharing
> of buffer through DMA-buf I thought I will tackle that problem a bit and at
> least allow the framework to reject attachments which won't work.
> 
> So those patches here add a new dma_coherent flag to each DMA-buf object
> telling the framework that dev_is_dma_coherent() needs to return true for an
> importing device to be able to attach. Since we should always have a fallback
> path this should give userspace the chance to still keep the use case working,
> either by doing a CPU copy instead or reversing the roles of exporter and
> importer.
> 
The fallback would likely be a CPU copy with the appropriate cache
flushes done by the device driver, which would be a massive performance
issue.

> For DRM and most V4L2 devices I then fill in the dma_coherent flag based on 
> the
> return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
> the flag for their buffers if special handling like the USWC flag in amdgpu or
> the uncached allocations for radeon/nouveau are in use.
> 
I don't think the V4L2 part works for most ARM systems. The default
there is for devices to be noncoherent unless explicitly marked
otherwise. I don't think any of the "devices" writing the video buffers
in cached memory with the CPU do this. While we could probably mark
them as coherent, I don't think this is moving in the right direction.

> Additional to that importers can also check the flag if they have some
> non-snooping operations like the special scanout case for amdgpu for example.
> 
> The patches are only smoke tested and the solution isn't ideal, but as far as
> I can see should at least keep things working.
> 
I would like to see this solved properly. Where I think properly means
we make things work on systems with mixed coherent/noncoherent masters
in the same way the DMA API does on such systems: by inserting the
proper cache maintenance operations.
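
To make that concrete, a minimal sketch of the kind of bracketing meant here,
using the streaming DMA API (where exactly these calls would live, importer or
core, is of course part of the open question):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

/* Bracket a non-coherent device's DMA to a shared buffer. These syncs
 * are no-ops for coherent devices and do the required cache maintenance
 * for non-coherent ones. sgt is the mapping returned by
 * dma_buf_map_attachment() for this device. */
static void shared_dma_begin(struct device *dev, struct sg_table *sgt)
{
    dma_sync_sgtable_for_device(dev, sgt, DMA_BIDIRECTIONAL);
}

static void shared_dma_end(struct device *dev, struct sg_table *sgt)
{
    dma_sync_sgtable_for_cpu(dev, sgt, DMA_BIDIRECTIONAL);
}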

I also think that we should keep in mind that the world is moving into
a direction where DMA masters may not only snoop the CPU caches (what
is the definition of cache coherent on x86), but actually take part in
the system coherence and are able to have writeback caches for shared
data on their own. I can only speculate, as I haven't seen the amdgpu
side yet, but I think this proposal is moving in the other direction by
assuming a central system cache, where the importer has some magic way
to clean this central cache.

Since I have a vested interest in seeing V4L2 UVC and non-coherent GPU
dma-buf sharing work on ARM systems and seem to hold some strong
opinions on how this should work, I guess I need to make some time
available to type it up, so we can discuss over code rather than
abstract ideas. If I come up with something that works for my use cases,
would you be up for taking a shot at an amdgpu implementation?

Regards,
Lucas



Re: Try to address the DMA-buf coherency problem

2022-10-27 Thread Christian König

Am 20.10.22 um 14:13 schrieb Christian König:

Hi guys,

after finding that we essentially have two separate worlds for coherent sharing
of buffers through DMA-buf, I thought I would tackle that problem a bit and at
least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object
telling the framework that dev_is_dma_coherent() needs to return true for an
importing device to be able to attach. Since we should always have a fallback
path this should give userspace the chance to still keep the use case working,
either by doing a CPU copy instead or reversing the roles of exporter and
importer.

For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
the flag for their buffers if special handling like the USWC flag in amdgpu or
the uncached allocations for radeon/nouveau are in use.

In addition to that, importers can also check the flag if they have some
non-snooping operations, like the special scanout case for amdgpu for example.

The patches are only smoke tested and the solution isn't ideal, but as far as
I can see it should at least keep things working.


Gentle ping on this. Lucas, Daniel and Nicolas, you have been rather
active in the last discussion. Do you mind taking a look?


Thanks,
Christian.



Please review and/or comment,
Christian.






Try to address the DMA-buf coherency problem

2022-10-20 Thread Christian König
Hi guys,

after finding that we essentially have two separate worlds for coherent sharing
of buffers through DMA-buf, I thought I would tackle that problem a bit and at
least allow the framework to reject attachments which won't work.

So those patches here add a new dma_coherent flag to each DMA-buf object
telling the framework that dev_is_dma_coherent() needs to return true for an
importing device to be able to attach. Since we should always have a fallback
path this should give userspace the chance to still keep the use case working,
either by doing a CPU copy instead or reversing the roles of exporter and
importer.
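
A minimal sketch of how such an attach-time check could look (the actual
patches may differ; dma_coherent is the new flag proposed here):

#include <linux/dma-buf.h>
#include <linux/dma-map-ops.h>   /* dev_is_dma_coherent() */
#include <linux/errno.h>

/* Reject importers that cannot snoop when the exporter declared the
 * buffer as requiring coherent access. Userspace then falls back to a
 * CPU copy or swaps the exporter/importer roles. */
static int check_attach_coherency(struct dma_buf *dmabuf, struct device *dev)
{
    if (dmabuf->dma_coherent && !dev_is_dma_coherent(dev))
        return -EINVAL;

    return 0;
}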

For DRM and most V4L2 devices I then fill in the dma_coherent flag based on the
return value of dev_is_dma_coherent(). Exporting drivers are allowed to clear
the flag for their buffers if special handling like the USWC flag in amdgpu or
the uncached allocations for radeon/nouveau are in use.

In addition to that, importers can also check the flag if they have some
non-snooping operations, like the special scanout case for amdgpu for example.

The patches are only smoke tested and the solution isn't ideal, but as far as
I can see it should at least keep things working.

Please review and/or comment,
Christian.