Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-02-06 Thread Daniel Vetter
On Tue, Feb 06, 2024 at 02:28:35PM +0100, Christian König wrote:
> Am 31.01.24 um 10:07 schrieb Daniel Vetter:
> > On Tue, Jan 30, 2024 at 02:09:45PM +0100, Christian König wrote:
> > > Am 30.01.24 um 11:40 schrieb Daniel Vetter:
> > > > On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
> > > > > Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
> > > > > >    I would say we start with the DMA-API by getting away from 
> > > > > > sg_tables
> > > > > > to something cleaner and state oriented.
> > > > > FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset,
> > > > > which is just a dead simple
> > > > > 
> > > > > struct dma_vec {
> > > > > dma_addr_t addr;
> > > > > size_t len;
> > > > > };
> > > > > 
> > > > > (The rationale for introducing it in the IIO DMABUF patchset was that
> > > > > the "scatterlist" wouldn't allow me to change the transfer size.)
> > > > > 
> > > > > So I believe a new "sg_table"-like could just be an array of struct
> > > > > dma_vec + flags.
> > > > Yeah that's pretty much the proposal I've seen, split the sg table into
> > > > input data (struct page + len) and output data (which is the dma_addr_t 
> > > > +
> > > > len you have above).
> > > I would extend that a bit and say we have an array with
> > > dma_addr+power_of_two_order and a header structure with lower bit offset 
> > > and
> > > some DMA transaction flags.
> > > 
> > > But this is something which can be worked as an optimization later on. 
> > > For a
> > > start this proposal here looks good to me as well.
> > > 
> > > > The part I don't expect to ever happen, because it hasn't the past 20 or
> > > > so years, is that the dma-api will give us information about what is
> > > > needed to keep the buffers coherency between various devices and the 
> > > > cpu.
> > > Well maybe that's what we are doing wrong.
> > > 
> > > Instead of asking the dma-api about the necessary information we should 
> > > give
> > > the API the opportunity to work for us.
> > > 
> > > In other words we don't need the information about buffer coherency what 
> > > we
> > > need is that the API works for as and fulfills the requirements we have.
> > > 
> > > So the question is really what should we propose to change on the DMA-api
> > > side to get this working as expected?
> > So one thing I've been pondering, kinda picking up your point about CXL,
> > is that we do make the coherency protocol more explicit by adding a
> > coherency mode to dma_buf that the exporter sets. Some ideas for values
> > this could have:
> > 
> > - ATTOMIC_COHERENT: Fully cache coherent, including device/cpu atomis.
> >This would be for CXL. Non-CXL devices could still participate with the
> >old model using explicit devices flushes, but must at comply with
> >CPU_COHERENT.
> > 
> >There's also the power9-only nvlink that would fit here, but I guess
> >going forward CXL (and cache-coherent integrated gpu) would really be
> >the only users of this flag.
> > 
> >Peer2peer would have the same rules, otherwise doesn't really make
> >sense. Also we might want to forbib non-CXL imports for these buffers
> >maybe even? Not sure on that.
> > 
> > - CPU_COHERENT: device transactions do snoop cpu devices caches, but
> >devices might do their own caching which isn't snooped by the cpu and
> >needs explicit device-side invalidate/flushing. This means pcie
> >importers are not allowed to use pcie no-snoop transactions, intel igpu
> >wouldn't be allowed to use MOCS that do the same, similar for arm
> >integrated devices.
> > 
> >Importers can skip all explicit cache management apis like
> >dma_buf_begin/end_cpu_access, or the newly proposed
> >dma_buf_begin/end_device_access here.
> > 
> >We'd need to figure out what exactly this means for peer2peer
> >transactions, I have no idea whether the no-snoop flag even does
> >anything for those.
> > 
> >We might also want to split up CPU_COHERENT into CPU_COHERENT_WB and
> >CPU_WOHERENT_WC, so that importers know whether cpu reads are going to
> >be crawling or not.
> > 
> > - MEMORY_COHERENT: devices transactions do not snoop any caches, but
> >promise that all transactions are fully flushed to system memory. Any
> >devices transactions which do fill cpu caches must call the proposed
> >dma_buf_begin/end_device_access functions proposed here. Any cpu access
> >must be braketed by calls to dma_buf_begin/end_cpu_access.
> > 
> >If your device does fill cpu caches, then essentially you'd not be able
> >to import such buffers. Not sure whether we need to put the
> >responsibility of checking that onto importers or exporters. Ideally
> >core dma-buf.c code would check this.
> > 
> >Also maybe the cpu WC mapping mode would actually need to be a sub-mode
> >for MEMORY_COHERENT, because all cpu wc achieves is to avoid the need to
> >call 

Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-02-06 Thread Christian König

Am 31.01.24 um 10:07 schrieb Daniel Vetter:

On Tue, Jan 30, 2024 at 02:09:45PM +0100, Christian König wrote:

Am 30.01.24 um 11:40 schrieb Daniel Vetter:

On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:

Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :

   I would say we start with the DMA-API by getting away from sg_tables
to something cleaner and state oriented.

FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset,
which is just a dead simple

struct dma_vec {
dma_addr_t addr;
size_t len;
};

(The rationale for introducing it in the IIO DMABUF patchset was that
the "scatterlist" wouldn't allow me to change the transfer size.)

So I believe a new "sg_table"-like could just be an array of struct
dma_vec + flags.

Yeah that's pretty much the proposal I've seen, split the sg table into
input data (struct page + len) and output data (which is the dma_addr_t +
len you have above).

I would extend that a bit and say we have an array with
dma_addr+power_of_two_order and a header structure with lower bit offset and
some DMA transaction flags.

But this is something which can be worked as an optimization later on. For a
start this proposal here looks good to me as well.


The part I don't expect to ever happen, because it hasn't the past 20 or
so years, is that the dma-api will give us information about what is
needed to keep the buffers coherency between various devices and the cpu.

Well maybe that's what we are doing wrong.

Instead of asking the dma-api about the necessary information we should give
the API the opportunity to work for us.

In other words we don't need the information about buffer coherency what we
need is that the API works for as and fulfills the requirements we have.

So the question is really what should we propose to change on the DMA-api
side to get this working as expected?

So one thing I've been pondering, kinda picking up your point about CXL,
is that we do make the coherency protocol more explicit by adding a
coherency mode to dma_buf that the exporter sets. Some ideas for values
this could have:

- ATTOMIC_COHERENT: Fully cache coherent, including device/cpu atomis.
   This would be for CXL. Non-CXL devices could still participate with the
   old model using explicit devices flushes, but must at comply with
   CPU_COHERENT.

   There's also the power9-only nvlink that would fit here, but I guess
   going forward CXL (and cache-coherent integrated gpu) would really be
   the only users of this flag.

   Peer2peer would have the same rules, otherwise doesn't really make
   sense. Also we might want to forbib non-CXL imports for these buffers
   maybe even? Not sure on that.

- CPU_COHERENT: device transactions do snoop cpu devices caches, but
   devices might do their own caching which isn't snooped by the cpu and
   needs explicit device-side invalidate/flushing. This means pcie
   importers are not allowed to use pcie no-snoop transactions, intel igpu
   wouldn't be allowed to use MOCS that do the same, similar for arm
   integrated devices.

   Importers can skip all explicit cache management apis like
   dma_buf_begin/end_cpu_access, or the newly proposed
   dma_buf_begin/end_device_access here.

   We'd need to figure out what exactly this means for peer2peer
   transactions, I have no idea whether the no-snoop flag even does
   anything for those.

   We might also want to split up CPU_COHERENT into CPU_COHERENT_WB and
   CPU_WOHERENT_WC, so that importers know whether cpu reads are going to
   be crawling or not.

- MEMORY_COHERENT: devices transactions do not snoop any caches, but
   promise that all transactions are fully flushed to system memory. Any
   devices transactions which do fill cpu caches must call the proposed
   dma_buf_begin/end_device_access functions proposed here. Any cpu access
   must be braketed by calls to dma_buf_begin/end_cpu_access.

   If your device does fill cpu caches, then essentially you'd not be able
   to import such buffers. Not sure whether we need to put the
   responsibility of checking that onto importers or exporters. Ideally
   core dma-buf.c code would check this.

   Also maybe the cpu WC mapping mode would actually need to be a sub-mode
   for MEMORY_COHERENT, because all cpu wc achieves is to avoid the need to
   call dma_buf_begin/end_cpu_access, you would still need your devices to
   be memory coherent. And if they're not, then you cannot use that
   dma-buf.

   Or maybe alternatively we need to guarantee that exporters which set
   MEMORY_COHERENT implement dma_buf_begin/end_device_access to make things
   work for these cpu-coherent but not memory-coherent devices. This
   becomes very tricky with device/arch/bus specific details I think.

- DMA_API_COHERENT: The memory is allocated or mapped by the dma-api, and
   the exact coherency mode is not know. Importers _must_ braket all cpu
   and device access with the respective dma_buf functions. This is
   essentially the 

Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-31 Thread Daniel Vetter
On Tue, Jan 30, 2024 at 02:09:45PM +0100, Christian König wrote:
> Am 30.01.24 um 11:40 schrieb Daniel Vetter:
> > On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
> > > Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
> > > >   I would say we start with the DMA-API by getting away from sg_tables
> > > > to something cleaner and state oriented.
> > > FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset,
> > > which is just a dead simple
> > > 
> > > struct dma_vec {
> > >dma_addr_t addr;
> > >size_t len;
> > > };
> > > 
> > > (The rationale for introducing it in the IIO DMABUF patchset was that
> > > the "scatterlist" wouldn't allow me to change the transfer size.)
> > > 
> > > So I believe a new "sg_table"-like could just be an array of struct
> > > dma_vec + flags.
> > Yeah that's pretty much the proposal I've seen, split the sg table into
> > input data (struct page + len) and output data (which is the dma_addr_t +
> > len you have above).
> 
> I would extend that a bit and say we have an array with
> dma_addr+power_of_two_order and a header structure with lower bit offset and
> some DMA transaction flags.
> 
> But this is something which can be worked as an optimization later on. For a
> start this proposal here looks good to me as well.
> 
> > The part I don't expect to ever happen, because it hasn't the past 20 or
> > so years, is that the dma-api will give us information about what is
> > needed to keep the buffers coherency between various devices and the cpu.
> 
> Well maybe that's what we are doing wrong.
> 
> Instead of asking the dma-api about the necessary information we should give
> the API the opportunity to work for us.
> 
> In other words we don't need the information about buffer coherency what we
> need is that the API works for as and fulfills the requirements we have.
> 
> So the question is really what should we propose to change on the DMA-api
> side to get this working as expected?

So one thing I've been pondering, kinda picking up your point about CXL,
is that we do make the coherency protocol more explicit by adding a
coherency mode to dma_buf that the exporter sets. Some ideas for values
this could have:

- ATTOMIC_COHERENT: Fully cache coherent, including device/cpu atomis.
  This would be for CXL. Non-CXL devices could still participate with the
  old model using explicit devices flushes, but must at comply with
  CPU_COHERENT.

  There's also the power9-only nvlink that would fit here, but I guess
  going forward CXL (and cache-coherent integrated gpu) would really be
  the only users of this flag.

  Peer2peer would have the same rules, otherwise doesn't really make
  sense. Also we might want to forbib non-CXL imports for these buffers
  maybe even? Not sure on that.

- CPU_COHERENT: device transactions do snoop cpu devices caches, but
  devices might do their own caching which isn't snooped by the cpu and
  needs explicit device-side invalidate/flushing. This means pcie
  importers are not allowed to use pcie no-snoop transactions, intel igpu
  wouldn't be allowed to use MOCS that do the same, similar for arm
  integrated devices.

  Importers can skip all explicit cache management apis like
  dma_buf_begin/end_cpu_access, or the newly proposed
  dma_buf_begin/end_device_access here.

  We'd need to figure out what exactly this means for peer2peer
  transactions, I have no idea whether the no-snoop flag even does
  anything for those.

  We might also want to split up CPU_COHERENT into CPU_COHERENT_WB and
  CPU_WOHERENT_WC, so that importers know whether cpu reads are going to
  be crawling or not.

- MEMORY_COHERENT: devices transactions do not snoop any caches, but
  promise that all transactions are fully flushed to system memory. Any
  devices transactions which do fill cpu caches must call the proposed
  dma_buf_begin/end_device_access functions proposed here. Any cpu access
  must be braketed by calls to dma_buf_begin/end_cpu_access.

  If your device does fill cpu caches, then essentially you'd not be able
  to import such buffers. Not sure whether we need to put the
  responsibility of checking that onto importers or exporters. Ideally
  core dma-buf.c code would check this.

  Also maybe the cpu WC mapping mode would actually need to be a sub-mode
  for MEMORY_COHERENT, because all cpu wc achieves is to avoid the need to
  call dma_buf_begin/end_cpu_access, you would still need your devices to
  be memory coherent. And if they're not, then you cannot use that
  dma-buf.

  Or maybe alternatively we need to guarantee that exporters which set
  MEMORY_COHERENT implement dma_buf_begin/end_device_access to make things
  work for these cpu-coherent but not memory-coherent devices. This
  becomes very tricky with device/arch/bus specific details I think.

- DMA_API_COHERENT: The memory is allocated or mapped by the dma-api, and
  the exact coherency mode is not know. Importers _must_ braket all cpu
  and device 

Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-30 Thread Christian König

Am 30.01.24 um 11:40 schrieb Daniel Vetter:

On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:

Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :

  I would say we start with the DMA-API by getting away from sg_tables
to something cleaner and state oriented.

FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset,
which is just a dead simple

struct dma_vec {
   dma_addr_t addr;
   size_t len;
};

(The rationale for introducing it in the IIO DMABUF patchset was that
the "scatterlist" wouldn't allow me to change the transfer size.)

So I believe a new "sg_table"-like could just be an array of struct
dma_vec + flags.

Yeah that's pretty much the proposal I've seen, split the sg table into
input data (struct page + len) and output data (which is the dma_addr_t +
len you have above).


I would extend that a bit and say we have an array with 
dma_addr+power_of_two_order and a header structure with lower bit offset 
and some DMA transaction flags.


But this is something which can be worked as an optimization later on. 
For a start this proposal here looks good to me as well.



The part I don't expect to ever happen, because it hasn't the past 20 or
so years, is that the dma-api will give us information about what is
needed to keep the buffers coherency between various devices and the cpu.


Well maybe that's what we are doing wrong.

Instead of asking the dma-api about the necessary information we should 
give the API the opportunity to work for us.


In other words we don't need the information about buffer coherency what 
we need is that the API works for as and fulfills the requirements we have.


So the question is really what should we propose to change on the 
DMA-api side to get this working as expected?


Regards,
Christian.






-Sima




Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-30 Thread Daniel Vetter
On Tue, Jan 30, 2024 at 10:48:23AM +0100, Paul Cercueil wrote:
> Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
> >  I would say we start with the DMA-API by getting away from sg_tables
> > to something cleaner and state oriented. 
> 
> FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset,
> which is just a dead simple
> 
> struct dma_vec {
>   dma_addr_t addr;
>   size_t len;
> };
> 
> (The rationale for introducing it in the IIO DMABUF patchset was that
> the "scatterlist" wouldn't allow me to change the transfer size.)
> 
> So I believe a new "sg_table"-like could just be an array of struct
> dma_vec + flags.

Yeah that's pretty much the proposal I've seen, split the sg table into
input data (struct page + len) and output data (which is the dma_addr_t +
len you have above).

The part I don't expect to ever happen, because it hasn't the past 20 or
so years, is that the dma-api will give us information about what is
needed to keep the buffers coherency between various devices and the cpu.
-Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch


Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-30 Thread Daniel Vetter
On Tue, Jan 30, 2024 at 10:23:03AM +0100, Christian König wrote:
> Am 30.01.24 um 10:01 schrieb Daniel Vetter:
> > On Fri, Jan 26, 2024 at 05:42:50PM +0100, Christian König wrote:
> > > [SNIP]
> > > Well I think we should have some solution, but I'm not sure if it should 
> > > be
> > > part of DMA-buf.
> > > 
> > > Essentially a DMA-buf exports the buffers as he uses it and the importer 
> > > (or
> > > the DMA-buf subsystem) is then the one who says ok I can use this or I 
> > > can't
> > > use this or I need to call extra functions to use this or whatever.
> > > 
> > > It's not the job of the exporter to provide the coherency for the 
> > > importer,
> > > cause otherwise we would have a lot of code in the exporter which can only
> > > be tested when you have the right importer around. And I strongly think 
> > > that
> > > this is a no-go for having a reliable solution.
> > The trouble is, that if you have other memory than stuff allocated by the
> > dma-api or mapped using the dma-api, then by necessity the exporter has to
> > deal with this.
> 
> Yes, I was thinking about that as well.
> 
> > Which is the exact same reason we also force the exporters to deal with
> > the cpu cache flushing - you're argument that it's not great to replicate
> > this everywhere holds there equally.
> 
> And I'm not really happy with that either.
> 
> > The other thing is that right now the exporter is the only one who
> > actually knows what kind of dma coherency rules apply for a certain piece
> > of memory. E.g. on i915-gem even if it's dma_map_sg mapped the underlying
> > i915-gem buffer might be non-coherent, and i915-gem makes it all work by
> > doing the appropriate amount of clflush.
> 
> Yeah, exactly that's the reason why I think that this stuff doesn't belong
> into exporters/drivers.
> 
> Looking at what kind of hacks and workarounds we have in both amdgpu as well
> as i915 it's pretty clear that we need to improve this design somehow.

Yeah it's been a well-known issue, and we've very slowly improved things.

> > Similar funky things happen in other cases.
> > 
> > So unless we add an interface which allows importers to figure out how
> > much flushing is needed, currently the exporter is the only one who knows
> > (because it can inspect the struct device at dma_buf_attach time).
> > 
> > We could flip this around, but it would be a rather serious depature from
> > the dma-buf design approach thus far.
> 
> Well clients already give the DMA-direction to exporters when creating the
> mapping and get an appropriate sg_table in return.
> 
> All we need to do is getting the information what flushing is needed into
> the object returned here so that the DMA API can work with it.

So the problem is that we can provide this information from exporters that
do device specific stuff. But we cannot get this information from
exporters which just use the dma-api, whether it's dma_alloc or
dma_map_sg, because the core design principle of the dma-api is to hide
the coherency rules for device dma.

The idea is that you have the same ip on different socs, where on one the
soc needs cache flushing and on the other you dont (because different
architecture, or just the ip being connected to different interconnects),
you can use the exact same driver since the dma-api hides all this.

And at least every time it was discussed in the past, dma-api maintainers
insisted that we don't break this abstraction rule. Which means for most
exporters, we simply do not have this information available. This is also
why after epic long discussions it was decided that cache coherency was
the exporter's problem, so that from an importer pov there's no difference
between an sg list optained through dma_buf_map and an sg list obtained
from dma_map_sg or memory allocated with dma_alloc - in none of these
cases does the driver have to do its own cache management.

> Christoph Hellwig pretty much nailed it when he said that the problem with
> the sg_table is that it mixes input and output parameters of the DMA-API.

Hm my take away from these discussions was that sg as a data structure is
not a clean design, but I haven't ever seen Christoph (or anyone else from
the dma-api side) say that they're ok with leaking cache coherency
management to clients.

We couldn't even get the core arch primitives exported to drivers so that
dma-buf exporters could do the right cache management for their driver
specific allocators that entirely bypass the dma-api. I think what you're
suggesting would go way beyond that.

> I would extend that and say that we need a mapping object the DMA-API can
> work with so that it can know what needs to be done when devices request
> that data is made coherent between them or the CPU.

Personally I do think it makes sense as a design and iirc we discussed it
plenty in the early dma-buf discussions. I just don't think it's a
realistic design approach to upstream.

I think best we can hope for is a new set of device2device sync 

Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-30 Thread Paul Cercueil
Hi Christian,

(Your email software is configured for HTML btw)

Le mardi 30 janvier 2024 à 10:23 +0100, Christian König a écrit :
>  Am 30.01.24 um 10:01 schrieb Daniel Vetter:
>  
> >  
> > On Fri, Jan 26, 2024 at 05:42:50PM +0100, Christian König wrote:
> >  
> > > [SNIP] 
> > > Well I think we should have some solution, but I'm not sure if it
> > > should be
> > > part of DMA-buf.
> > > 
> > > Essentially a DMA-buf exports the buffers as he uses it and the
> > > importer (or
> > > the DMA-buf subsystem) is then the one who says ok I can use this
> > > or I can't
> > > use this or I need to call extra functions to use this or
> > > whatever.
> > > 
> > > It's not the job of the exporter to provide the coherency for the
> > > importer,
> > > cause otherwise we would have a lot of code in the exporter which
> > > can only
> > > be tested when you have the right importer around. And I strongly
> > > think that
> > > this is a no-go for having a reliable solution.
> > >  
> >  
> > The trouble is, that if you have other memory than stuff allocated
> > by the
> > dma-api or mapped using the dma-api, then by necessity the exporter
> > has to
> > deal with this.
> >  
>  
>  Yes, I was thinking about that as well.
>  
>  
> >  
> > Which is the exact same reason we also force the exporters to deal
> > with
> > the cpu cache flushing - you're argument that it's not great to
> > replicate
> > this everywhere holds there equally.
> >  
>  
>  And I'm not really happy with that either.
>  
>  
> >  
> > The other thing is that right now the exporter is the only one who
> > actually knows what kind of dma coherency rules apply for a certain
> > piece
> > of memory. E.g. on i915-gem even if it's dma_map_sg mapped the
> > underlying
> > i915-gem buffer might be non-coherent, and i915-gem makes it all
> > work by
> > doing the appropriate amount of clflush.
> >  
>  
>  Yeah, exactly that's the reason why I think that this stuff doesn't
> belong into exporters/drivers.
>  
>  Looking at what kind of hacks and workarounds we have in both amdgpu
> as well as i915 it's pretty clear that we need to improve this design
> somehow.
>  
>  
> >  
> > Similar funky things happen in other cases.
> > 
> > So unless we add an interface which allows importers to figure out
> > how
> > much flushing is needed, currently the exporter is the only one who
> > knows
> > (because it can inspect the struct device at dma_buf_attach time).
> > 
> > We could flip this around, but it would be a rather serious
> > depature from
> > the dma-buf design approach thus far.
> >  
>  
>  Well clients already give the DMA-direction to exporters when
> creating the mapping and get an appropriate sg_table in return.
>  
>  All we need to do is getting the information what flushing is needed
> into the object returned here so that the DMA API can work with it.
>  
>  Christoph Hellwig pretty much nailed it when he said that the
> problem with the sg_table is that it mixes input and output
> parameters of the DMA-API.
>  
>  I would extend that and say that we need a mapping object the DMA-
> API can work with so that it can know what needs to be done when
> devices request that data is made coherent between them or the CPU.
>  
>  
> >  
> > >  
> > > That's why I think the approach of having DMA-buf callbacks is
> > > most likely
> > > the wrong thing to do.
> > > 
> > > What should happen instead is that the DMA subsystem provides
> > > functionality
> > > which to devices which don't support coherency through it's
> > > connection to
> > > say I want to access this data, please make sure to flush the
> > > appropriate
> > > catches. But that's just a very very rough design idea.
> > > 
> > > This will become more with CXL at the horizon I think.
> > >  
> >  
> > Yeah CXL will make this all even more fun, but we are firmly there
> > already
> > with devices deciding per-buffer (or sometimes even per-access with
> > intel's MOCS stuff) what coherency mode to use for a buffer.
> > 
> > Also arm soc generally have both coherent and non-coherent device
> > interconnects, and I think some devices can switch with runtime
> > flags too
> > which mode they use for a specific transition.
> > 
> > CXL just extends this to pcie devices.
> > 
> > So the mess is here, how do we deal with it?
> >  
>  
>  I would say we start with the DMA-API by getting away from sg_tables
> to something cleaner and state oriented. 

FYI I am already adding a 'dma_vec' object in my IIO DMABUF patchset,
which is just a dead simple

struct dma_vec {
  dma_addr_t addr;
  size_t len;
};

(The rationale for introducing it in the IIO DMABUF patchset was that
the "scatterlist" wouldn't allow me to change the transfer size.)

So I believe a new "sg_table"-like could just be an array of struct
dma_vec + flags.

Cheers,
-Paul

> >  
> > 
> > My take is that the opt-in callback addition is far from great, but
> > it's
> > in line with how we extended dma-buf the past decade plus too. So
> 

Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-30 Thread Christian König

Am 30.01.24 um 10:01 schrieb Daniel Vetter:

On Fri, Jan 26, 2024 at 05:42:50PM +0100, Christian König wrote:

[SNIP]
Well I think we should have some solution, but I'm not sure if it should be
part of DMA-buf.

Essentially a DMA-buf exports the buffers as he uses it and the importer (or
the DMA-buf subsystem) is then the one who says ok I can use this or I can't
use this or I need to call extra functions to use this or whatever.

It's not the job of the exporter to provide the coherency for the importer,
cause otherwise we would have a lot of code in the exporter which can only
be tested when you have the right importer around. And I strongly think that
this is a no-go for having a reliable solution.

The trouble is, that if you have other memory than stuff allocated by the
dma-api or mapped using the dma-api, then by necessity the exporter has to
deal with this.


Yes, I was thinking about that as well.


Which is the exact same reason we also force the exporters to deal with
the cpu cache flushing - you're argument that it's not great to replicate
this everywhere holds there equally.


And I'm not really happy with that either.


The other thing is that right now the exporter is the only one who
actually knows what kind of dma coherency rules apply for a certain piece
of memory. E.g. on i915-gem even if it's dma_map_sg mapped the underlying
i915-gem buffer might be non-coherent, and i915-gem makes it all work by
doing the appropriate amount of clflush.


Yeah, exactly that's the reason why I think that this stuff doesn't 
belong into exporters/drivers.


Looking at what kind of hacks and workarounds we have in both amdgpu as 
well as i915 it's pretty clear that we need to improve this design somehow.



Similar funky things happen in other cases.

So unless we add an interface which allows importers to figure out how
much flushing is needed, currently the exporter is the only one who knows
(because it can inspect the struct device at dma_buf_attach time).

We could flip this around, but it would be a rather serious depature from
the dma-buf design approach thus far.


Well clients already give the DMA-direction to exporters when creating 
the mapping and get an appropriate sg_table in return.


All we need to do is getting the information what flushing is needed 
into the object returned here so that the DMA API can work with it.


Christoph Hellwig pretty much nailed it when he said that the problem 
with the sg_table is that it mixes input and output parameters of the 
DMA-API.


I would extend that and say that we need a mapping object the DMA-API 
can work with so that it can know what needs to be done when devices 
request that data is made coherent between them or the CPU.



That's why I think the approach of having DMA-buf callbacks is most likely
the wrong thing to do.

What should happen instead is that the DMA subsystem provides functionality
which to devices which don't support coherency through it's connection to
say I want to access this data, please make sure to flush the appropriate
catches. But that's just a very very rough design idea.

This will become more with CXL at the horizon I think.

Yeah CXL will make this all even more fun, but we are firmly there already
with devices deciding per-buffer (or sometimes even per-access with
intel's MOCS stuff) what coherency mode to use for a buffer.

Also arm soc generally have both coherent and non-coherent device
interconnects, and I think some devices can switch with runtime flags too
which mode they use for a specific transition.

CXL just extends this to pcie devices.

So the mess is here, how do we deal with it?


I would say we start with the DMA-API by getting away from sg_tables to 
something cleaner and state oriented.




My take is that the opt-in callback addition is far from great, but it's
in line with how we extended dma-buf the past decade plus too. So unless
someone's volunteering to pour some serious time into re-engineering this
all (including testing all the different device/driver<->device/driver
interactions) I think there's only really one other option: To not support
these cases at all. And I don't really like that, because it means people
will hack together something even worse in their drivers.

By adding it to dma-buf it'll stare us in our faces at least :-/


Yeah, it's the way of the least resistance. But with CXL at the horizon 
and more and more drivers using it I think it's predictable that this 
will sooner or later blow up.


Cheers,
Christian.



Cheers, Sima


Regards,
Christian.


Cheers, Sima

___
Linaro-mm-sig mailing list --linaro-mm-...@lists.linaro.org
To unsubscribe send an email tolinaro-mm-sig-le...@lists.linaro.org


Re: [Linaro-mm-sig] Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-30 Thread Daniel Vetter
On Fri, Jan 26, 2024 at 05:42:50PM +0100, Christian König wrote:
> Am 25.01.24 um 19:01 schrieb Daniel Vetter:
> > On Thu, Jan 25, 2024 at 04:00:16PM +0100, Christian König wrote:
> > > Am 24.01.24 um 11:58 schrieb Paul Cercueil:
> > > > [SNIP]
> > > > > > The problem was then that dma_buf_unmap_attachment cannot be called
> > > > > > before the dma_fence is signaled, and calling it after is already
> > > > > > too
> > > > > > late (because the fence would be signaled before the data is
> > > > > > sync'd).
> > > > >    Well what sync are you talking about? CPU sync? In DMA-buf that is
> > > > > handled differently.
> > > > >    For importers it's mandatory that they can be coherent with the
> > > > > exporter. That usually means they can snoop the CPU cache if the
> > > > > exporter can snoop the CPU cache.
> > > > I seem to have such a system where one device can snoop the CPU cache
> > > > and the other cannot. Therefore if I want to support it properly, I do
> > > > need cache flush/sync. I don't actually try to access the data using
> > > > the CPU (and when I do, I call the sync start/end ioctls).
> > > Usually that isn't a problem as long as you don't access the data with the
> > > CPU.
> > > 
> > > [SNIP]
> > > 
> > > > > > (and I *think* there is a way to force coherency in the
> > > > > > Ultrascale's
> > > > > > interconnect - we're investigating it)
> > > > >    What you can do is that instead of using udmabuf or dma-heaps is
> > > > > that the device which can't provide coherency act as exporters of the
> > > > > buffers.
> > > > >    The exporter is allowed to call sync_for_cpu/sync_for_device on 
> > > > > it's
> > > > > own buffers and also gets begin/end CPU access notfications. So you
> > > > > can then handle coherency between the exporter and the CPU.
> > > > But again that would only work if the importers would call
> > > > begin_cpu_access() / end_cpu_access(), which they don't, because they
> > > > don't actually access the data using the CPU.
> > > Wow, that is a completely new use case then.
> > > 
> > > Neither DMA-buf nor the DMA subsystem in Linux actually supports this as 
> > > far
> > > as I can see.
> > > 
> > > > Unless you mean that the exporter can call sync_for_cpu/sync_for_device
> > > > before/after every single DMA transfer so that the data appears
> > > > coherent to the importers, without them having to call
> > > > begin_cpu_access() / end_cpu_access().
> > > Yeah, I mean the importers don't have to call begin_cpu_access() /
> > > end_cpu_access() if they don't do CPU access :)
> > > 
> > > What you can still do as exporter is to call sync_for_device() and
> > > sync_for_cpu() before and after each operation on your non-coherent 
> > > device.
> > > Paired with the fence signaling that should still work fine then.
> > > 
> > > But taking a step back, this use case is not something even the low level
> > > DMA subsystem supports. That sync_for_cpu() does the right thing is
> > > coincident and not proper engineering.
> > > 
> > > What you need is a sync_device_to_device() which does the appropriate
> > > actions depending on which devices are involved.
> > > 
> > > > In which case - this would still demultiply the complexity; my USB-
> > > > functionfs interface here (and IIO interface in the separate patchset)
> > > > are not device-specific, so I'd rather keep them importers.
> > > > >    If you really don't have coherency between devices then that would
> > > > > be a really new use case and we would need much more agreement on how
> > > > > to do this.
> > > > [snip]
> > > > 
> > > > Agreed. Desiging a good generic solution would be better.
> > > > 
> > > > With that said...
> > > > 
> > > > Let's keep it out of this USB-functionfs interface for now. The
> > > > interface does work perfectly fine on platforms that don't have
> > > > coherency problems. The coherency issue in itself really is a
> > > > tangential issue.
> > > Yeah, completely agree.
> > > 
> > > > So I will send a v6 where I don't try to force the cache coherency -
> > > > and instead assume that the attached devices are coherent between
> > > > themselves.
> > > > 
> > > > But it would be even better to have a way to detect non-coherency and
> > > > return an error on attach.
> > > Take a look into the DMA subsystem. I'm pretty sure we already have
> > > something like this in there.
> > > 
> > > If nothing else helps you could take a look if the coherent memory access
> > > mask is non zero or something like that.
> > Jumping in way late and apolgies to everyone since yes I indeed suggested
> > this entire mess to Paul in some private thread.
> > 
> > And worse, I think we need it, it's just that we got away without it thus
> > far.
> > 
> > So way back at the og dma-buf kick-off dma coherency was discussed, and a
> > few things where noted:
> > - the dma api only supports device<->cpu coherency
> > - getting the full coherency model off the ground right away is probably
> >too hard, so we 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-26 Thread Christian König

Am 25.01.24 um 19:01 schrieb Daniel Vetter:

On Thu, Jan 25, 2024 at 04:00:16PM +0100, Christian König wrote:

Am 24.01.24 um 11:58 schrieb Paul Cercueil:

[SNIP]

The problem was then that dma_buf_unmap_attachment cannot be called
before the dma_fence is signaled, and calling it after is already
too
late (because the fence would be signaled before the data is
sync'd).

   Well what sync are you talking about? CPU sync? In DMA-buf that is
handled differently.
   For importers it's mandatory that they can be coherent with the
exporter. That usually means they can snoop the CPU cache if the
exporter can snoop the CPU cache.

I seem to have such a system where one device can snoop the CPU cache
and the other cannot. Therefore if I want to support it properly, I do
need cache flush/sync. I don't actually try to access the data using
the CPU (and when I do, I call the sync start/end ioctls).

Usually that isn't a problem as long as you don't access the data with the
CPU.

[SNIP]


(and I *think* there is a way to force coherency in the
Ultrascale's
interconnect - we're investigating it)

   What you can do is that instead of using udmabuf or dma-heaps is
that the device which can't provide coherency act as exporters of the
buffers.
   The exporter is allowed to call sync_for_cpu/sync_for_device on it's
own buffers and also gets begin/end CPU access notfications. So you
can then handle coherency between the exporter and the CPU.

But again that would only work if the importers would call
begin_cpu_access() / end_cpu_access(), which they don't, because they
don't actually access the data using the CPU.

Wow, that is a completely new use case then.

Neither DMA-buf nor the DMA subsystem in Linux actually supports this as far
as I can see.


Unless you mean that the exporter can call sync_for_cpu/sync_for_device
before/after every single DMA transfer so that the data appears
coherent to the importers, without them having to call
begin_cpu_access() / end_cpu_access().

Yeah, I mean the importers don't have to call begin_cpu_access() /
end_cpu_access() if they don't do CPU access :)

What you can still do as exporter is to call sync_for_device() and
sync_for_cpu() before and after each operation on your non-coherent device.
Paired with the fence signaling that should still work fine then.

But taking a step back, this use case is not something even the low level
DMA subsystem supports. That sync_for_cpu() does the right thing is
coincident and not proper engineering.

What you need is a sync_device_to_device() which does the appropriate
actions depending on which devices are involved.


In which case - this would still demultiply the complexity; my USB-
functionfs interface here (and IIO interface in the separate patchset)
are not device-specific, so I'd rather keep them importers.

   If you really don't have coherency between devices then that would
be a really new use case and we would need much more agreement on how
to do this.

[snip]

Agreed. Desiging a good generic solution would be better.

With that said...

Let's keep it out of this USB-functionfs interface for now. The
interface does work perfectly fine on platforms that don't have
coherency problems. The coherency issue in itself really is a
tangential issue.

Yeah, completely agree.


So I will send a v6 where I don't try to force the cache coherency -
and instead assume that the attached devices are coherent between
themselves.

But it would be even better to have a way to detect non-coherency and
return an error on attach.

Take a look into the DMA subsystem. I'm pretty sure we already have
something like this in there.

If nothing else helps you could take a look if the coherent memory access
mask is non zero or something like that.

Jumping in way late and apolgies to everyone since yes I indeed suggested
this entire mess to Paul in some private thread.

And worse, I think we need it, it's just that we got away without it thus
far.

So way back at the og dma-buf kick-off dma coherency was discussed, and a
few things where noted:
- the dma api only supports device<->cpu coherency
- getting the full coherency model off the ground right away is probably
   too hard, so we made the decision that where it matters, relevant
   flushing needs to be done in dma_buf_map/unmap.

If you look at the earliest patches for dma-buf we had pretty clear
language that all dma-operations should be bracketed with map/unmap. Of
course that didn't work out for drm at all, and we had to first get
dma_resv_lock and dma_fence landed and then your dynamic exporter/importer
support in just to get the buffer migration functionality working, which
was only one of the things discussed that braketing everything with
map/unmap was supposed to take care of.

The other was coherency management. But looking through archives I think
this was already agreed to be postponed for later in the original kick-off
meeting and never further discussed on the mailing list.

This worked for a 

Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-25 Thread Daniel Vetter
On Fri, Jan 19, 2024 at 03:13:57PM +0100, Paul Cercueil wrote:
> These functions should be used by device drivers when they start and
> stop accessing the data of DMABUF. It allows DMABUF importers to cache
> the dma_buf_attachment while ensuring that the data they want to access
> is available for their device when the DMA transfers take place.
> 
> Signed-off-by: Paul Cercueil 

Putting my detailed review comments here just so I don't have to remember
them any longer. We need to reach consensus on the big picture direction
first.

> 
> ---
> v5: New patch
> ---
>  drivers/dma-buf/dma-buf.c | 66 +++
>  include/linux/dma-buf.h   | 37 ++
>  2 files changed, 103 insertions(+)
> 
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index 8fe5aa67b167..a8bab6c18fcd 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct 
> dma_buf_attachment *attach,
>   * - dma_buf_mmap()
>   * - dma_buf_begin_cpu_access()
>   * - dma_buf_end_cpu_access()
> + * - dma_buf_begin_access()
> + * - dma_buf_end_access()
>   * - dma_buf_map_attachment_unlocked()
>   * - dma_buf_unmap_attachment_unlocked()
>   * - dma_buf_vmap_unlocked()
> @@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, 
> struct iosys_map *map)
>  }
>  EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
>  
> +/**
> + * @dma_buf_begin_access - Call before any hardware access from/to the DMABUF
> + * @attach:  [in]attachment used for hardware access
> + * @sg_table:[in]scatterlist used for the DMA transfer
> + * @direction:  [in]direction of DMA transfer

I think for the kerneldoc would be good to point at the other function
here, explain why this might be needed and that for most reasonable
devices it's probably not, and link between the function pairs.

Also we need to document that dma_buf_map does an implied
dma_buf_begin_access (because dma_sg_map does an implied
dma_sg_sync_for_device) and vice versa for dma_buf_end_access. Which also
means that dma_buf_map/unmap should link to these functions in their
kerneldoc too.

Finally I think we should document here that it's ok to call these from
dma_fence signalling critical section and link to the relevant discussion
in the dma_fence docs for that.

> + */
> +int dma_buf_begin_access(struct dma_buf_attachment *attach,
> +  struct sg_table *sgt, enum dma_data_direction dir)
> +{
> + struct dma_buf *dmabuf;
> + bool cookie;
> + int ret;
> +
> + if (WARN_ON(!attach))
> + return -EINVAL;
> +
> + dmabuf = attach->dmabuf;
> +
> + if (!dmabuf->ops->begin_access)
> + return 0;
> +
> + cookie = dma_fence_begin_signalling();
> + ret = dmabuf->ops->begin_access(attach, sgt, dir);
> + dma_fence_end_signalling(cookie);
> +
> + if (WARN_ON_ONCE(ret))
> + return ret;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);

So explicit device side coherency management is not going to be very
compatible with dynamic buffer managament where the exporter can move the
buffer around. The reason for that is that for a dynamic exporter we cache
the sg mapping, which means any device-side coherency management which
dma_buf_map/unmap would do will not happen (since it's cached),
potentially breaking things for importers that rely on the assumption that
dma_buf_map/unmap already implies dma_buf_begin/end_device_access.

I think for now it's sufficient to put a WARN_ON(dma_buf_is_dymamic() &&
ops->begin|end_access) or similar into dma_buf_export and bail out with an
error to catch that.

Aside from the nits I do think this is roughly what we brievely discussed
well over a decade ago in the original dma-buf kickoff meeting at a linaro
connect in Budapest :-)

Cheers, Sima

> +
> +/**
> + * @dma_buf_end_access - Call after any hardware access from/to the DMABUF
> + * @attach:  [in]attachment used for hardware access
> + * @sg_table:[in]scatterlist used for the DMA transfer
> + * @direction:  [in]direction of DMA transfer
> + */
> +int dma_buf_end_access(struct dma_buf_attachment *attach,
> +struct sg_table *sgt, enum dma_data_direction dir)
> +{
> + struct dma_buf *dmabuf;
> + bool cookie;
> + int ret;
> +
> + if (WARN_ON(!attach))
> + return -EINVAL;
> +
> + dmabuf = attach->dmabuf;
> +
> + if (!dmabuf->ops->end_access)
> + return 0;
> +
> + cookie = dma_fence_begin_signalling();
> + ret = dmabuf->ops->end_access(attach, sgt, dir);
> + dma_fence_end_signalling(cookie);
> +
> + if (WARN_ON_ONCE(ret))
> + return ret;
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
> +
>  #ifdef CONFIG_DEBUG_FS
>  static int dma_buf_debug_show(struct 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-25 Thread Daniel Vetter
On Thu, Jan 25, 2024 at 04:00:16PM +0100, Christian König wrote:
> Am 24.01.24 um 11:58 schrieb Paul Cercueil:
> > [SNIP]
> > > > The problem was then that dma_buf_unmap_attachment cannot be called
> > > > before the dma_fence is signaled, and calling it after is already
> > > > too
> > > > late (because the fence would be signaled before the data is
> > > > sync'd).
> > >   Well what sync are you talking about? CPU sync? In DMA-buf that is
> > > handled differently.
> > >   For importers it's mandatory that they can be coherent with the
> > > exporter. That usually means they can snoop the CPU cache if the
> > > exporter can snoop the CPU cache.
> > I seem to have such a system where one device can snoop the CPU cache
> > and the other cannot. Therefore if I want to support it properly, I do
> > need cache flush/sync. I don't actually try to access the data using
> > the CPU (and when I do, I call the sync start/end ioctls).
> 
> Usually that isn't a problem as long as you don't access the data with the
> CPU.
> 
> [SNIP]
> 
> > > > (and I *think* there is a way to force coherency in the
> > > > Ultrascale's
> > > > interconnect - we're investigating it)
> > >   What you can do is that instead of using udmabuf or dma-heaps is
> > > that the device which can't provide coherency act as exporters of the
> > > buffers.
> > >   The exporter is allowed to call sync_for_cpu/sync_for_device on it's
> > > own buffers and also gets begin/end CPU access notfications. So you
> > > can then handle coherency between the exporter and the CPU.
> > But again that would only work if the importers would call
> > begin_cpu_access() / end_cpu_access(), which they don't, because they
> > don't actually access the data using the CPU.
> 
> Wow, that is a completely new use case then.
> 
> Neither DMA-buf nor the DMA subsystem in Linux actually supports this as far
> as I can see.
> 
> > Unless you mean that the exporter can call sync_for_cpu/sync_for_device
> > before/after every single DMA transfer so that the data appears
> > coherent to the importers, without them having to call
> > begin_cpu_access() / end_cpu_access().
> 
> Yeah, I mean the importers don't have to call begin_cpu_access() /
> end_cpu_access() if they don't do CPU access :)
> 
> What you can still do as exporter is to call sync_for_device() and
> sync_for_cpu() before and after each operation on your non-coherent device.
> Paired with the fence signaling that should still work fine then.
> 
> But taking a step back, this use case is not something even the low level
> DMA subsystem supports. That sync_for_cpu() does the right thing is
> coincident and not proper engineering.
> 
> What you need is a sync_device_to_device() which does the appropriate
> actions depending on which devices are involved.
> 
> > In which case - this would still demultiply the complexity; my USB-
> > functionfs interface here (and IIO interface in the separate patchset)
> > are not device-specific, so I'd rather keep them importers.
> > >   If you really don't have coherency between devices then that would
> > > be a really new use case and we would need much more agreement on how
> > > to do this.
> > [snip]
> > 
> > Agreed. Desiging a good generic solution would be better.
> > 
> > With that said...
> > 
> > Let's keep it out of this USB-functionfs interface for now. The
> > interface does work perfectly fine on platforms that don't have
> > coherency problems. The coherency issue in itself really is a
> > tangential issue.
> 
> Yeah, completely agree.
> 
> > So I will send a v6 where I don't try to force the cache coherency -
> > and instead assume that the attached devices are coherent between
> > themselves.
> > 
> > But it would be even better to have a way to detect non-coherency and
> > return an error on attach.
> 
> Take a look into the DMA subsystem. I'm pretty sure we already have
> something like this in there.
> 
> If nothing else helps you could take a look if the coherent memory access
> mask is non zero or something like that.

Jumping in way late and apolgies to everyone since yes I indeed suggested
this entire mess to Paul in some private thread.

And worse, I think we need it, it's just that we got away without it thus
far.

So way back at the og dma-buf kick-off dma coherency was discussed, and a
few things where noted:
- the dma api only supports device<->cpu coherency
- getting the full coherency model off the ground right away is probably
  too hard, so we made the decision that where it matters, relevant
  flushing needs to be done in dma_buf_map/unmap.

If you look at the earliest patches for dma-buf we had pretty clear
language that all dma-operations should be bracketed with map/unmap. Of
course that didn't work out for drm at all, and we had to first get
dma_resv_lock and dma_fence landed and then your dynamic exporter/importer
support in just to get the buffer migration functionality working, which
was only one of the things discussed that braketing 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-25 Thread Christian König

Am 24.01.24 um 11:58 schrieb Paul Cercueil:

[SNIP]
  
The problem was then that dma_buf_unmap_attachment cannot be called

before the dma_fence is signaled, and calling it after is already
too
late (because the fence would be signaled before the data is
sync'd).
  
  
  Well what sync are you talking about? CPU sync? In DMA-buf that is

handled differently.
  
  For importers it's mandatory that they can be coherent with the

exporter. That usually means they can snoop the CPU cache if the
exporter can snoop the CPU cache.

I seem to have such a system where one device can snoop the CPU cache
and the other cannot. Therefore if I want to support it properly, I do
need cache flush/sync. I don't actually try to access the data using
the CPU (and when I do, I call the sync start/end ioctls).


Usually that isn't a problem as long as you don't access the data with 
the CPU.


[SNIP]


(and I *think* there is a way to force coherency in the
Ultrascale's
interconnect - we're investigating it)
  
  
  What you can do is that instead of using udmabuf or dma-heaps is

that the device which can't provide coherency act as exporters of the
buffers.
  
  The exporter is allowed to call sync_for_cpu/sync_for_device on it's

own buffers and also gets begin/end CPU access notfications. So you
can then handle coherency between the exporter and the CPU.

But again that would only work if the importers would call
begin_cpu_access() / end_cpu_access(), which they don't, because they
don't actually access the data using the CPU.


Wow, that is a completely new use case then.

Neither DMA-buf nor the DMA subsystem in Linux actually supports this as 
far as I can see.



Unless you mean that the exporter can call sync_for_cpu/sync_for_device
before/after every single DMA transfer so that the data appears
coherent to the importers, without them having to call
begin_cpu_access() / end_cpu_access().


Yeah, I mean the importers don't have to call begin_cpu_access() / 
end_cpu_access() if they don't do CPU access :)


What you can still do as exporter is to call sync_for_device() and 
sync_for_cpu() before and after each operation on your non-coherent 
device. Paired with the fence signaling that should still work fine then.


But taking a step back, this use case is not something even the low 
level DMA subsystem supports. That sync_for_cpu() does the right thing 
is coincident and not proper engineering.


What you need is a sync_device_to_device() which does the appropriate 
actions depending on which devices are involved.



In which case - this would still demultiply the complexity; my USB-
functionfs interface here (and IIO interface in the separate patchset)
are not device-specific, so I'd rather keep them importers.
  

  If you really don't have coherency between devices then that would
be a really new use case and we would need much more agreement on how
to do this.

[snip]

Agreed. Desiging a good generic solution would be better.

With that said...

Let's keep it out of this USB-functionfs interface for now. The
interface does work perfectly fine on platforms that don't have
coherency problems. The coherency issue in itself really is a
tangential issue.


Yeah, completely agree.


So I will send a v6 where I don't try to force the cache coherency -
and instead assume that the attached devices are coherent between
themselves.

But it would be even better to have a way to detect non-coherency and
return an error on attach.


Take a look into the DMA subsystem. I'm pretty sure we already have 
something like this in there.


If nothing else helps you could take a look if the coherent memory 
access mask is non zero or something like that.


Regards,
Christian.



Cheers,
-Paul


Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-24 Thread Paul Cercueil
Hi Andrew,

Le mercredi 24 janvier 2024 à 09:38 -0600, Andrew Davis a écrit :
> On 1/24/24 4:58 AM, Paul Cercueil wrote:
> > Hi Christian,
> > 
> > Le mardi 23 janvier 2024 à 14:28 +0100, Christian König a écrit :
> > >   Am 23.01.24 um 14:02 schrieb Paul Cercueil:
> > >   
> > > > [SNIP]
> > > >   
> > > > >   
> > > > > >    
> > > > > > >   
> > > > > > > That an exporter has to call extra functions to access
> > > > > > > his
> > > > > > > own
> > > > > > > buffers
> > > > > > > is a complete no-go for the design since this forces
> > > > > > > exporters
> > > > > > > into
> > > > > > > doing extra steps for allowing importers to access their
> > > > > > > data.
> > > > > > >   
> > > > > >   
> > > > > > Then what about we add these dma_buf_{begin,end}_access(),
> > > > > > with
> > > > > > only
> > > > > > implementations for "dumb" exporters e.g. udmabuf or the
> > > > > > dmabuf
> > > > > > heaps?
> > > > > > And only importers (who cache the mapping and actually care
> > > > > > about
> > > > > > non-
> > > > > > coherency) would have to call these.
> > > > > >   
> > > > >   
> > > > > No, the problem is still that you would have to change all
> > > > > importers
> > > > > to
> > > > > mandatory use dma_buf_begin/end.
> > > > > 
> > > > > But going a step back caching the mapping is irrelevant for
> > > > > coherency.
> > > > > Even if you don't cache the mapping you don't get coherency.
> > > > >   
> > > >   
> > > > You actually do - at least with udmabuf, as in that case
> > > > dma_buf_map_attachment() / dma_buf_unmap_attachment() will
> > > > handle
> > > > cache
> > > > coherency when the SGs are mapped/unmapped.
> > > >   
> > >   
> > >   Well I just double checked the source in 6.7.1 and I can't see
> > > udmabuf doing anything for cache coherency in map/unmap.
> > >   
> > >   All it does is calling dma_map_sgtable() and
> > > dma_unmap_sgtable() to
> > > create and destroy the SG table and those are not supposed to
> > > sync
> > > anything to the CPU cache.
> > >   
> > >   In other words drivers usually use DMA_ATTR_SKIP_CPU_SYNC here,
> > > it's
> > > just that this is missing from udmabuf.
> > 
> > Ok.
> >   
> > > >   
> > > > The problem was then that dma_buf_unmap_attachment cannot be
> > > > called
> > > > before the dma_fence is signaled, and calling it after is
> > > > already
> > > > too
> > > > late (because the fence would be signaled before the data is
> > > > sync'd).
> > > >   
> > >   
> > >   Well what sync are you talking about? CPU sync? In DMA-buf that
> > > is
> > > handled differently.
> > >   
> > >   For importers it's mandatory that they can be coherent with the
> > > exporter. That usually means they can snoop the CPU cache if the
> > > exporter can snoop the CPU cache.
> > 
> > I seem to have such a system where one device can snoop the CPU
> > cache
> > and the other cannot. Therefore if I want to support it properly, I
> > do
> > need cache flush/sync. I don't actually try to access the data
> > using
> > the CPU (and when I do, I call the sync start/end ioctls).
> > 
> 
> If you don't access the data using the CPU, then how did the data
> end up in the CPU caches? If you have a device that can write-
> allocate
> into your CPU cache, but some other device in the system cannot snoop
> that data back out then that is just broken and those devices cannot
> reasonably share buffers..

I think that's what happens, yes.

> Now we do have systems where some hardware can snoop CPU(or L3)
> caches
> and others cannot, but they will not *allocate* into those caches
> (unless they also have the ability to sync them without CPU in the
> loop).
> 
> Your problem may be if you are still using udmabuf driver as your
> DMA-BUF exporter, which as said before is broken (and I just sent
> some
> patches with a few fixes just for you :)). For udmabuf, data starts
> in the CPU domain (in caches) and is only ever synced for the CPU,
> not for attached devices. So in this case the writing device might
> update those cache lines but a non-snooping reader would never see
> those updates.

I tried today with the system dma-heap, and the behaviour was the same.
Adding an implementation of .dma_buf_begin/end_access() to it made it
work there too.

> I'm not saying there isn't a need for these new {begin,end}_access()
> functions. I can think of a few interesting usecases, but as you
> say below that would be good to work out in a different series.

Yep, but it's a can of worms I'd rather not open if I can avoid it :)

> Andrew

Cheers,
-Paul

> 
> > 
> > >   For exporters you can implement the begin/end CPU access
> > > functions
> > > which allows you to implement something even if your exporting
> > > device
> > > can't snoop the CPU cache.
> > 
> > That only works if the importers call the begin_cpu_access() /
> > end_cpu_access(), which they don't.
> > 
> >   
> > > > Daniel / Sima suggested then that I cache the mapping and add
> > > > new
> > > > functions to ensure cache 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-24 Thread Andrew Davis

On 1/24/24 4:58 AM, Paul Cercueil wrote:

Hi Christian,

Le mardi 23 janvier 2024 à 14:28 +0100, Christian König a écrit :

  Am 23.01.24 um 14:02 schrieb Paul Cercueil:
  

[SNIP]
  
  
   
  
That an exporter has to call extra functions to access his

own
buffers
is a complete no-go for the design since this forces
exporters
into
doing extra steps for allowing importers to access their
data.
  
  
Then what about we add these dma_buf_{begin,end}_access(), with

only
implementations for "dumb" exporters e.g. udmabuf or the dmabuf
heaps?
And only importers (who cache the mapping and actually care
about
non-
coherency) would have to call these.
  
  
No, the problem is still that you would have to change all

importers
to
mandatory use dma_buf_begin/end.

But going a step back caching the mapping is irrelevant for
coherency.
Even if you don't cache the mapping you don't get coherency.
  
  
You actually do - at least with udmabuf, as in that case

dma_buf_map_attachment() / dma_buf_unmap_attachment() will handle
cache
coherency when the SGs are mapped/unmapped.
  
  
  Well I just double checked the source in 6.7.1 and I can't see

udmabuf doing anything for cache coherency in map/unmap.
  
  All it does is calling dma_map_sgtable() and dma_unmap_sgtable() to

create and destroy the SG table and those are not supposed to sync
anything to the CPU cache.
  
  In other words drivers usually use DMA_ATTR_SKIP_CPU_SYNC here, it's

just that this is missing from udmabuf.


Ok.
  
  
The problem was then that dma_buf_unmap_attachment cannot be called

before the dma_fence is signaled, and calling it after is already
too
late (because the fence would be signaled before the data is
sync'd).
  
  
  Well what sync are you talking about? CPU sync? In DMA-buf that is

handled differently.
  
  For importers it's mandatory that they can be coherent with the

exporter. That usually means they can snoop the CPU cache if the
exporter can snoop the CPU cache.


I seem to have such a system where one device can snoop the CPU cache
and the other cannot. Therefore if I want to support it properly, I do
need cache flush/sync. I don't actually try to access the data using
the CPU (and when I do, I call the sync start/end ioctls).



If you don't access the data using the CPU, then how did the data
end up in the CPU caches? If you have a device that can write-allocate
into your CPU cache, but some other device in the system cannot snoop
that data back out then that is just broken and those devices cannot
reasonably share buffers..

Now we do have systems where some hardware can snoop CPU(or L3) caches
and others cannot, but they will not *allocate* into those caches
(unless they also have the ability to sync them without CPU in the loop).

Your problem may be if you are still using udmabuf driver as your
DMA-BUF exporter, which as said before is broken (and I just sent some
patches with a few fixes just for you :)). For udmabuf, data starts
in the CPU domain (in caches) and is only ever synced for the CPU,
not for attached devices. So in this case the writing device might
update those cache lines but a non-snooping reader would never see
those updates.

I'm not saying there isn't a need for these new {begin,end}_access()
functions. I can think of a few interesting usecases, but as you
say below that would be good to work out in a different series.

Andrew




  For exporters you can implement the begin/end CPU access functions
which allows you to implement something even if your exporting device
can't snoop the CPU cache.


That only works if the importers call the begin_cpu_access() /
end_cpu_access(), which they don't.

  

Daniel / Sima suggested then that I cache the mapping and add new
functions to ensure cache coherency, which is what these patches
are
about.
  
  
  Yeah, I've now catched up on the latest mail. Sorry I haven't seen

that before.
  
  
  

  
  
In other words exporters are not require to call sync_to_cpu or

sync_to_device when you create a mapping.

What exactly is your use case here? And why does coherency
matters?
  
  
My use-case is, I create DMABUFs with udmabuf, that I attach to

USB/functionfs with the interface introduced by this patchset. I
attach
them to IIO with a similar interface (being upstreamed in
parallel),
and transfer data from USB to IIO and vice-versa in a zero-copy
fashion.

This works perfectly fine as long as the USB and IIO hardware are
coherent between themselves, which is the case on most of our
boards.
However I do have a board (with a Xilinx Ultrascale SoC) where it
is
not the case, and cache flushes/sync are needed. So I was trying to
rework these new interfaces to work on that system too.
  
  
  Yeah, that sounds strongly like one of the use cases we have

rejected so far.
  
  
  
  
If this really is a no-no, then I am fine with the assumption that

devices sharing a DMABUF must be coherent between themselves; but
that's something that should probably be enforced rather 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-24 Thread Paul Cercueil
Hi Christian,

Le mardi 23 janvier 2024 à 14:28 +0100, Christian König a écrit :
>  Am 23.01.24 um 14:02 schrieb Paul Cercueil:
>  
> > [SNIP]
> >  
> > >  
> > > >   
> > > > >  
> > > > > That an exporter has to call extra functions to access his
> > > > > own
> > > > > buffers
> > > > > is a complete no-go for the design since this forces
> > > > > exporters
> > > > > into
> > > > > doing extra steps for allowing importers to access their
> > > > > data.
> > > > >  
> > > >  
> > > > Then what about we add these dma_buf_{begin,end}_access(), with
> > > > only
> > > > implementations for "dumb" exporters e.g. udmabuf or the dmabuf
> > > > heaps?
> > > > And only importers (who cache the mapping and actually care
> > > > about
> > > > non-
> > > > coherency) would have to call these.
> > > >  
> > >  
> > > No, the problem is still that you would have to change all
> > > importers
> > > to 
> > > mandatory use dma_buf_begin/end.
> > > 
> > > But going a step back caching the mapping is irrelevant for
> > > coherency. 
> > > Even if you don't cache the mapping you don't get coherency.
> > >  
> >  
> > You actually do - at least with udmabuf, as in that case
> > dma_buf_map_attachment() / dma_buf_unmap_attachment() will handle
> > cache
> > coherency when the SGs are mapped/unmapped.
> >  
>  
>  Well I just double checked the source in 6.7.1 and I can't see
> udmabuf doing anything for cache coherency in map/unmap.
>  
>  All it does is calling dma_map_sgtable() and dma_unmap_sgtable() to
> create and destroy the SG table and those are not supposed to sync
> anything to the CPU cache.
>  
>  In other words drivers usually use DMA_ATTR_SKIP_CPU_SYNC here, it's
> just that this is missing from udmabuf. 

Ok.
 
> >  
> > The problem was then that dma_buf_unmap_attachment cannot be called
> > before the dma_fence is signaled, and calling it after is already
> > too
> > late (because the fence would be signaled before the data is
> > sync'd).
> >  
>  
>  Well what sync are you talking about? CPU sync? In DMA-buf that is
> handled differently.
>  
>  For importers it's mandatory that they can be coherent with the
> exporter. That usually means they can snoop the CPU cache if the
> exporter can snoop the CPU cache.

I seem to have such a system where one device can snoop the CPU cache
and the other cannot. Therefore if I want to support it properly, I do
need cache flush/sync. I don't actually try to access the data using
the CPU (and when I do, I call the sync start/end ioctls).


>  For exporters you can implement the begin/end CPU access functions
> which allows you to implement something even if your exporting device
> can't snoop the CPU cache.

That only works if the importers call the begin_cpu_access() /
end_cpu_access(), which they don't.

 
> > Daniel / Sima suggested then that I cache the mapping and add new
> > functions to ensure cache coherency, which is what these patches
> > are
> > about.
> >  
>  
>  Yeah, I've now catched up on the latest mail. Sorry I haven't seen
> that before.
>  
>  
> >  
> > 
> >  
> > >  
> > > In other words exporters are not require to call sync_to_cpu or 
> > > sync_to_device when you create a mapping.
> > > 
> > > What exactly is your use case here? And why does coherency
> > > matters?
> > >  
> >  
> > My use-case is, I create DMABUFs with udmabuf, that I attach to
> > USB/functionfs with the interface introduced by this patchset. I
> > attach
> > them to IIO with a similar interface (being upstreamed in
> > parallel),
> > and transfer data from USB to IIO and vice-versa in a zero-copy
> > fashion.
> > 
> > This works perfectly fine as long as the USB and IIO hardware are
> > coherent between themselves, which is the case on most of our
> > boards.
> > However I do have a board (with a Xilinx Ultrascale SoC) where it
> > is
> > not the case, and cache flushes/sync are needed. So I was trying to
> > rework these new interfaces to work on that system too.
> >  
>  
>  Yeah, that sounds strongly like one of the use cases we have
> rejected so far.
>  
>  
>  
> >  
> > If this really is a no-no, then I am fine with the assumption that
> > devices sharing a DMABUF must be coherent between themselves; but
> > that's something that should probably be enforced rather than
> > assumed.
> > 
> > (and I *think* there is a way to force coherency in the
> > Ultrascale's
> > interconnect - we're investigating it)
> >  
>  
>  What you can do is that instead of using udmabuf or dma-heaps is
> that the device which can't provide coherency act as exporters of the
> buffers.
>  
>  The exporter is allowed to call sync_for_cpu/sync_for_device on it's
> own buffers and also gets begin/end CPU access notfications. So you
> can then handle coherency between the exporter and the CPU.

But again that would only work if the importers would call
begin_cpu_access() / end_cpu_access(), which they don't, because they
don't actually access the data using the CPU.

Unless you mean that the 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-23 Thread Christian König

Am 23.01.24 um 14:02 schrieb Paul Cercueil:

[SNIP]

That an exporter has to call extra functions to access his own
buffers
is a complete no-go for the design since this forces exporters
into
doing extra steps for allowing importers to access their data.

Then what about we add these dma_buf_{begin,end}_access(), with
only
implementations for "dumb" exporters e.g. udmabuf or the dmabuf
heaps?
And only importers (who cache the mapping and actually care about
non-
coherency) would have to call these.

No, the problem is still that you would have to change all importers
to
mandatory use dma_buf_begin/end.

But going a step back caching the mapping is irrelevant for
coherency.
Even if you don't cache the mapping you don't get coherency.

You actually do - at least with udmabuf, as in that case
dma_buf_map_attachment() / dma_buf_unmap_attachment() will handle cache
coherency when the SGs are mapped/unmapped.


Well I just double checked the source in 6.7.1 and I can't see udmabuf 
doing anything for cache coherency in map/unmap.


All it does is calling dma_map_sgtable() and dma_unmap_sgtable() to 
create and destroy the SG table and those are not supposed to sync 
anything to the CPU cache.


In other words drivers usually use DMA_ATTR_SKIP_CPU_SYNC here, it's 
just that this is missing from udmabuf.



The problem was then that dma_buf_unmap_attachment cannot be called
before the dma_fence is signaled, and calling it after is already too
late (because the fence would be signaled before the data is sync'd).


Well what sync are you talking about? CPU sync? In DMA-buf that is 
handled differently.


For importers it's mandatory that they can be coherent with the 
exporter. That usually means they can snoop the CPU cache if the 
exporter can snoop the CPU cache.


For exporters you can implement the begin/end CPU access functions which 
allows you to implement something even if your exporting device can't 
snoop the CPU cache.



Daniel / Sima suggested then that I cache the mapping and add new
functions to ensure cache coherency, which is what these patches are
about.


Yeah, I've now catched up on the latest mail. Sorry I haven't seen that 
before.





In other words exporters are not require to call sync_to_cpu or
sync_to_device when you create a mapping.

What exactly is your use case here? And why does coherency matters?

My use-case is, I create DMABUFs with udmabuf, that I attach to
USB/functionfs with the interface introduced by this patchset. I attach
them to IIO with a similar interface (being upstreamed in parallel),
and transfer data from USB to IIO and vice-versa in a zero-copy
fashion.

This works perfectly fine as long as the USB and IIO hardware are
coherent between themselves, which is the case on most of our boards.
However I do have a board (with a Xilinx Ultrascale SoC) where it is
not the case, and cache flushes/sync are needed. So I was trying to
rework these new interfaces to work on that system too.


Yeah, that sounds strongly like one of the use cases we have rejected so 
far.



If this really is a no-no, then I am fine with the assumption that
devices sharing a DMABUF must be coherent between themselves; but
that's something that should probably be enforced rather than assumed.

(and I *think* there is a way to force coherency in the Ultrascale's
interconnect - we're investigating it)


What you can do is that instead of using udmabuf or dma-heaps is that 
the device which can't provide coherency act as exporters of the buffers.


The exporter is allowed to call sync_for_cpu/sync_for_device on it's own 
buffers and also gets begin/end CPU access notfications. So you can then 
handle coherency between the exporter and the CPU.


If you really don't have coherency between devices then that would be a 
really new use case and we would need much more agreement on how to do this.


Regards,
Christian.



Cheers,
-Paul


At the very least, is there a way to check that "the data can be
accessed coherently by the involved devices"? So that my importer
can
EPERM if there is no coherency vs. a device that's already
attached.

Yeah, there is functionality for this in the DMA subsystem. I've once
created prototype patches for enforcing the same coherency approach
between importer and exporter, but we never got around to upstream
them.




Cheers,
-Paul


That in turn is pretty much un-testable unless you have every
possible
importer around while testing the exporter.

Regards,
Christian.


Regards,
Christian.

Cheers,
-Paul


Signed-off-by: Paul Cercueil

---
v5: New patch
---
     drivers/dma-buf/dma-buf.c | 66
+++
     include/linux/dma-buf.h   | 37 ++
     2 files changed, 103 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-
buf/dma-
buf.c
index 8fe5aa67b167..a8bab6c18fcd 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -830,6 +830,8 @@ static struct sg_table *
__map_dma_buf(struct

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-23 Thread Paul Cercueil
Le mardi 23 janvier 2024 à 12:52 +0100, Christian König a écrit :
> Am 23.01.24 um 11:10 schrieb Paul Cercueil:
> > Hi Christian,
> > 
> > Le lundi 22 janvier 2024 à 14:41 +0100, Christian König a écrit :
> > > Am 22.01.24 um 12:01 schrieb Paul Cercueil:
> > > > Hi Christian,
> > > > 
> > > > Le lundi 22 janvier 2024 à 11:35 +0100, Christian König a
> > > > écrit :
> > > > > Am 19.01.24 um 15:13 schrieb Paul Cercueil:
> > > > > > These functions should be used by device drivers when they
> > > > > > start
> > > > > > and
> > > > > > stop accessing the data of DMABUF. It allows DMABUF
> > > > > > importers
> > > > > > to
> > > > > > cache
> > > > > > the dma_buf_attachment while ensuring that the data they
> > > > > > want
> > > > > > to
> > > > > > access
> > > > > > is available for their device when the DMA transfers take
> > > > > > place.
> > > > > As Daniel already noted as well this is a complete no-go from
> > > > > the
> > > > > DMA-buf design point of view.
> > > > What do you mean "as Daniel already noted"? It was him who
> > > > suggested
> > > > this.
> > > Sorry, I haven't fully catched up to the discussion then.
> > > 
> > > In general DMA-buf is build around the idea that the data can be
> > > accessed coherently by the involved devices.
> > > 
> > > Having a begin/end of access for devices was brought up multiple
> > > times
> > > but so far rejected for good reasons.
> > I would argue that if it was brought up multiple times, then there
> > are
> > also good reasons to support such a mechanism.
> > 
> > > That an exporter has to call extra functions to access his own
> > > buffers
> > > is a complete no-go for the design since this forces exporters
> > > into
> > > doing extra steps for allowing importers to access their data.
> > Then what about we add these dma_buf_{begin,end}_access(), with
> > only
> > implementations for "dumb" exporters e.g. udmabuf or the dmabuf
> > heaps?
> > And only importers (who cache the mapping and actually care about
> > non-
> > coherency) would have to call these.
> 
> No, the problem is still that you would have to change all importers
> to 
> mandatory use dma_buf_begin/end.
> 
> But going a step back caching the mapping is irrelevant for
> coherency. 
> Even if you don't cache the mapping you don't get coherency.

You actually do - at least with udmabuf, as in that case
dma_buf_map_attachment() / dma_buf_unmap_attachment() will handle cache
coherency when the SGs are mapped/unmapped.

The problem was then that dma_buf_unmap_attachment cannot be called
before the dma_fence is signaled, and calling it after is already too
late (because the fence would be signaled before the data is sync'd).

Daniel / Sima suggested then that I cache the mapping and add new
functions to ensure cache coherency, which is what these patches are
about.

> In other words exporters are not require to call sync_to_cpu or 
> sync_to_device when you create a mapping.
> 
> What exactly is your use case here? And why does coherency matters?

My use-case is, I create DMABUFs with udmabuf, that I attach to
USB/functionfs with the interface introduced by this patchset. I attach
them to IIO with a similar interface (being upstreamed in parallel),
and transfer data from USB to IIO and vice-versa in a zero-copy
fashion.

This works perfectly fine as long as the USB and IIO hardware are
coherent between themselves, which is the case on most of our boards.
However I do have a board (with a Xilinx Ultrascale SoC) where it is
not the case, and cache flushes/sync are needed. So I was trying to
rework these new interfaces to work on that system too.

If this really is a no-no, then I am fine with the assumption that
devices sharing a DMABUF must be coherent between themselves; but
that's something that should probably be enforced rather than assumed.

(and I *think* there is a way to force coherency in the Ultrascale's
interconnect - we're investigating it)

Cheers,
-Paul

> > At the very least, is there a way to check that "the data can be
> > accessed coherently by the involved devices"? So that my importer
> > can
> > EPERM if there is no coherency vs. a device that's already
> > attached.
> 
> Yeah, there is functionality for this in the DMA subsystem. I've once
> created prototype patches for enforcing the same coherency approach 
> between importer and exporter, but we never got around to upstream
> them.
> 
> 
> 
> > 
> > Cheers,
> > -Paul
> > 
> > > That in turn is pretty much un-testable unless you have every
> > > possible
> > > importer around while testing the exporter.
> > > 
> > > Regards,
> > > Christian.
> > > 
> > > > > Regards,
> > > > > Christian.
> > > > Cheers,
> > > > -Paul
> > > > 
> > > > > > Signed-off-by: Paul Cercueil 
> > > > > > 
> > > > > > ---
> > > > > > v5: New patch
> > > > > > ---
> > > > > >     drivers/dma-buf/dma-buf.c | 66
> > > > > > +++
> > > > > >     include/linux/dma-buf.h   | 37 ++
> > > > > >   

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-23 Thread Christian König

Am 23.01.24 um 11:10 schrieb Paul Cercueil:

Hi Christian,

Le lundi 22 janvier 2024 à 14:41 +0100, Christian König a écrit :

Am 22.01.24 um 12:01 schrieb Paul Cercueil:

Hi Christian,

Le lundi 22 janvier 2024 à 11:35 +0100, Christian König a écrit :

Am 19.01.24 um 15:13 schrieb Paul Cercueil:

These functions should be used by device drivers when they
start
and
stop accessing the data of DMABUF. It allows DMABUF importers
to
cache
the dma_buf_attachment while ensuring that the data they want
to
access
is available for their device when the DMA transfers take
place.

As Daniel already noted as well this is a complete no-go from the
DMA-buf design point of view.

What do you mean "as Daniel already noted"? It was him who
suggested
this.

Sorry, I haven't fully catched up to the discussion then.

In general DMA-buf is build around the idea that the data can be
accessed coherently by the involved devices.

Having a begin/end of access for devices was brought up multiple
times
but so far rejected for good reasons.

I would argue that if it was brought up multiple times, then there are
also good reasons to support such a mechanism.


That an exporter has to call extra functions to access his own
buffers
is a complete no-go for the design since this forces exporters into
doing extra steps for allowing importers to access their data.

Then what about we add these dma_buf_{begin,end}_access(), with only
implementations for "dumb" exporters e.g. udmabuf or the dmabuf heaps?
And only importers (who cache the mapping and actually care about non-
coherency) would have to call these.


No, the problem is still that you would have to change all importers to 
mandatory use dma_buf_begin/end.


But going a step back caching the mapping is irrelevant for coherency. 
Even if you don't cache the mapping you don't get coherency.


In other words exporters are not require to call sync_to_cpu or 
sync_to_device when you create a mapping.


What exactly is your use case here? And why does coherency matters?


At the very least, is there a way to check that "the data can be
accessed coherently by the involved devices"? So that my importer can
EPERM if there is no coherency vs. a device that's already attached.


Yeah, there is functionality for this in the DMA subsystem. I've once 
created prototype patches for enforcing the same coherency approach 
between importer and exporter, but we never got around to upstream them.






Cheers,
-Paul


That in turn is pretty much un-testable unless you have every
possible
importer around while testing the exporter.

Regards,
Christian.


Regards,
Christian.

Cheers,
-Paul


Signed-off-by: Paul Cercueil 

---
v5: New patch
---
    drivers/dma-buf/dma-buf.c | 66
+++
    include/linux/dma-buf.h   | 37 ++
    2 files changed, 103 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-
buf.c
index 8fe5aa67b167..a8bab6c18fcd 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -830,6 +830,8 @@ static struct sg_table *
__map_dma_buf(struct
dma_buf_attachment *attach,
     * - dma_buf_mmap()
     * - dma_buf_begin_cpu_access()
     * - dma_buf_end_cpu_access()
+ * - dma_buf_begin_access()
+ * - dma_buf_end_access()
     * - dma_buf_map_attachment_unlocked()
     * - dma_buf_unmap_attachment_unlocked()
     * - dma_buf_vmap_unlocked()
@@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct
dma_buf
*dmabuf, struct iosys_map *map)
    }
    EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);

+/**

+ * @dma_buf_begin_access - Call before any hardware access
from/to
the DMABUF
+ * @attach:[in]attachment used for hardware access
+ * @sg_table:  [in]scatterlist used for the DMA transfer
+ * @direction:  [in]    direction of DMA transfer
+ */
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+struct sg_table *sgt, enum
dma_data_direction dir)
+{
+   struct dma_buf *dmabuf;
+   bool cookie;
+   int ret;
+
+   if (WARN_ON(!attach))
+   return -EINVAL;
+
+   dmabuf = attach->dmabuf;
+
+   if (!dmabuf->ops->begin_access)
+   return 0;
+
+   cookie = dma_fence_begin_signalling();
+   ret = dmabuf->ops->begin_access(attach, sgt, dir);
+   dma_fence_end_signalling(cookie);
+
+   if (WARN_ON_ONCE(ret))
+   return ret;
+
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
+
+/**
+ * @dma_buf_end_access - Call after any hardware access
from/to
the DMABUF
+ * @attach:[in]attachment used for hardware access
+ * @sg_table:  [in]scatterlist used for the DMA transfer
+ * @direction:  [in]    direction of DMA transfer
+ */
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+      struct sg_table *sgt, enum
dma_data_direction dir)
+{
+   struct dma_buf *dmabuf;
+   bool cookie;
+   int 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-23 Thread Paul Cercueil
Hi Christian,

Le lundi 22 janvier 2024 à 14:41 +0100, Christian König a écrit :
> Am 22.01.24 um 12:01 schrieb Paul Cercueil:
> > Hi Christian,
> > 
> > Le lundi 22 janvier 2024 à 11:35 +0100, Christian König a écrit :
> > > Am 19.01.24 um 15:13 schrieb Paul Cercueil:
> > > > These functions should be used by device drivers when they
> > > > start
> > > > and
> > > > stop accessing the data of DMABUF. It allows DMABUF importers
> > > > to
> > > > cache
> > > > the dma_buf_attachment while ensuring that the data they want
> > > > to
> > > > access
> > > > is available for their device when the DMA transfers take
> > > > place.
> > > As Daniel already noted as well this is a complete no-go from the
> > > DMA-buf design point of view.
> > What do you mean "as Daniel already noted"? It was him who
> > suggested
> > this.
> 
> Sorry, I haven't fully catched up to the discussion then.
> 
> In general DMA-buf is build around the idea that the data can be 
> accessed coherently by the involved devices.
> 
> Having a begin/end of access for devices was brought up multiple
> times 
> but so far rejected for good reasons.

I would argue that if it was brought up multiple times, then there are
also good reasons to support such a mechanism.

> That an exporter has to call extra functions to access his own
> buffers 
> is a complete no-go for the design since this forces exporters into 
> doing extra steps for allowing importers to access their data.

Then what about we add these dma_buf_{begin,end}_access(), with only
implementations for "dumb" exporters e.g. udmabuf or the dmabuf heaps?
And only importers (who cache the mapping and actually care about non-
coherency) would have to call these.

At the very least, is there a way to check that "the data can be
accessed coherently by the involved devices"? So that my importer can
EPERM if there is no coherency vs. a device that's already attached.

Cheers,
-Paul

> That in turn is pretty much un-testable unless you have every
> possible 
> importer around while testing the exporter.
> 
> Regards,
> Christian.
> 
> > 
> > > Regards,
> > > Christian.
> > Cheers,
> > -Paul
> > 
> > > > Signed-off-by: Paul Cercueil 
> > > > 
> > > > ---
> > > > v5: New patch
> > > > ---
> > > >    drivers/dma-buf/dma-buf.c | 66
> > > > +++
> > > >    include/linux/dma-buf.h   | 37 ++
> > > >    2 files changed, 103 insertions(+)
> > > > 
> > > > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-
> > > > buf.c
> > > > index 8fe5aa67b167..a8bab6c18fcd 100644
> > > > --- a/drivers/dma-buf/dma-buf.c
> > > > +++ b/drivers/dma-buf/dma-buf.c
> > > > @@ -830,6 +830,8 @@ static struct sg_table *
> > > > __map_dma_buf(struct
> > > > dma_buf_attachment *attach,
> > > >     * - dma_buf_mmap()
> > > >     * - dma_buf_begin_cpu_access()
> > > >     * - dma_buf_end_cpu_access()
> > > > + * - dma_buf_begin_access()
> > > > + * - dma_buf_end_access()
> > > >     * - dma_buf_map_attachment_unlocked()
> > > >     * - dma_buf_unmap_attachment_unlocked()
> > > >     * - dma_buf_vmap_unlocked()
> > > > @@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct
> > > > dma_buf
> > > > *dmabuf, struct iosys_map *map)
> > > >    }
> > > >    EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
> > > >    
> > > > +/**
> > > > + * @dma_buf_begin_access - Call before any hardware access
> > > > from/to
> > > > the DMABUF
> > > > + * @attach:[in]attachment used for hardware access
> > > > + * @sg_table:  [in]scatterlist used for the DMA transfer
> > > > + * @direction:  [in]    direction of DMA transfer
> > > > + */
> > > > +int dma_buf_begin_access(struct dma_buf_attachment *attach,
> > > > +struct sg_table *sgt, enum
> > > > dma_data_direction dir)
> > > > +{
> > > > +   struct dma_buf *dmabuf;
> > > > +   bool cookie;
> > > > +   int ret;
> > > > +
> > > > +   if (WARN_ON(!attach))
> > > > +   return -EINVAL;
> > > > +
> > > > +   dmabuf = attach->dmabuf;
> > > > +
> > > > +   if (!dmabuf->ops->begin_access)
> > > > +   return 0;
> > > > +
> > > > +   cookie = dma_fence_begin_signalling();
> > > > +   ret = dmabuf->ops->begin_access(attach, sgt, dir);
> > > > +   dma_fence_end_signalling(cookie);
> > > > +
> > > > +   if (WARN_ON_ONCE(ret))
> > > > +   return ret;
> > > > +
> > > > +   return 0;
> > > > +}
> > > > +EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
> > > > +
> > > > +/**
> > > > + * @dma_buf_end_access - Call after any hardware access
> > > > from/to
> > > > the DMABUF
> > > > + * @attach:[in]attachment used for hardware access
> > > > + * @sg_table:  [in]scatterlist used for the DMA transfer
> > > > + * @direction:  [in]    direction of DMA transfer
> > > > + */
> > > > +int dma_buf_end_access(struct dma_buf_attachment *attach,
> > > > +      struct 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-22 Thread Christian König

Am 22.01.24 um 12:01 schrieb Paul Cercueil:

Hi Christian,

Le lundi 22 janvier 2024 à 11:35 +0100, Christian König a écrit :

Am 19.01.24 um 15:13 schrieb Paul Cercueil:

These functions should be used by device drivers when they start
and
stop accessing the data of DMABUF. It allows DMABUF importers to
cache
the dma_buf_attachment while ensuring that the data they want to
access
is available for their device when the DMA transfers take place.

As Daniel already noted as well this is a complete no-go from the
DMA-buf design point of view.

What do you mean "as Daniel already noted"? It was him who suggested
this.


Sorry, I haven't fully catched up to the discussion then.

In general DMA-buf is build around the idea that the data can be 
accessed coherently by the involved devices.


Having a begin/end of access for devices was brought up multiple times 
but so far rejected for good reasons.


That an exporter has to call extra functions to access his own buffers 
is a complete no-go for the design since this forces exporters into 
doing extra steps for allowing importers to access their data.


That in turn is pretty much un-testable unless you have every possible 
importer around while testing the exporter.


Regards,
Christian.




Regards,
Christian.

Cheers,
-Paul


Signed-off-by: Paul Cercueil 

---
v5: New patch
---
   drivers/dma-buf/dma-buf.c | 66
+++
   include/linux/dma-buf.h   | 37 ++
   2 files changed, 103 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..a8bab6c18fcd 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct
dma_buf_attachment *attach,
    * - dma_buf_mmap()
    * - dma_buf_begin_cpu_access()
    * - dma_buf_end_cpu_access()
+ * - dma_buf_begin_access()
+ * - dma_buf_end_access()
    * - dma_buf_map_attachment_unlocked()
    * - dma_buf_unmap_attachment_unlocked()
    * - dma_buf_vmap_unlocked()
@@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf
*dmabuf, struct iosys_map *map)
   }
   EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
   
+/**

+ * @dma_buf_begin_access - Call before any hardware access from/to
the DMABUF
+ * @attach:[in]attachment used for hardware access
+ * @sg_table:  [in]scatterlist used for the DMA transfer
+ * @direction:  [in]    direction of DMA transfer
+ */
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+struct sg_table *sgt, enum
dma_data_direction dir)
+{
+   struct dma_buf *dmabuf;
+   bool cookie;
+   int ret;
+
+   if (WARN_ON(!attach))
+   return -EINVAL;
+
+   dmabuf = attach->dmabuf;
+
+   if (!dmabuf->ops->begin_access)
+   return 0;
+
+   cookie = dma_fence_begin_signalling();
+   ret = dmabuf->ops->begin_access(attach, sgt, dir);
+   dma_fence_end_signalling(cookie);
+
+   if (WARN_ON_ONCE(ret))
+   return ret;
+
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
+
+/**
+ * @dma_buf_end_access - Call after any hardware access from/to
the DMABUF
+ * @attach:[in]attachment used for hardware access
+ * @sg_table:  [in]scatterlist used for the DMA transfer
+ * @direction:  [in]    direction of DMA transfer
+ */
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+      struct sg_table *sgt, enum
dma_data_direction dir)
+{
+   struct dma_buf *dmabuf;
+   bool cookie;
+   int ret;
+
+   if (WARN_ON(!attach))
+   return -EINVAL;
+
+   dmabuf = attach->dmabuf;
+
+   if (!dmabuf->ops->end_access)
+   return 0;
+
+   cookie = dma_fence_begin_signalling();
+   ret = dmabuf->ops->end_access(attach, sgt, dir);
+   dma_fence_end_signalling(cookie);
+
+   if (WARN_ON_ONCE(ret))
+   return ret;
+
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
+
   #ifdef CONFIG_DEBUG_FS
   static int dma_buf_debug_show(struct seq_file *s, void *unused)
   {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8ff4add71f88..8ba612c7cc16 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -246,6 +246,38 @@ struct dma_buf_ops {
     */
    int (*end_cpu_access)(struct dma_buf *, enum
dma_data_direction);
   
+	/**

+* @begin_access:
+*
+* This is called from dma_buf_begin_access() when a
device driver
+* wants to access the data of the DMABUF. The exporter
can use this
+* to flush/sync the caches if needed.
+*
+* This callback is optional.
+*
+* Returns:
+*
+* 0 on success or a negative error code on failure.
+*/
+   int (*begin_access)(struct dma_buf_attachment *, struct
sg_table *,
+       

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-22 Thread Paul Cercueil
Hi Christian,

Le lundi 22 janvier 2024 à 11:35 +0100, Christian König a écrit :
> Am 19.01.24 um 15:13 schrieb Paul Cercueil:
> > These functions should be used by device drivers when they start
> > and
> > stop accessing the data of DMABUF. It allows DMABUF importers to
> > cache
> > the dma_buf_attachment while ensuring that the data they want to
> > access
> > is available for their device when the DMA transfers take place.
> 
> As Daniel already noted as well this is a complete no-go from the 
> DMA-buf design point of view.

What do you mean "as Daniel already noted"? It was him who suggested
this.

> 
> Regards,
> Christian.

Cheers,
-Paul

> 
> > 
> > Signed-off-by: Paul Cercueil 
> > 
> > ---
> > v5: New patch
> > ---
> >   drivers/dma-buf/dma-buf.c | 66
> > +++
> >   include/linux/dma-buf.h   | 37 ++
> >   2 files changed, 103 insertions(+)
> > 
> > diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> > index 8fe5aa67b167..a8bab6c18fcd 100644
> > --- a/drivers/dma-buf/dma-buf.c
> > +++ b/drivers/dma-buf/dma-buf.c
> > @@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct
> > dma_buf_attachment *attach,
> >    * - dma_buf_mmap()
> >    * - dma_buf_begin_cpu_access()
> >    * - dma_buf_end_cpu_access()
> > + * - dma_buf_begin_access()
> > + * - dma_buf_end_access()
> >    * - dma_buf_map_attachment_unlocked()
> >    * - dma_buf_unmap_attachment_unlocked()
> >    * - dma_buf_vmap_unlocked()
> > @@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf
> > *dmabuf, struct iosys_map *map)
> >   }
> >   EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
> >   
> > +/**
> > + * @dma_buf_begin_access - Call before any hardware access from/to
> > the DMABUF
> > + * @attach:[in]attachment used for hardware access
> > + * @sg_table:  [in]scatterlist used for the DMA transfer
> > + * @direction:  [in]    direction of DMA transfer
> > + */
> > +int dma_buf_begin_access(struct dma_buf_attachment *attach,
> > +struct sg_table *sgt, enum
> > dma_data_direction dir)
> > +{
> > +   struct dma_buf *dmabuf;
> > +   bool cookie;
> > +   int ret;
> > +
> > +   if (WARN_ON(!attach))
> > +   return -EINVAL;
> > +
> > +   dmabuf = attach->dmabuf;
> > +
> > +   if (!dmabuf->ops->begin_access)
> > +   return 0;
> > +
> > +   cookie = dma_fence_begin_signalling();
> > +   ret = dmabuf->ops->begin_access(attach, sgt, dir);
> > +   dma_fence_end_signalling(cookie);
> > +
> > +   if (WARN_ON_ONCE(ret))
> > +   return ret;
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
> > +
> > +/**
> > + * @dma_buf_end_access - Call after any hardware access from/to
> > the DMABUF
> > + * @attach:[in]attachment used for hardware access
> > + * @sg_table:  [in]scatterlist used for the DMA transfer
> > + * @direction:  [in]    direction of DMA transfer
> > + */
> > +int dma_buf_end_access(struct dma_buf_attachment *attach,
> > +      struct sg_table *sgt, enum
> > dma_data_direction dir)
> > +{
> > +   struct dma_buf *dmabuf;
> > +   bool cookie;
> > +   int ret;
> > +
> > +   if (WARN_ON(!attach))
> > +   return -EINVAL;
> > +
> > +   dmabuf = attach->dmabuf;
> > +
> > +   if (!dmabuf->ops->end_access)
> > +   return 0;
> > +
> > +   cookie = dma_fence_begin_signalling();
> > +   ret = dmabuf->ops->end_access(attach, sgt, dir);
> > +   dma_fence_end_signalling(cookie);
> > +
> > +   if (WARN_ON_ONCE(ret))
> > +   return ret;
> > +
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
> > +
> >   #ifdef CONFIG_DEBUG_FS
> >   static int dma_buf_debug_show(struct seq_file *s, void *unused)
> >   {
> > diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> > index 8ff4add71f88..8ba612c7cc16 100644
> > --- a/include/linux/dma-buf.h
> > +++ b/include/linux/dma-buf.h
> > @@ -246,6 +246,38 @@ struct dma_buf_ops {
> >      */
> >     int (*end_cpu_access)(struct dma_buf *, enum
> > dma_data_direction);
> >   
> > +   /**
> > +* @begin_access:
> > +*
> > +* This is called from dma_buf_begin_access() when a
> > device driver
> > +* wants to access the data of the DMABUF. The exporter
> > can use this
> > +* to flush/sync the caches if needed.
> > +*
> > +* This callback is optional.
> > +*
> > +* Returns:
> > +*
> > +* 0 on success or a negative error code on failure.
> > +*/
> > +   int (*begin_access)(struct dma_buf_attachment *, struct
> > sg_table *,
> > +       enum dma_data_direction);
> > +
> > +   /**
> > +* @end_access:
> > +*
> > +* This is called from dma_buf_end_access() when a device
> > driver is
> > +* done accessing the data of the DMABUF. The exporter can
> > use this
> > +* to flush/sync the caches if needed.
> > +*
> > +* 

Re: [Linaro-mm-sig] [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-22 Thread Christian König

Am 19.01.24 um 15:13 schrieb Paul Cercueil:

These functions should be used by device drivers when they start and
stop accessing the data of DMABUF. It allows DMABUF importers to cache
the dma_buf_attachment while ensuring that the data they want to access
is available for their device when the DMA transfers take place.


As Daniel already noted as well this is a complete no-go from the 
DMA-buf design point of view.


Regards,
Christian.



Signed-off-by: Paul Cercueil 

---
v5: New patch
---
  drivers/dma-buf/dma-buf.c | 66 +++
  include/linux/dma-buf.h   | 37 ++
  2 files changed, 103 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..a8bab6c18fcd 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct 
dma_buf_attachment *attach,
   * - dma_buf_mmap()
   * - dma_buf_begin_cpu_access()
   * - dma_buf_end_cpu_access()
+ * - dma_buf_begin_access()
+ * - dma_buf_end_access()
   * - dma_buf_map_attachment_unlocked()
   * - dma_buf_unmap_attachment_unlocked()
   * - dma_buf_vmap_unlocked()
@@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, 
struct iosys_map *map)
  }
  EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
  
+/**

+ * @dma_buf_begin_access - Call before any hardware access from/to the DMABUF
+ * @attach:[in]attachment used for hardware access
+ * @sg_table:  [in]scatterlist used for the DMA transfer
+ * @direction:  [in]direction of DMA transfer
+ */
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+struct sg_table *sgt, enum dma_data_direction dir)
+{
+   struct dma_buf *dmabuf;
+   bool cookie;
+   int ret;
+
+   if (WARN_ON(!attach))
+   return -EINVAL;
+
+   dmabuf = attach->dmabuf;
+
+   if (!dmabuf->ops->begin_access)
+   return 0;
+
+   cookie = dma_fence_begin_signalling();
+   ret = dmabuf->ops->begin_access(attach, sgt, dir);
+   dma_fence_end_signalling(cookie);
+
+   if (WARN_ON_ONCE(ret))
+   return ret;
+
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
+
+/**
+ * @dma_buf_end_access - Call after any hardware access from/to the DMABUF
+ * @attach:[in]attachment used for hardware access
+ * @sg_table:  [in]scatterlist used for the DMA transfer
+ * @direction:  [in]direction of DMA transfer
+ */
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+  struct sg_table *sgt, enum dma_data_direction dir)
+{
+   struct dma_buf *dmabuf;
+   bool cookie;
+   int ret;
+
+   if (WARN_ON(!attach))
+   return -EINVAL;
+
+   dmabuf = attach->dmabuf;
+
+   if (!dmabuf->ops->end_access)
+   return 0;
+
+   cookie = dma_fence_begin_signalling();
+   ret = dmabuf->ops->end_access(attach, sgt, dir);
+   dma_fence_end_signalling(cookie);
+
+   if (WARN_ON_ONCE(ret))
+   return ret;
+
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
+
  #ifdef CONFIG_DEBUG_FS
  static int dma_buf_debug_show(struct seq_file *s, void *unused)
  {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8ff4add71f88..8ba612c7cc16 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -246,6 +246,38 @@ struct dma_buf_ops {
 */
int (*end_cpu_access)(struct dma_buf *, enum dma_data_direction);
  
+	/**

+* @begin_access:
+*
+* This is called from dma_buf_begin_access() when a device driver
+* wants to access the data of the DMABUF. The exporter can use this
+* to flush/sync the caches if needed.
+*
+* This callback is optional.
+*
+* Returns:
+*
+* 0 on success or a negative error code on failure.
+*/
+   int (*begin_access)(struct dma_buf_attachment *, struct sg_table *,
+   enum dma_data_direction);
+
+   /**
+* @end_access:
+*
+* This is called from dma_buf_end_access() when a device driver is
+* done accessing the data of the DMABUF. The exporter can use this
+* to flush/sync the caches if needed.
+*
+* This callback is optional.
+*
+* Returns:
+*
+* 0 on success or a negative error code on failure.
+*/
+   int (*end_access)(struct dma_buf_attachment *, struct sg_table *,
+ enum dma_data_direction);
+
/**
 * @mmap:
 *
@@ -606,6 +638,11 @@ void dma_buf_detach(struct dma_buf *dmabuf,
  int dma_buf_pin(struct dma_buf_attachment *attach);
  void dma_buf_unpin(struct dma_buf_attachment *attach);
  
+int dma_buf_begin_access(struct dma_buf_attachment *attach,

+   

Re: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-20 Thread kernel test robot
Hi Paul,

kernel test robot noticed the following build warnings:

[auto build test WARNING on usb/usb-testing]
[also build test WARNING on usb/usb-next usb/usb-linus drm-misc/drm-misc-next 
lwn/docs-next linus/master v6.7 next-20240119]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:
https://github.com/intel-lab-lkp/linux/commits/Paul-Cercueil/dma-buf-Add-dma_buf_-begin-end-_access/20240119-221604
base:   https://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb.git 
usb-testing
patch link:
https://lore.kernel.org/r/20240119141402.44262-2-paul%40crapouillou.net
patch subject: [PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()
config: arm-randconfig-001-20240120 
(https://download.01.org/0day-ci/archive/20240121/202401210406.yygvcac1-...@intel.com/config)
compiler: clang version 18.0.0git (https://github.com/llvm/llvm-project 
d92ce344bf641e6bb025b41b3f1a77dd25e2b3e9)
reproduce (this is a W=1 build): 
(https://download.01.org/0day-ci/archive/20240121/202401210406.yygvcac1-...@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot 
| Closes: 
https://lore.kernel.org/oe-kbuild-all/202401210406.yygvcac1-...@intel.com/

All warnings (new ones prefixed by >>):

>> drivers/dma-buf/dma-buf.c:1608: warning: Cannot understand  * 
>> @dma_buf_begin_access - Call before any hardware access from/to the DMABUF
on line 1608 - I thought it was a doc line
>> drivers/dma-buf/dma-buf.c:1640: warning: Cannot understand  * 
>> @dma_buf_end_access - Call after any hardware access from/to the DMABUF
on line 1640 - I thought it was a doc line


vim +1608 drivers/dma-buf/dma-buf.c

  1606  
  1607  /**
> 1608   * @dma_buf_begin_access - Call before any hardware access from/to the 
> DMABUF
  1609   * @attach: [in]attachment used for hardware access
  1610   * @sg_table:   [in]scatterlist used for the DMA transfer
  1611   * @direction:  [in]direction of DMA transfer
  1612   */
  1613  int dma_buf_begin_access(struct dma_buf_attachment *attach,
  1614   struct sg_table *sgt, enum dma_data_direction 
dir)
  1615  {
  1616  struct dma_buf *dmabuf;
  1617  bool cookie;
  1618  int ret;
  1619  
  1620  if (WARN_ON(!attach))
  1621  return -EINVAL;
  1622  
  1623  dmabuf = attach->dmabuf;
  1624  
  1625  if (!dmabuf->ops->begin_access)
  1626  return 0;
  1627  
  1628  cookie = dma_fence_begin_signalling();
  1629  ret = dmabuf->ops->begin_access(attach, sgt, dir);
  1630  dma_fence_end_signalling(cookie);
  1631  
  1632  if (WARN_ON_ONCE(ret))
  1633  return ret;
  1634  
  1635  return 0;
  1636  }
  1637  EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
  1638  
  1639  /**
> 1640   * @dma_buf_end_access - Call after any hardware access from/to the 
> DMABUF
  1641   * @attach: [in]attachment used for hardware access
  1642   * @sg_table:   [in]scatterlist used for the DMA transfer
  1643   * @direction:  [in]direction of DMA transfer
  1644   */
  1645  int dma_buf_end_access(struct dma_buf_attachment *attach,
  1646 struct sg_table *sgt, enum dma_data_direction 
dir)
  1647  {
  1648  struct dma_buf *dmabuf;
  1649  bool cookie;
  1650  int ret;
  1651  
  1652  if (WARN_ON(!attach))
  1653  return -EINVAL;
  1654  
  1655  dmabuf = attach->dmabuf;
  1656  
  1657  if (!dmabuf->ops->end_access)
  1658  return 0;
  1659  
  1660  cookie = dma_fence_begin_signalling();
  1661  ret = dmabuf->ops->end_access(attach, sgt, dir);
  1662  dma_fence_end_signalling(cookie);
  1663  
  1664  if (WARN_ON_ONCE(ret))
  1665  return ret;
  1666  
  1667  return 0;
  1668  }
  1669  EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
  1670  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


[PATCH v5 1/6] dma-buf: Add dma_buf_{begin,end}_access()

2024-01-19 Thread Paul Cercueil
These functions should be used by device drivers when they start and
stop accessing the data of DMABUF. It allows DMABUF importers to cache
the dma_buf_attachment while ensuring that the data they want to access
is available for their device when the DMA transfers take place.

Signed-off-by: Paul Cercueil 

---
v5: New patch
---
 drivers/dma-buf/dma-buf.c | 66 +++
 include/linux/dma-buf.h   | 37 ++
 2 files changed, 103 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 8fe5aa67b167..a8bab6c18fcd 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -830,6 +830,8 @@ static struct sg_table * __map_dma_buf(struct 
dma_buf_attachment *attach,
  * - dma_buf_mmap()
  * - dma_buf_begin_cpu_access()
  * - dma_buf_end_cpu_access()
+ * - dma_buf_begin_access()
+ * - dma_buf_end_access()
  * - dma_buf_map_attachment_unlocked()
  * - dma_buf_unmap_attachment_unlocked()
  * - dma_buf_vmap_unlocked()
@@ -1602,6 +1604,70 @@ void dma_buf_vunmap_unlocked(struct dma_buf *dmabuf, 
struct iosys_map *map)
 }
 EXPORT_SYMBOL_NS_GPL(dma_buf_vunmap_unlocked, DMA_BUF);
 
+/**
+ * @dma_buf_begin_access - Call before any hardware access from/to the DMABUF
+ * @attach:[in]attachment used for hardware access
+ * @sg_table:  [in]scatterlist used for the DMA transfer
+ * @direction:  [in]direction of DMA transfer
+ */
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+struct sg_table *sgt, enum dma_data_direction dir)
+{
+   struct dma_buf *dmabuf;
+   bool cookie;
+   int ret;
+
+   if (WARN_ON(!attach))
+   return -EINVAL;
+
+   dmabuf = attach->dmabuf;
+
+   if (!dmabuf->ops->begin_access)
+   return 0;
+
+   cookie = dma_fence_begin_signalling();
+   ret = dmabuf->ops->begin_access(attach, sgt, dir);
+   dma_fence_end_signalling(cookie);
+
+   if (WARN_ON_ONCE(ret))
+   return ret;
+
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_begin_access, DMA_BUF);
+
+/**
+ * @dma_buf_end_access - Call after any hardware access from/to the DMABUF
+ * @attach:[in]attachment used for hardware access
+ * @sg_table:  [in]scatterlist used for the DMA transfer
+ * @direction:  [in]direction of DMA transfer
+ */
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+  struct sg_table *sgt, enum dma_data_direction dir)
+{
+   struct dma_buf *dmabuf;
+   bool cookie;
+   int ret;
+
+   if (WARN_ON(!attach))
+   return -EINVAL;
+
+   dmabuf = attach->dmabuf;
+
+   if (!dmabuf->ops->end_access)
+   return 0;
+
+   cookie = dma_fence_begin_signalling();
+   ret = dmabuf->ops->end_access(attach, sgt, dir);
+   dma_fence_end_signalling(cookie);
+
+   if (WARN_ON_ONCE(ret))
+   return ret;
+
+   return 0;
+}
+EXPORT_SYMBOL_NS_GPL(dma_buf_end_access, DMA_BUF);
+
 #ifdef CONFIG_DEBUG_FS
 static int dma_buf_debug_show(struct seq_file *s, void *unused)
 {
diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
index 8ff4add71f88..8ba612c7cc16 100644
--- a/include/linux/dma-buf.h
+++ b/include/linux/dma-buf.h
@@ -246,6 +246,38 @@ struct dma_buf_ops {
 */
int (*end_cpu_access)(struct dma_buf *, enum dma_data_direction);
 
+   /**
+* @begin_access:
+*
+* This is called from dma_buf_begin_access() when a device driver
+* wants to access the data of the DMABUF. The exporter can use this
+* to flush/sync the caches if needed.
+*
+* This callback is optional.
+*
+* Returns:
+*
+* 0 on success or a negative error code on failure.
+*/
+   int (*begin_access)(struct dma_buf_attachment *, struct sg_table *,
+   enum dma_data_direction);
+
+   /**
+* @end_access:
+*
+* This is called from dma_buf_end_access() when a device driver is
+* done accessing the data of the DMABUF. The exporter can use this
+* to flush/sync the caches if needed.
+*
+* This callback is optional.
+*
+* Returns:
+*
+* 0 on success or a negative error code on failure.
+*/
+   int (*end_access)(struct dma_buf_attachment *, struct sg_table *,
+ enum dma_data_direction);
+
/**
 * @mmap:
 *
@@ -606,6 +638,11 @@ void dma_buf_detach(struct dma_buf *dmabuf,
 int dma_buf_pin(struct dma_buf_attachment *attach);
 void dma_buf_unpin(struct dma_buf_attachment *attach);
 
+int dma_buf_begin_access(struct dma_buf_attachment *attach,
+struct sg_table *sgt, enum dma_data_direction dir);
+int dma_buf_end_access(struct dma_buf_attachment *attach,
+  struct sg_table *sgt, enum