Re: noveau vs arm dma ops

2018-04-26 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
> > get_required_mask() is supposed to tell you if you are safe.  However
> > we are missing lots of implementations of it for iommus so you might get
> > some false negatives, improvements welcome.  It's been on my list of
> > things to fix in the DMA API, but it is nowhere near the top.
> 
> It hasn't come up in a while in some fireworks, so I honestly don't
> remember exactly what the issues have been. But
> 
> commit d766ef53006c2c38a7fe2bef0904105a793383f2
> Author: Chris Wilson 
> Date:   Mon Dec 19 12:43:45 2016 +
> 
> drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping
> 
> and the various bits of code that a
> 
> $ git grep SWIOTLB -- drivers/gpu
> 
> turns up is what we're doing to hack around that stuff. And in general
> (there's some exceptions) gpus should be able to address everything,
> so I never fully understood where that's even coming from.

I'm pretty sure I've seen some oddly low dma masks in GPU drivers.  E.g.
duplicated in various AMD files:

    adev->need_dma32 = false;
    dma_bits = adev->need_dma32 ? 32 : 40;
    r = pci_set_dma_mask(adev->pdev, DMA_BIT_MASK(dma_bits));
    if (r) {
        adev->need_dma32 = true;
        dma_bits = 32;
        dev_warn(adev->dev, "amdgpu: No suitable DMA available.\n");
    }

synopsys:

drivers/gpu/drm/bridge/synopsys/dw-hdmi-i2s-audio.c:pdevinfo.dma_mask = DMA_BIT_MASK(32);
drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:  pdevinfo.dma_mask = DMA_BIT_MASK(32);
drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:  pdevinfo.dma_mask = DMA_BIT_MASK(32);

etnaviv gets it right:

drivers/gpu/drm/etnaviv/etnaviv_gpu.c:  u32 dma_mask = (u32)dma_get_required_mask(gpu->dev);
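
For illustration, a driver that wants to avoid hardcoding a low mask could
consult the platform first, along these lines (a rough sketch with made-up
foo_* names, not code from any of the drivers above; the 40-bit limit is
just an assumed hardware cap):

    #include <linux/dma-mapping.h>

    static int foo_setup_dma_mask(struct device *dev)
    {
        /* mask that would be needed to address all memory directly */
        u64 required = dma_get_required_mask(dev);
        /* assumed addressing limit of the hypothetical hardware */
        u64 hw_mask = DMA_BIT_MASK(40);
        int ret;

        if (required > hw_mask)
            dev_info(dev,
                     "cannot address all memory directly, expect IOMMU/bounce buffering\n");

        ret = dma_set_mask_and_coherent(dev, hw_mask);
        if (ret)    /* fall back to the lowest common denominator */
            ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
        return ret;
    }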


But yes, the swiotlb hackery really irks me.  I just have some more
important and bigger fires to fight first, but I plan to get back to the
root cause of that eventually.

> 
> >> - dma api hides the cache flushing requirements from us. GPUs love
> >>   non-snooped access, and worse give userspace control over that. We want
> >>   a strict separation between mapping stuff and flushing stuff. With the
> >>   IOMMU api we mostly have the former, but for the latter arch maintainers
> >>   regularly tell us they won't allow that. So we have drm_clflush.c.
> >
> > The problem is that an entirely separate cache flushing API is hard. That
> > being said if you look at my generic dma-noncoherent API series it tries
> > to move that way.  So far it is in early stages and apparently rather
> > buggy unfortunately.
> 
> I'm assuming this stuff here?
> 
> https://lkml.org/lkml/2018/4/20/146
> 
> Anyway got lost in all that work a bit, looks really nice.

That url doesn't seem to work currently.  But I am talking about the
thread titled '[RFC] common non-cache coherent direct dma mapping ops'

> Yeah the above is pretty much what we do on x86. dma-api believes
> everything is coherent, so dma_map_sg does the mapping we want and
> nothing else (minus swiotlb fun). Cache flushing, allocations, all
> done by the driver.

Which sounds like the right thing to do to me.

> On arm that doesn't work. The iommu api seems like a good fit, except
> the dma-api tends to get in the way a bit (drm/msm apparently has
> similar problems like tegra), and if you need contiguous memory
> dma_alloc_coherent is the only way to get at contiguous memory. There
> was a huge discussion years ago about that, and direct cma access was
> shot down because it would have exposed too much of the caching
> attribute mangling required (most arm platforms need wc-pages to not
> be in the kernel's linear map apparently).

Simple cma_alloc() doesn't do anything about cache handling; it
is just a very dumb allocator for large contiguous regions inside
a big pool.

I'm not the CMA maintainer, but in general I'd love to see an
EXPORT_SYMBOL_GPL slapped onto cma_alloc/release and have drivers use
that where needed.  Using that plus dma_map*/dma_unmap* sounds like
a much saner interface than dma_alloc_attrs + DMA_ATTR_NON_CONSISTENT
or DMA_ATTR_NO_KERNEL_MAPPING.
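
Roughly, that split would look like the following (a sketch only:
cma_alloc()/cma_release() are not exported to modules today, their exact
signatures have shifted between kernel versions, and the cma pool pointer
is assumed to come from the platform):

    /* grab a large physically contiguous chunk straight from a CMA pool */
    struct page *pages = cma_alloc(cma, nr_pages, align_order, GFP_KERNEL);
    dma_addr_t dma;

    if (!pages)
        return -ENOMEM;

    /* map it for the device only when it is actually handed to the device */
    dma = dma_map_page(dev, pages, 0, nr_pages * PAGE_SIZE, DMA_BIDIRECTIONAL);
    if (dma_mapping_error(dev, dma)) {
        cma_release(cma, pages, nr_pages);
        return -ENOMEM;
    }

    /* ... device uses the buffer ... */

    dma_unmap_page(dev, dma, nr_pages * PAGE_SIZE, DMA_BIDIRECTIONAL);
    cma_release(cma, pages, nr_pages);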

You don't happen to have a pointer to that previous discussion?

> Anything that separates these 3 things more (allocation pools, mapping
> through IOMMUs and flushing cpu caches) sounds like the right
> direction to me. Even if that throws some portability across platforms
> away - drivers who want to control things in this much detail aren't
> really portable (without some serious work) anyway.

As long as we stay away from the dma coherent allocations, separating
them is fine, and I'm working towards it.  dma coherent allocations, on
the other hand, are very hairy beasts, provided by very different
implementations depending on the architecture, or often even depending
on the specifics of individual implementations inside the architecture.

So for your GPU/media case it seems like you'd better 

Re: noveau vs arm dma ops

2018-04-26 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 11:54:43PM +0100, Russell King - ARM Linux wrote:
> 
> if the memory was previously dirty (iow, CPU has written), you need to
> flush the dirty cache lines _before_ the GPU writes happen, but you
> don't know whether the CPU has speculatively prefetched, so you need
> to flush any prefetched cache lines before reading from the cacheable
> memory _after_ the GPU has finished writing.
> 
> Also note that "flush" there can be "clean the cache", "clean and
> invalidate the cache" or "invalidate the cache" as appropriate - some
> CPUs are able to perform those three operations, and the appropriate
> one depends on not only where in the above sequence it's being used,
> but also on what the operations are.

Agreed on all these counts.

> If we can agree a set of interfaces that allows _proper_ use of these
> facilities, one which can be used appropriately, then there shouldn't
> be a problem.  The DMA API does that via its ideas about who owns a
> particular buffer (because of the above problem) and that's something
> which would need to be carried over to such a cache flushing API (it
> should be pretty obvious that having a GPU read or write memory while
> the cache for that memory is being cleaned will lead to unexpected
> results.)

I've been trying to come up with such an interface, for now only for
internal use in a generic set of noncoherent ops.  The API is basically
a variant of the existing dma_sync_single_for_device/cpu calls:

http://git.infradead.org/users/hch/misc.git/commitdiff/044dae5f94509288f4655de2f22cb0bea85778af

Review welcome!
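
For context, the existing streaming calls that this is modelled on encode
the ownership handoff roughly like this (illustrative only; this is not the
API in the commit above, and dev/buf/size are assumed to come from the
surrounding driver code):

    /* map once; after mapping, the device owns the buffer */
    dma_addr_t addr = dma_map_single(dev, buf, size, DMA_BIDIRECTIONAL);

    /* ... device DMA runs ... */

    /* hand ownership back to the CPU before looking at the data */
    dma_sync_single_for_cpu(dev, addr, size, DMA_BIDIRECTIONAL);
    /* ... CPU reads or updates the buffer ... */

    /* give the buffer back to the device for the next transfer */
    dma_sync_single_for_device(dev, addr, size, DMA_BIDIRECTIONAL);

    dma_unmap_single(dev, addr, size, DMA_BIDIRECTIONAL);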

> The next issue, which I've brought up before, is what comes with
> exposing cache flushing to userspace on architectures where it isn't
> already exposed.  As has been shown by Google Project Zero, this risks
> exposing those architectures to Spectre and Meltdown exploits where
> they weren't at such a risk before.  (I've pretty much shown here that
> you _do_ need to control which cache lines get flushed to make these
> exploits work, and flushing the cache by reading lots of data in lieu
> of having the ability to explicitly flush bits of cache makes it very
> difficult to impossible for them to work.)

Extending dma coherence to userspace is going to be the next major
nightmare indeed.  I'm not sure how much of that is actually still
going on in the graphics world, but we'll need a coherent (pun intended)
plan for how to deal with it.


Re: noveau vs arm dma ops

2018-04-26 Thread Russell King - ARM Linux
(While there's a rain shower...)

On Thu, Apr 26, 2018 at 02:09:42AM -0700, Christoph Hellwig wrote:
> synopsys:
> 
> drivers/gpu/drm/bridge/synopsys/dw-hdmi-i2s-audio.c:pdevinfo.dma_mask = DMA_BIT_MASK(32);
> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:  pdevinfo.dma_mask = DMA_BIT_MASK(32);

This is for the AHB audio driver, and is a correct default on two counts:

1. It is historically usual to initialise DMA masks to 32-bit, and leave
   it to the device drivers to negotiate via the DMA mask functions if
   they wish to use higher orders of bits.

2. The AHB audio hardware in the HDMI block only supports 32-bit addresses.

What I've missed from the AHB audio driver is calling the DMA mask
functions... oops.  Patch below.

> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:  pdevinfo.dma_mask = DMA_BIT_MASK(32);

This is for the I2S audio driver, and I suspect it is wrong - I doubt
that the I2S sub-device itself does any DMA whatsoever.

8<===
From: Russell King 
Subject: drm: bridge: dw-hdmi: Negotiate dma mask with DMA API

DMA drivers are supposed to negotiate the DMA mask with the DMA API,
but this was missing from the AHB audio driver.  Add the necessary
call.

Signed-off-by: Russell King 
---
 drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c b/drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c
index cf3f0caf9c63..16c45b6cd6af 100644
--- a/drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c
+++ b/drivers/gpu/drm/bridge/synopsys/dw-hdmi-ahb-audio.c
@@ -539,6 +539,10 @@ static int snd_dw_hdmi_probe(struct platform_device *pdev)
 	unsigned revision;
 	int ret;
 
+	ret = dma_set_mask_and_coherent(dev, DMA_BIT_MASK(32));
+	if (ret)
+		return ret;
+
 	writeb_relaxed(HDMI_IH_MUTE_AHBDMAAUD_STAT0_ALL,
 		       data->base + HDMI_IH_MUTE_AHBDMAAUD_STAT0);
 	revision = readb_relaxed(data->base + HDMI_REVISION_ID);


-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: noveau vs arm dma ops

2018-04-26 Thread Daniel Vetter
On Thu, Apr 26, 2018 at 11:09 AM, Christoph Hellwig  wrote:
> On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
>> > get_required_mask() is supposed to tell you if you are safe.  However
>> > we are missing lots of implementations of it for iommus so you might get
>> > some false negatives, improvements welcome.  It's been on my list of
>> > things to fix in the DMA API, but it is nowhere near the top.
>>
>> It hasn't come up in a while in some fireworks, so I honestly don't
>> remember exactly what the issues have been. But
>>
>> commit d766ef53006c2c38a7fe2bef0904105a793383f2
>> Author: Chris Wilson 
>> Date:   Mon Dec 19 12:43:45 2016 +
>>
>> drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping
>>
>> and the various bits of code that a
>>
>> $ git grep SWIOTLB -- drivers/gpu
>>
>> turns up is what we're doing to hack around that stuff. And in general
>> (there's some exceptions) gpus should be able to address everything,
>> so I never fully understood where that's even coming from.
>
> I'm pretty sure I've seen some oddly low dma masks in GPU drivers.  E.g.
> duplicated in various AMD files:
>
> adev->need_dma32 = false;
> dma_bits = adev->need_dma32 ? 32 : 40;
> r = pci_set_dma_mask(adev->pdev, DMA_BIT_MASK(dma_bits));
> if (r) {
> adev->need_dma32 = true;
> dma_bits = 32;
> dev_warn(adev->dev, "amdgpu: No suitable DMA available.\n");
> }
>
> synopsys:
>
> drivers/gpu/drm/bridge/synopsys/dw-hdmi-i2s-audio.c:pdevinfo.dma_mask = DMA_BIT_MASK(32);
> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:  pdevinfo.dma_mask = DMA_BIT_MASK(32);
> drivers/gpu/drm/bridge/synopsys/dw-hdmi.c:  pdevinfo.dma_mask = DMA_BIT_MASK(32);
>
> etnaviv gets it right:
>
> drivers/gpu/drm/etnaviv/etnaviv_gpu.c:  u32 dma_mask = (u32)dma_get_required_mask(gpu->dev);
>
>
> But yes, the swiotlb hackery really irks me.  I just have some more
> important and bigger fires to fight first, but I plan to get back to the
> root cause of that eventually.
>
>>
>> >> - dma api hides the cache flushing requirements from us. GPUs love
>> >>   non-snooped access, and worse give userspace control over that. We want
>> >>   a strict separation between mapping stuff and flushing stuff. With the
>> >>   IOMMU api we mostly have the former, but for the latter arch maintainers
>> >>   regularly tell us they won't allow that. So we have drm_clflush.c.
>> >
>> > The problem is that an entirely separate cache flushing API is hard. That
>> > being said if you look at my generic dma-noncoherent API series it tries
>> > to move that way.  So far it is in early stages and apparently rather
>> > buggy unfortunately.
>>
>> I'm assuming this stuff here?
>>
>> https://lkml.org/lkml/2018/4/20/146
>>
>> Anyway got lost in all that work a bit, looks really nice.
>
> That url doesn't seem to work currently.  But I am talking about the
> thread titled '[RFC] common non-cache coherent direct dma mapping ops'
>
>> Yeah the above is pretty much what we do on x86. dma-api believes
>> everything is coherent, so dma_map_sg does the mapping we want and
>> nothing else (minus swiotlb fun). Cache flushing, allocations, all
>> done by the driver.
>
> Which sounds like the right thing to do to me.
>
>> On arm that doesn't work. The iommu api seems like a good fit, except
>> the dma-api tends to get in the way a bit (drm/msm apparently has
>> similar problems like tegra), and if you need contiguous memory
>> dma_alloc_coherent is the only way to get at contiguous memory. There
>> was a huge discussion years ago about that, and direct cma access was
>> shot down because it would have exposed too much of the caching
>> attribute mangling required (most arm platforms need wc-pages to not
>> be in the kernel's linear map apparently).
>
> Simple cma_alloc() doesn't do anything about cache handling; it
> is just a very dumb allocator for large contiguous regions inside
> a big pool.
>
> I'm not the CMA maintainer, but in general I'd love to see an
> EXPORT_SYMBOL_GPL slapped onto cma_alloc/release and have drivers use
> that where needed.  Using that plus dma_map*/dma_unmap* sounds like
> a much saner interface than dma_alloc_attrs + DMA_ATTR_NON_CONSISTENT
> or DMA_ATTR_NO_KERNEL_MAPPING.
>
> You don't happen to have a pointer to that previous discussion?

I'll try to dig them up; I tried to stay as far away from that
discussion as possible (since I have the luxury of not caring for intel
gpus).

>> Anything that separates these 3 things more (allocation pools, mapping
>> through IOMMUs and flushing cpu caches) sounds like the right
>> direction to me. Even if that throws some portability across platforms
>> away - drivers who want to control things in this much detail aren't
>> really portable (without some serious work) anyway.
>
> As long as we stay away from the dma coherent 

Re: noveau vs arm dma ops

2018-04-26 Thread Daniel Vetter
On Thu, Apr 26, 2018 at 1:26 AM, Russell King - ARM Linux  wrote:
> On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
>> On arm that doesn't work. The iommu api seems like a good fit, except
>> the dma-api tends to get in the way a bit (drm/msm apparently has
>> similar problems like tegra), and if you need contiguous memory
>> dma_alloc_coherent is the only way to get at contiguous memory. There
>> was a huge discussion years ago about that, and direct cma access was
>> shot down because it would have exposed too much of the caching
>> attribute mangling required (most arm platforms need wc-pages to not
>> be in the kernel's linear map apparently).
>
> I think you completely misunderstand ARM from what you've written above,
> and this worries me greatly about giving DRM the level of control that
> is being asked for.
>
> Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache
> attributes are stored in the page tables.  These caches are inherently
> non-aliasing when there are multiple mappings (which is a great step
> forward compared to the previous aliasing caches.)
>
> As the cache attributes are stored in the page tables, this in theory
> allows different virtual mappings of the same physical memory to have
> different cache attributes.  However, there's a problem, and that's
> called speculative prefetching.
>
> Let's say you have one mapping which is cacheable, and another that is
> marked as write combining.  If a cache line is speculatively prefetched
> through the cacheable mapping of this memory, and then you read the
> same physical location through the write combining mapping, it is
> possible that you could read cached data.
>
> So, it is generally accepted that all mappings of any particular
> physical bit of memory should have the same cache attributes to avoid
> unpredictable behaviour.
>
> This presents a problem with what is generally called "lowmem" where
> the memory is mapped in kernel virtual space with cacheable
> attributes.  It can also happen with highmem if the memory is
> kmapped.
>
> This is why, on ARM, you can't use something like get_free_pages() to
> grab some pages from the system, pass them to the GPU, map them into
> userspace as write-combining, etc.  It _might_ work for some CPUs,
> but ARM CPUs vary in how much prefetching they do, and what may work
> for one particular CPU is in no way guaranteed to work for another
> ARM CPU.
>
> The official line from architecture folk is to assume that the caches
> infinitely speculate, are of infinite size, and can writeback *dirty*
> data at any moment.
>
> The way to stop things like speculative prefetches to particular
> physical memory is to, quite "simply", not have any cacheable
> mappings of that physical memory anywhere in the system.
>
> Now, cache flushes on ARM tend to be fairly expensive for GPU buffers.
> If you have, say, an 8MB buffer (for a 1080p frame) and you need to
> do a cache operation on that buffer, you'll be iterating over it
> 32 or maybe 64 bytes at a time "just in case" there's a cache line
> present.  Referring to my previous email, where I detailed the
> potential need for _two_ flushes, one before the GPU operation and
> one after, and this becomes _really_ expensive.  At that point, you're
> probably way better off using write-combine memory where you don't
> need to spend CPU cycles performing cache flushing - potentially
> across all CPUs in the system if cache operations aren't broadcasted.
>
> This isn't a simple matter of "just provide some APIs for cache
> operations" - there's much more that needs to be understood by
> all parties here, especially when we have GPU drivers that can be
> used with quite different CPUs.
>
> It may well be that for some combinations of CPUs and workloads, it's
> better to use write-combine memory without cache flushing, but for
> other CPUs that tradeoff (for the same workload) could well be
> different.
>
> Older ARMs get more interesting, because they have aliasing caches.
> That means the CPU cache aliases across different virtual space
> mappings in some way, which complicates (a) the mapping of memory
> and (b) handling the cache operations on it.
>
> It's too late for me to go into that tonight, and I probably won't
> be reading mail for the next week and a half, sorry.

I didn't know all the details well enough (nor did I have the time to
write a few paragraphs like you did), but the above is what I had in
mind and meant. Sorry if my sloppy reply sounded like I'm mixing stuff
up.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch


Re: noveau vs arm dma ops

2018-04-25 Thread Russell King - ARM Linux
On Wed, Apr 25, 2018 at 11:35:13PM +0200, Daniel Vetter wrote:
> On arm that doesn't work. The iommu api seems like a good fit, except
> the dma-api tends to get in the way a bit (drm/msm apparently has
> similar problems like tegra), and if you need contiguous memory
> dma_alloc_coherent is the only way to get at contiguous memory. There
> was a huge discussion years ago about that, and direct cma access was
> shot down because it would have exposed too much of the caching
> attribute mangling required (most arm platforms need wc-pages to not
> be in the kernel's linear map apparently).

I think you completely misunderstand ARM from what you've written above,
and this worries me greatly about giving DRM the level of control that
is being asked for.

Modern ARMs have a PIPT cache or a non-aliasing VIPT cache, and cache
attributes are stored in the page tables.  These caches are inherently
non-aliasing when there are multiple mappings (which is a great step
forward compared to the previous aliasing caches.)

As the cache attributes are stored in the page tables, this in theory
allows different virtual mappings of the same physical memory to have
different cache attributes.  However, there's a problem, and that's
called speculative prefetching.

Let's say you have one mapping which is cacheable, and another that is
marked as write combining.  If a cache line is speculatively prefetched
through the cacheable mapping of this memory, and then you read the
same physical location through the write combining mapping, it is
possible that you could read cached data.

So, it is generally accepted that all mappings of any particular
physical bit of memory should have the same cache attributes to avoid
unpredictable behaviour.

This presents a problem with what is generally called "lowmem" where
the memory is mapped in kernel virtual space with cacheable
attributes.  It can also happen with highmem if the memory is
kmapped.

This is why, on ARM, you can't use something like get_free_pages() to
grab some pages from the system, pass them to the GPU, map them into
userspace as write-combining, etc.  It _might_ work for some CPUs,
but ARM CPUs vary in how much prefetching they do, and what may work
for one particular CPU is in no way guaranteed to work for another
ARM CPU.

The official line from architecture folk is to assume that the caches
infinitely speculate, are of infinite size, and can writeback *dirty*
data at any moment.

The way to stop things like speculative prefetches to particular
physical memory is to, quite "simply", not have any cacheable
mappings of that physical memory anywhere in the system.

Now, cache flushes on ARM tend to be fairly expensive for GPU buffers.
If you have, say, an 8MB buffer (for a 1080p frame) and you need to
do a cache operation on that buffer, you'll be iterating over it
32 or maybe 64 bytes at a time "just in case" there's a cache line
present.  Referring to my previous email, where I detailed the
potential need for _two_ flushes, one before the GPU operation and
one after, and this becomes _really_ expensive.  At that point, you're
probably way better off using write-combine memory where you don't
need to spend CPU cycles performing cache flushing - potentially
across all CPUs in the system if cache operations aren't broadcasted.

This isn't a simple matter of "just provide some APIs for cache
operations" - there's much more that needs to be understood by
all parties here, especially when we have GPU drivers that can be
used with quite different CPUs.

It may well be that for some combinations of CPUs and workloads, it's
better to use write-combine memory without cache flushing, but for
other CPUs that tradeoff (for the same workload) could well be
different.

Older ARMs get more interesting, because they have aliasing caches.
That means the CPU cache aliases across different virtual space
mappings in some way, which complicates (a) the mapping of memory
and (b) handling the cache operations on it.

It's too late for me to go into that tonight, and I probably won't
be reading mail for the next week and a half, sorry.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: noveau vs arm dma ops

2018-04-25 Thread Russell King - ARM Linux
On Wed, Apr 25, 2018 at 08:33:12AM -0700, Christoph Hellwig wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
> > - dma api hides the cache flushing requirements from us. GPUs love
> >   non-snooped access, and worse give userspace control over that. We want
> >   a strict separation between mapping stuff and flushing stuff. With the
> >   IOMMU api we mostly have the former, but for the latter arch maintainers
> >   regularly tell us they won't allow that. So we have drm_clflush.c.
> 
> The problem is that an entirely separate cache flushing API is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way.  So far it is in early stages and apparently rather
> buggy unfortunately.

And if folk want a cacheable mapping with explicit cache flushing, the
cache flushing must not be defined in terms of "this is what the CPU seems
to need" but from the point of view of a CPU with infinite prefetching,
infinite caching and infinite capacity to perform writebacks of dirty
cache lines at unexpected moments when the memory is mapped in a
cacheable mapping.

(The reason for that is you're operating in a non-CPU specific space,
so you can't make any guarantees as to how much caching or prefetching
will occur by the CPU - different CPUs will do different amounts.)

So, for example, the sequence:

    GPU writes to memory
    CPU reads from cacheable memory

if the memory was previously dirty (iow, CPU has written), you need to
flush the dirty cache lines _before_ the GPU writes happen, but you
don't know whether the CPU has speculatively prefetched, so you need
to flush any prefetched cache lines before reading from the cacheable
memory _after_ the GPU has finished writing.

Also note that "flush" there can be "clean the cache", "clean and
invalidate the cache" or "invalidate the cache" as appropriate - some
CPUs are able to perform those three operations, and the appropriate
one depends on not only where in the above sequence it's being used,
but also on what the operations are.

So, the above sequence could be:

    CPU invalidates cache for memory
        (due to possible dirty cache lines)
    GPU writes to memory
    CPU invalidates cache for memory
        (to get rid of any speculatively prefetched lines)
    CPU reads from cacheable memory

Yes, in the above case, _two_ cache operations are required to ensure
correct behaviour.  However, if you know for certain that the memory was
previously clean, then the first cache operation can be skipped.
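
In terms of the existing streaming DMA API, that sequence corresponds
roughly to the following sketch (dev, handle and size are assumed to come
from the surrounding driver code, and the direction assumes the mapping
was created for DMA_FROM_DEVICE):

    /* clean/invalidate possibly dirty lines before the GPU writes */
    dma_sync_single_for_device(dev, handle, size, DMA_FROM_DEVICE);

    /* ... GPU writes to the buffer ... */

    /* drop anything speculatively prefetched before the CPU reads */
    dma_sync_single_for_cpu(dev, handle, size, DMA_FROM_DEVICE);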

What I'm pointing out is that there's much more than just "I want to
flush the cache" here, which is what DRM currently seems to assume with
the code in drm_cache.c.

If we can agree a set of interfaces that allows _proper_ use of these
facilities, one which can be used appropriately, then there shouldn't
be a problem.  The DMA API does that via its ideas about who owns a
particular buffer (because of the above problem) and that's something
which would need to be carried over to such a cache flushing API (it
should be pretty obvious that having a GPU read or write memory while
the cache for that memory is being cleaned will lead to unexpected
results.)

Also note that things get even more interesting in a SMP environment
if cache operations aren't broadcasted across the SMP cluster (which
means cache operations have to be IPI'd to other CPUs.)

The next issue, which I've brought up before, is what comes with
exposing cache flushing to userspace on architectures where it isn't
already exposed.  As has been shown by Google Project Zero, this risks
exposing those architectures to Spectre and Meltdown exploits where they
weren't at such a risk before.  (I've pretty much shown here that you
_do_ need to control which cache lines get flushed to make these
exploits work, and flushing the cache by reading lots of data in lieu of
having the ability to explicitly flush bits of cache makes it very
difficult to impossible for them to work.)

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: noveau vs arm dma ops

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 5:33 PM, Christoph Hellwig  wrote:
> On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
>> > Coordinating the backport of a trivial helper in the arm tree is not
>> > the end of the world.  Really, this cowboy attitude is a good reason
>> > why graphics folks have such a bad rep.  You keep poking into random
>> > kernel internals, don't talk to anyone and then complain if people
>> > are upset.  This shouldn't be surprising.
>>
>> Not really agreeing on the cowboy thing. The fundamental problem is that
>> the dma api provides an abstraction that seriously gets in the way of writing
>> a gpu driver. Some examples:
>
> So talk to other people.  Maybe people share your frustration.  Or maybe
> other people have a way to help.
>
>> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>>   hacks to fall back to a cache of pages allocated using
>>   dma_alloc_coherent if you build a kernel with bounce buffers.
>
> get_required_mask() is supposed to tell you if you are safe.  However
> we are missing lots of implementations of it for iommus so you might get
> some false negatives, improvements welcome.  It's been on my list of
> things to fix in the DMA API, but it is nowhere near the top.

It hasn't come up in a while in some fireworks, so I honestly don't
remember exactly what the issues have been. But

commit d766ef53006c2c38a7fe2bef0904105a793383f2
Author: Chris Wilson 
Date:   Mon Dec 19 12:43:45 2016 +

drm/i915: Fallback to single PAGE_SIZE segments for DMA remapping

and the various bits of code that a

$ git grep SWIOTLB -- drivers/gpu

turns up is what we're doing to hack around that stuff. And in general
(there's some exceptions) gpus should be able to address everything,
so I never fully understood where that's even coming from.

>> - dma api hides the cache flushing requirements from us. GPUs love
>>   non-snooped access, and worse give userspace control over that. We want
>>   a strict separation between mapping stuff and flushing stuff. With the
>>   IOMMU api we mostly have the former, but for the latter arch maintainers
>>   regularly tell us they won't allow that. So we have drm_clflush.c.
>
> The problem is that an entirely separate cache flushing API is hard. That
> being said if you look at my generic dma-noncoherent API series it tries
> to move that way.  So far it is in early stages and apparently rather
> buggy unfortunately.

I'm assuming this stuff here?

https://lkml.org/lkml/2018/4/20/146

Anyway got lost in all that work a bit, looks really nice.

>> - dma api hides how/where memory is allocated. Kinda similar problem,
>>   except now for CMA or address limits. So either we roll our own
>>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>>   allocations because that's the only way. Which sucks, because we can't
>>   directly tell CMA how to back off if there's some way to make CMA memory
>>   available through other means (gpus love to hog all of memory, so we
>>   have shrinkers and everything).
>
> If you really care about doing explicit cache flushing anyway (see
> above), allocating your own memory and mapping it where needed is by
> far the superior solution.  On cache coherent architectures
> dma_alloc_coherent is nothing but allocate memory + dma_map_single.
> On non coherent allocations the memory might come through a special
> pool or must be used through a special virtual address mapping that
> is set up either statically or dynamically.  For that case splitting
> allocation and mapping is a good idea in many ways, and I plan to move
> towards that once the number of dma mapping implementations is down
> to a reasonable number so that it can actually be done.

Yeah the above is pretty much what we do on x86. dma-api believes
everything is coherent, so dma_map_sg does the mapping we want and
nothing else (minus swiotlb fun). Cache flushing, allocations, all
done by the driver.

On arm that doesn't work. The iommu api seems like a good fit, except
the dma-api tends to get in the way a bit (drm/msm apparently has
similar problems like tegra), and if you need contiguous memory
dma_alloc_coherent is the only way to get at contiguous memory. There
was a huge discussion years ago about that, and direct cma access was
shot down because it would have exposed too much of the caching
attribute mangling required (most arm platforms need wc-pages to not
be in the kernel's linear map apparently).

Anything that separates these 3 things more (allocation pools, mapping
through IOMMUs and flushing cpu caches) sounds like the right
direction to me. Even if that throws some portability across platforms
away - drivers who want to control things in this much detail aren't
really portable (without some serious work) anyway.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 

Re: noveau vs arm dma ops

2018-04-25 Thread Christoph Hellwig
On Wed, Apr 25, 2018 at 12:04:29PM +0200, Daniel Vetter wrote:
> > Coordinating the backport of a trivial helper in the arm tree is not
> > the end of the world.  Really, this cowboy attitude is a good reason
> > why graphics folks have such a bad rep.  You keep poking into random
> kernel internals, don't talk to anyone and then complain if people
> > are upset.  This shouldn't be surprising.
> 
> Not really agreeing on the cowboy thing. The fundamental problem is that
> the dma api provides an abstraction that seriously gets in the way of writing
> a gpu driver. Some examples:

So talk to other people.  Maybe people share your frustration.  Or maybe
other people have a way to help.

> - We never want bounce buffers, ever. dma_map_sg gives us that, so there's
>   hacks to fall back to a cache of pages allocated using
>   dma_alloc_coherent if you build a kernel with bounce buffers.

get_required_mask() is supposed to tell you if you are safe.  However
we are missing lots of implementations of it for iommus so you might get
some false negatives, improvements welcome.  It's been on my list of
things to fix in the DMA API, but it is nowhere near the top.

> - dma api hides the cache flushing requirements from us. GPUs love
>   non-snooped access, and worse give userspace control over that. We want
>   a strict separation between mapping stuff and flushing stuff. With the
>   IOMMU api we mostly have the former, but for the latter arch maintainers
>   regularly tell us they won't allow that. So we have drm_clflush.c.

The problem is that an entirely separate cache flushing API is hard. That
being said if you look at my generic dma-noncoherent API series it tries
to move that way.  So far it is in early stages and apparently rather
buggy unfortunately.

> - dma api hides how/where memory is allocated. Kinda similar problem,
>   except now for CMA or address limits. So either we roll our own
>   allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
>   we use dma_alloc_coherent and then grab the sgt to get at the CMA
>   allocations because that's the only way. Which sucks, because we can't
>   directly tell CMA how to back off if there's some way to make CMA memory
>   available through other means (gpus love to hog all of memory, so we
>   have shrinkers and everything).

If you really care about doing explicit cache flushing anyway (see
above), allocating your own memory and mapping it where needed is by
far the superior solution.  On cache coherent architectures
dma_alloc_coherent is nothing but allocate memory + dma_map_single.
On non coherent allocations the memory might come through a special
pool or must be used through a special virtual address mapping that
is set up either statically or dynamically.  For that case splitting
allocation and mapping is a good idea in many ways, and I plan to move
towards that once the number of dma mapping implementations is down
to a reasonable number so that it can actually be done.
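
A minimal sketch of that split on a cache coherent architecture
(hypothetical helper, not from any in-tree driver):

    #include <linux/dma-mapping.h>
    #include <linux/mm.h>

    static void *foo_alloc_buffer(struct device *dev, size_t size,
                                  dma_addr_t *dma)
    {
        /* plain kernel memory, no dma_alloc_coherent involved */
        void *cpu = (void *)__get_free_pages(GFP_KERNEL, get_order(size));

        if (!cpu)
            return NULL;

        /* map it for the device only where it is actually needed */
        *dma = dma_map_single(dev, cpu, size, DMA_BIDIRECTIONAL);
        if (dma_mapping_error(dev, *dma)) {
            free_pages((unsigned long)cpu, get_order(size));
            return NULL;
        }
        return cpu;
    }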


Re: noveau vs arm dma ops

2018-04-25 Thread Russell King - ARM Linux
On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
>  and linux-arm-kernel lists, but wasn't:
>  https://www.spinics.net/lists/dri-devel/msg173630.html]
> 
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code.  Drivers have no business poking
> > > into these details.
> > 
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.

EXPORT_SYMBOL* says nothing about whether a driver should
be able to use the symbol or not - it merely means that the symbol is
made available to a module.  Modules cover much more than just device
drivers - we have library modules, filesystem modules, helper modules
to name a few non-driver classes of modules.

We also have symbols that are exported as part of the architecture
implementation detail of a public interface.  For example, the
public interface "copy_from_user" is implemented as an inline
function (actually several layers of inline functions) eventually
calling into arm_copy_from_user().  arm_copy_from_user() is exported,
but no driver (in fact no module) is allowed to make direct reference
to arm_copy_from_user() - it'd fail when software PAN is enabled.

The whole idea that "if a symbol is exported, it's fine for a driver
to use it" is a complete work of fiction, always has been, and always
will be.

We've had this with the architecture implementation details of the
DMA API before, and with the architecture implementation details of
the CPU cache flushing.  There's only so much commentary, or so many __
prefixes, you can add to a symbol before things get ridiculous, and
none of it stops people creating this abuse.  The only thing that
seems to prevent it is to make life hard for folk wanting to use
the symbols (eg, hiding the symbol prototype in a private header,
etc.)

Never, ever go under the covers of an interface.  If the interface
doesn't do what you want, _discuss_ it, don't just think "oh, that
architecture private facility looks like what I need, I'll use that
directly."

If you ever are on the side of trying to maintain those implementation
details that are abused in this way, you'll soon understand why this
behaviour by driver authors is so annoying, and the maintainability
problems it creates.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up


Re: noveau vs arm dma ops

2018-04-25 Thread Daniel Vetter
On Wed, Apr 25, 2018 at 01:54:39AM -0700, Christoph Hellwig wrote:
> [discussion about this patch, which should have been cced to the iommu
>  and linux-arm-kernel lists, but wasn't:
>  https://www.spinics.net/lists/dri-devel/msg173630.html]
> 
> On Wed, Apr 25, 2018 at 09:41:51AM +0200, Thierry Reding wrote:
> > > API from the iommu/dma-mapping code.  Drivers have no business poking
> > > into these details.
> > 
> > The interfaces that the above patch uses are all EXPORT_SYMBOL_GPL,
> > which is rather misleading if they are not meant to be used by drivers
> > directly.
> 
> The only reason the DMA ops are exported is because get_arch_dma_ops
> references them (or, in the case of the coherent ones, used to reference
> them).  We don't let drivers assign random dma ops.
> 
> > 
> > > Thierry, please resend this with at least the iommu list and
> > > linux-arm-kernel in Cc to have a proper discussion on the right API.
> > 
> > I'm certainly open to help with finding a correct solution, but the
> > patch above was purposefully terse because this is something that I
> > hope we can get backported to v4.16 to unbreak Nouveau. Coordinating
> > such a backport between ARM and DRM trees does not sound like something
> > that would help getting this fixed in v4.16.
> 
> Coordinating the backport of a trivial helper in the arm tree is not
> the end of the world.  Really, this cowboy attitude is a good reason
> why graphics folks have such a bad rep.  You keep poking into random
> kernel internals, don't talk to anyone and then complain if people
> are upset.  This shouldn't be surprising.

Not really agreeing on the cowboy thing. The fundamental problem is that
the dma api provides an abstraction that seriously gets in the way of writing
a gpu driver. Some examples:

- We never want bounce buffers, ever. dma_map_sg gives us that, so there's
  hacks to fall back to a cache of pages allocated using
  dma_alloc_coherent if you build a kernel with bounce buffers.

- dma api hides the cache flushing requirements from us. GPUs love
  non-snooped access, and worse give userspace control over that. We want
  a strict separation between mapping stuff and flushing stuff. With the
  IOMMU api we mostly have the former, but for the latter arch maintainers
  regularly tell us they won't allow that. So we have drm_clflush.c.

- dma api hides how/where memory is allocated. Kinda similar problem,
  except now for CMA or address limits. So either we roll our own
  allocators and then dma_map_sg (and pray it doesn't bounce buffer), or
  we use dma_alloc_coherent and then grab the sgt to get at the CMA
  allocations (roughly as sketched after this list) because that's the
  only way. Which sucks, because we can't
  directly tell CMA how to back off if there's some way to make CMA memory
  available through other means (gpus love to hog all of memory, so we
  have shrinkers and everything).
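
The dma_alloc_coherent-plus-sgt dance referred to above looks roughly like
this (sketch only, error handling trimmed; dev and size come from the
surrounding driver code):

    void *vaddr;
    dma_addr_t handle;
    struct sg_table sgt;

    /* today the only portable way to get at CMA-backed memory */
    vaddr = dma_alloc_coherent(dev, size, &handle, GFP_KERNEL);

    /* recover an sg_table so the pages can be handed to the rest of the driver */
    dma_get_sgtable(dev, &sgt, vaddr, handle, size);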

For display drivers the dma api mostly works, until you start sharing
buffers with other devices.

So from our perspective it fairly often looks like core folks just don't
want to support gpu use-cases, so we play a bit more cowboy and get things
done some other way. Since this has been going on for years now, we often
don't even bother to start a discussion first, since it has tended to go
nowhere useful.
-Daniel

> > Granted, this issue could've been caught with a little more testing, but
> > in retrospect I think it would've been a lot better if ARM_DMA_USE_IOMMU
> > was just enabled unconditionally if it has side-effects that platforms
> > don't opt in to but have to explicitly opt out of.
> 
> Agreed on that count.  Please send a patch.

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch