Re: [PATCH 0/3] drm: panfrost: Coherency support

2020-09-17 Thread Alyssa Rosenzweig
> The DDK blob has the ability to mark only certain areas of memory as
> coherent for performance reasons. For simple things like kmscube I would
> expect that it's basically write-only from the CPU and almost all memory the
> GPU touches isn't touched by the CPU. I.e. coherency isn't helping and the
> coherency traffic is probably expensive. Whether the complexity is worth it
> for "real" content I don't know - it may just be silly benchmarks that
> benefit.

Right, Panfrost userspace specifically assumes GPU reads to be expensive
and treats GPU memory as write-only *except* for a few special cases
(compute-like workloads, glReadPixels, some blits, etc).

The vast majority of the GPU memory - everything used in kmscube - will be
write-only to the CPU and fed directly into the display zero-copy (or
read by the GPU later as a dmabuf).


signature.asc
Description: PGP signature
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 0/3] drm: panfrost: Coherency support

2020-09-17 Thread Robin Murphy

On 2020-09-16 18:46, Rob Herring wrote:

On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig
 wrote:



So I get a performance regression with the dma-coherent approach, even if it's
clearly the cleaner.


That's bizarre -- this should really be the faster of the two.


Coherency may not be free. CortexA9 had something like 4x slower
memcpy if SMP was enabled as an example. I don't know if there's
anything going on like that specifically here. If there's never any
CPU accesses mixed in with kmscube, then there would be no benefit to
coherency.


There will still be CPU benefits in terms of not having to spend time 
cache-cleaning every BO upon allocation, and less overhead on writing 
out descriptors in the first place (due to cacheable vs. non-cacheable).


I haven't tried the NSh hack on Juno, but I don't see any notable 
performance issue as-is - kmscube hits a solid 60FPS from the off (now 
that it works without spewing faults). Given that the hardware on Juno 
can be generously described as "less good", it would certainly be 
interesting to figure out what difference is at play here...


The usual argument against I/O coherency is that it adds latency to 
every access, but if you already have a coherent interconnect anyway 
then the sensible answer to that is implementing decent snoop filters, 
rather than making software more complicated.


Robin.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 0/3] drm: panfrost: Coherency support

2020-09-17 Thread Steven Price

On 17/09/2020 11:51, Tomeu Vizoso wrote:

On 9/17/20 12:38 PM, Steven Price wrote:

On 16/09/2020 18:46, Rob Herring wrote:

On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig
 wrote:


So I get a performance regression with the dma-coherent approach, 
even if it's

clearly the cleaner.


That's bizarre -- this should really be the faster of the two.


Coherency may not be free. CortexA9 had something like 4x slower
memcpy if SMP was enabled as an example. I don't know if there's
anything going on like that specifically here. If there's never any
CPU accesses mixed in with kmscube, then there would be no benefit to
coherency.


The DDK blob has the ability to mark only certain areas of memory as 
coherent for performance reasons. For simple things like kmscube I 
would expect that it's basically write-only from the CPU and almost 
all memory the GPU touches isn't touched by the CPU. I.e. coherency 
isn't helping and the coherency traffic is probably expensive. Whether 
the complexity is worth it for "real" content I don't know - it may 
just be silly benchmarks that benefit.


Or maybe it's only a problem for applications that do silly things? I 
don't think kmscube was ever optimized for performance.


Well doing silly things is almost the definition of a benchmark ;) A lot 
of the mobile graphics benchmarks suffer from not being very intelligent 
in how they render (e.g. many have meshes that are far too detailed so 
the triangles are smaller than the pixels).


Of course there are also applications that get things wrong too.

Steve
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 0/3] drm: panfrost: Coherency support

2020-09-17 Thread Tomeu Vizoso

On 9/17/20 12:38 PM, Steven Price wrote:

On 16/09/2020 18:46, Rob Herring wrote:

On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig
 wrote:


So I get a performance regression with the dma-coherent approach, 
even if it's

clearly the cleaner.


That's bizarre -- this should really be the faster of the two.


Coherency may not be free. CortexA9 had something like 4x slower
memcpy if SMP was enabled as an example. I don't know if there's
anything going on like that specifically here. If there's never any
CPU accesses mixed in with kmscube, then there would be no benefit to
coherency.


The DDK blob has the ability to mark only certain areas of memory as 
coherent for performance reasons. For simple things like kmscube I would 
expect that it's basically write-only from the CPU and almost all memory 
the GPU touches isn't touched by the CPU. I.e. coherency isn't helping 
and the coherency traffic is probably expensive. Whether the complexity 
is worth it for "real" content I don't know - it may just be silly 
benchmarks that benefit.


Or maybe it's only a problem for applications that do silly things? I 
don't think kmscube was ever optimized for performance.


Regards,

Tomeu
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 0/3] drm: panfrost: Coherency support

2020-09-17 Thread Steven Price

On 16/09/2020 18:46, Rob Herring wrote:

On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig
 wrote:



So I get a performance regression with the dma-coherent approach, even if it's
clearly the cleaner.


That's bizarre -- this should really be the faster of the two.


Coherency may not be free. CortexA9 had something like 4x slower
memcpy if SMP was enabled as an example. I don't know if there's
anything going on like that specifically here. If there's never any
CPU accesses mixed in with kmscube, then there would be no benefit to
coherency.


The DDK blob has the ability to mark only certain areas of memory as 
coherent for performance reasons. For simple things like kmscube I would 
expect that it's basically write-only from the CPU and almost all memory 
the GPU touches isn't touched by the CPU. I.e. coherency isn't helping 
and the coherency traffic is probably expensive. Whether the complexity 
is worth it for "real" content I don't know - it may just be silly 
benchmarks that benefit.


Steve
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 0/3] drm: panfrost: Coherency support

2020-09-17 Thread Alyssa Rosenzweig
> So I get a performance regression with the dma-coherent approach, even if it's
> clearly the cleaner.

That's bizarre -- this should really be the faster of the two.


signature.asc
Description: PGP signature
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 0/3] drm: panfrost: Coherency support

2020-09-16 Thread Rob Herring
On Wed, Sep 16, 2020 at 11:04 AM Alyssa Rosenzweig
 wrote:
>
> > So I get a performance regression with the dma-coherent approach, even if 
> > it's
> > clearly the cleaner.
>
> That's bizarre -- this should really be the faster of the two.

Coherency may not be free. CortexA9 had something like 4x slower
memcpy if SMP was enabled as an example. I don't know if there's
anything going on like that specifically here. If there's never any
CPU accesses mixed in with kmscube, then there would be no benefit to
coherency.

Rob
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 0/3] drm: panfrost: Coherency support

2020-09-16 Thread Neil Armstrong
Hi Robin,

On 16/09/2020 01:51, Robin Murphy wrote:
> Hi all,
> 
> I polished up my original proof-of-concept a little while back, but now
> that I've got my hands on my Juno again I've been able to actually test
> it to my satisfaction, so here are proper patches!

I tested on the Kkadas VIM3, and yes it fixes the random FAULTS I have 
*without*:
[  152.417127] panfrost ffe4.gpu: gpu sched timeout, js=0, config=0x7300, 
status=0x58, head=0x3091400, tail=0x3091400, sched_job=4d83c2d7
[  152.530928] panfrost ffe4.gpu: js fault, js=1, status=INSTR_INVALID_ENC, 
head=0x30913c0, tail=0x30913c0
[  152.539797] panfrost ffe4.gpu: gpu sched timeout, js=1, config=0x7300, 
status=0x51, head=0x30913c0, tail=0x30913c0, sched_job=38cecaf6
[  156.943505] panfrost ffe4.gpu: js fault, js=0, status=TILE_RANGE_FAULT, 
head=0x3091400, tail=0x3091400

but, with this patchset, I get the following fps with kmscube:
Rendered 97 frames in 2.016291 sec (48.108145 fps)
Rendered 206 frames in 4.016723 sec (51.285584 fps)
Rendered 316 frames in 6.017208 sec (52.516052 fps)
Rendered 430 frames in 8.017456 sec (53.632975 fps)

but when I resurrect my BROKEN_NS patchset (simply disabling shareability), I 
get:
Rendered 120 frames in 2.000143 sec (59.995724 fps)
Rendered 241 frames in 4.016760 sec (59.998605 fps)
Rendered 362 frames in 6.033443 sec (59.998911 fps)
Rendered 482 frames in 8.033531 sec (59.998522 fps)

So I get a performance regression with the dma-coherent approach, even if it's
clearly the cleaner.

So:
Tested-by: Neil Armstrong 

Neil

> 
> It probably makes sense for patches #1 and #2 to stay together and both
> go via drm-misc, provided Will's OK with that.
> 
> Robin.
> 
> 
> Robin Murphy (3):
>   iommu/io-pgtable-arm: Support coherency for Mali LPAE
>   drm/panfrost: Support cache-coherent integrations
>   arm64: dts: meson: Describe G12b GPU as coherent
> 
>  arch/arm64/boot/dts/amlogic/meson-g12b.dtsi | 4 
>  drivers/gpu/drm/panfrost/panfrost_device.h  | 1 +
>  drivers/gpu/drm/panfrost/panfrost_drv.c | 2 ++
>  drivers/gpu/drm/panfrost/panfrost_gem.c | 2 ++
>  drivers/gpu/drm/panfrost/panfrost_mmu.c | 1 +
>  drivers/iommu/io-pgtable-arm.c  | 5 -
>  6 files changed, 14 insertions(+), 1 deletion(-)
> 

___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel