On Wed, 9 Jan 2019 at 20:07, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote:
>
> On Wed, 9 Jan 2019 at 16:30, Carsten Haitzler <carsten.haitz...@arm.com> 
> wrote:
> >
> > On 09/01/2019 14:48, Ard Biesheuvel wrote:
> > > On Wed, 9 Jan 2019 at 15:38, Carsten Haitzler <carsten.haitz...@arm.com> 
> > > wrote:
> > >> On 09/01/2019 14:36, Ard Biesheuvel wrote:
> > >>
> > >> On Wed, 9 Jan 2019 at 15:34, Carsten Haitzler <carsten.haitz...@arm.com> 
> > >> wrote:
> > >>
> > >> On 09/01/2019 14:33, Bero Rosenkränzer wrote:
> > >>
> > >> On Wed, 9 Jan 2019 at 14:55, Carsten Haitzler <carsten.haitz...@arm.com> 
> > >> wrote:
> > >>
> > >> My understanding is that ARM can't do "WC" in a guaranteed way like x86,
> > >> so turning it off is the right thing to do anyway,
> > >>
> > >> My understanding too.
> > >>
> > >> FWIW I've added the fix to the OpenMandriva distro kernel
> > >> https://github.com/OpenMandrivaSoftware/linux/commit/657041c5665c681d4519cf8297e0b8799b929f86
> > >> Let's see if any user starts screaming ;)
> > >>
> > >> ttyl
> > >> bero
> > >>
> > >> Let's see. I have put a patch into the internal kernel patch review 
> > >> before I send it off to dri-devel. It's exactly your patch, just with 
> > >> a commit log explaining why.
> > >>
> > >> So what exactly is it about x86-style WC that ARM cannot do?
> > >>
> > >> From Pavel Shamis here at ARM:
> > >>
> > >> "Short version.
> > >>
> > >> X86 has well-defined behavior for WC memory: it combines multiple 
> > >> consecutive stores (which have to be aligned to the cache line) into 
> > >> 64B cache-line writes over PCIe.
> > >>
> > >> On Arm, WC corresponds to Normal-NC. Arm uarchs do not combine up to 
> > >> cache-line size. On some uarchs we do 16B combining, but not a full 
> > >> cache line.
> > >>
> > >> The first uarch that will be doing cache line size combining is Aries.
> > >>
> > >>  It is important to note that WC is an opportunistic optimization and 
> > >> the software/hardware should not make an assumption that it always 
> > >> “combines” (true for x86 and arm)"
> > >>
> > > OK, so that only means that ARM WC mappings may behave more like x86
> > > uncached mappings than x86 WC mappings. It does not explain why things
> > > break if we use them.
> > >
> > > The problem with using uncached mappings here is that they break use
> > > cases that expect memory semantics, such as unaligned accesses or DC ZVA
> > > instructions. At least VDPAU on nouveau breaks due to this, and likely
> > > many other use cases as well.
> >
> > For amdgpu though it works, and this is an AMD+Radeon-only code path. At
> > least it works on the only ARM system I have an AMD GPU plugged into.
> > You need the same fix for SynQuacer. Getting a fix upstream like this
> > will alleviate a reasonable amount of pain for end users even if it is not
> > perfect.
> >
> > I do not plan on going any further with this patch because it's for my
> > TX2, which is my ONLY workstation at work, and it takes about 10 minutes
> > per reboot cycle. I have many things to do, and getting my gfx card to a
> > working state was the primary focus. Spending days just rebooting to try
> > things with something I am not familiar with (the TTM mappings) is not
> > something I have time for. Looking at the history of other bugs that
> > affect WC/UC mappings in radeon/amdgpu shows that this is precisely the
> > kind of fix that has been applied multiple times in the past for x86 and
> > obviously some MIPS and PPC systems. There are mountains of precedent that
> > this is a quick and simple fix that has been implemented many times in
> > the past, so from that point of view I think it's a decent fix in and of
> > itself when it comes to time vs. reward.
> >
>
> I can confirm that this change fixes all the issues I observed on AMD
> Seattle with HD5450 and HD7450 cards, which use the Radeon driver (not
> the amdgpu one).
>
> So I will attempt to dig into this a bit further myself, and hopefully
> find something that carries over to amdgpu as well, so I may ask you
> to test something if I do.
>
> > It may not be perfect, but it is better than it was, and other MIPS/PPC
> > and even 32-bit x86 systems already need this kind of fix. In the same
> > way it seems ARM needs it too, and no one to date has bothered to
> > upstream it. I'd rather things improve for at least some set of people
> > than not improve at all for an undefined amount of time. Note that
> > "working" is an improvement over "fast but doesn't work" in my book. :)
> > Don't get me wrong. Looking for a better fix in the meantime, if one
> > exists, is a positive thing. It's just not something I can get stuck
> > into, as explained above.
> >
>
> I'd just like to see if we can fix this properly before we upstream a hack.

OK, so it seems that if drm_arch_can_wc_memory() returns true, the TTM
code also prefers uncached WC mappings over cacheable mappings for
system memory, resulting in mismatched attributes. I am not sure what
the rationale is for this, but I'd be much happier having a fix that
turns normal-nc mappings into cacheable ones than into device ones.

Could you please give the following patch a spin (and revert the other change)?

--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -180,7 +180,9 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain)
                places[c].fpfn = 0;
                places[c].lpfn = 0;
                places[c].flags = TTM_PL_FLAG_SYSTEM;
-               if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)
+               if (!IS_ENABLED(CONFIG_ARM64) &&
+                   !IS_ENABLED(CONFIG_ARM) &&
+                   (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC))
                        places[c].flags |= TTM_PL_FLAG_WC |
                                TTM_PL_FLAG_UNCACHED;
                else
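
For completeness, the other way to express the same idea (an untested sketch at
this point, not something I have run) would be to do it at the arch level rather
than per driver, by teaching drm_arch_can_wc_memory() in include/drm/drm_cache.h
that ARM/arm64 cannot provide x86-style WC. That would cover radeon, amdgpu and
anything else that consults the helper in one place, at the cost of disabling the
WC optimisation for all DRM drivers on these architectures (hunk context elided
since this is only a sketch):

--- a/include/drm/drm_cache.h
+++ b/include/drm/drm_cache.h
@@ ... @@ static inline bool drm_arch_can_wc_memory(void)
 	return false;
 #elif defined(CONFIG_MIPS) && defined(CONFIG_CPU_LOONGSON3)
 	return false;
+#elif defined(CONFIG_ARM) || defined(CONFIG_ARM64)
+	/* Normal-NC does not give x86-style write-combining guarantees */
+	return false;
 #else
 	return true;
 #endif
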
_______________________________________________
cross-distro mailing list
cross-distro@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/cross-distro
