On Wed, 9 Jan 2019 at 20:07, Ard Biesheuvel <ard.biesheu...@linaro.org> wrote: > > On Wed, 9 Jan 2019 at 16:30, Carsten Haitzler <carsten.haitz...@arm.com> > wrote: > > > > On 09/01/2019 14:48, Ard Biesheuvel wrote: > > > On Wed, 9 Jan 2019 at 15:38, Carsten Haitzler <carsten.haitz...@arm.com> > > > wrote: > > >> On 09/01/2019 14:36, Ard Biesheuvel wrote: > > >> > > >> On Wed, 9 Jan 2019 at 15:34, Carsten Haitzler <carsten.haitz...@arm.com> > > >> wrote: > > >> > > >> On 09/01/2019 14:33, Bero Rosenkränzer wrote: > > >> > > >> On Wed, 9 Jan 2019 at 14:55, Carsten Haitzler <carsten.haitz...@arm.com> > > >> wrote: > > >> > > >> My understanding is that ARM can't do "WC" in a guaranteed way like x86, > > >> so turning it off is the right thing to do anyway, > > >> > > >> My understanding too. > > >> > > >> FWIW I've added the fix to the OpenMandriva distro kernel > > >> https://github.com/OpenMandrivaSoftware/linux/commit/657041c5665c681d4519cf8297e0b8799b929f86 > > >> Let's see if any user starts screaming ;) > > >> > > >> ttyl > > >> bero > > >> > > >> let's see,. i have put in a patch to the internal kernel patch review > > >> before i send off to dri-devel. it's exactly your patch there just with > > >> a commit log explaining why. > > >> > > >> So what exactly is it about x86 style wc that ARM cannot do? > > >> > > >> From Pavel Shamis here at ARM: > > >> > > >> "Short version. > > >> > > >> X86 has well define behavior for WC memory – it combines multiples > > >> consecutive stores (has to be aligned to the cache line ) in 64B cache > > >> line writes over PCIe. > > >> > > >> On Arm WC corresponds to Normal NC. Arm uarch does not do combining to > > >> cache line size. On some uarch we do 16B combining but not cache line. > > >> > > >> The first uarch that will be doing cache line size combining is Aries. > > >> > > >> It is important to note that WC is an opportunistic optimization and > > >> the software/hardware should not make an assumption that it always > > >> “combines” (true for x86 and arm)" > > >> > > > OK, so that only means that ARM WC mappings may behave more like x86 > > > uncached mappings than x86 WC mappings. It does not explain why things > > > break if we use them. > > > > > > The problem with using uncached mappings here is that it breaks use > > > cases that expect memory semantics, for unaligned access or DC ZVA > > > instructions. At least VDPAU on nouveau breaks due to this, and likely > > > many more other use cases as well. > > > > For amdgpu though it works and this is and AMD+Radeon only code path. At > > least it works on the only ARM system I have an AMD GPU plugged into. > > you need the same fix for SynQuacer. Gettign a fix upstream like this > > will alleaviet a reasonable amount of pain for end-users even if not > > perfect. > > > > I do not plan on going any further with this patch because it's for my > > tx2 and that is my ONLY workstation at work and it takes like 10 minutes > > per reboot cycle. I have many things to do and getting my gfx card to a > > working state was the primary focus. Spending days just rebooting to try > > things with something I am not familiar with (thwe ttm mappings) is not > > something I have time for. Looking at the history of other bugs that > > affect WC/UC mappings in radeon/madgpu shows that this is precisely the > > kind of fix that has been done multiple times in the past for x86 and > > obviously some MIPS and PPC systems. there's mountains of precedent that > > this is a quick and simple fix that has been implemented many time in > > the past, so from that point of view I think its a decent fix in and of > > itself when it comes to time vs. reward. > > > > I can confirm that this change fixes all the issues I observed on AMD > Seattle with HD5450 and HD7450 cards which use the Radeon driver (not > the amdpgu one) > > So I will attempt to dig into this a bit further myself, and hopefully > find something that carries over to amdgpu as well, so I may ask you > to test something if I do. > > > It may not be perfect, but it is better than it was and other MIPS/PPC > > and even x86 32bit systems already need this kind of fix. In the same > > way it seems ARM needs it too and no one to date has bothered upstream. > > I'd rather things improve for at least some set of people than they do > > not improve at all for an undefined amount of time. Note that working is > > an improvement to "fast but doesn't work" in my book. :) Don't get me > > wrong. Looking for a better fix in the meantime,if one could exist, is a > > positive thing. It's not something I can get stuck into as above. > > > > I'd just like to see if we can fix properly before we upstream a hack.
OK, so it seems that if drm_arch_can_wc_memory() returns true, the TTM code also prefers uncached WC mappings over cacheable mappings for system memory, resulting in mismatched attributes. I am not sure what the rationale is for this, but I'd be much happier having a fix that turns normal-nc mappings into cacheable ones than into device ones. Could you please give the following patch a spin (and revert the other change)? --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c @@ -180,7 +180,9 @@ void amdgpu_bo_placement_from_domain(struct amdgpu_bo *abo, u32 domain) places[c].fpfn = 0; places[c].lpfn = 0; places[c].flags = TTM_PL_FLAG_SYSTEM; - if (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC) + if (!IS_ENABLED(CONFIG_ARM64) && + !IS_ENABLED(CONFIG_ARM) && + (flags & AMDGPU_GEM_CREATE_CPU_GTT_USWC)) places[c].flags |= TTM_PL_FLAG_WC | TTM_PL_FLAG_UNCACHED; else _______________________________________________ cross-distro mailing list cross-distro@lists.linaro.org https://lists.linaro.org/mailman/listinfo/cross-distro