Uff, well that's quite a problem you ran into here.

An IOMMU might not help here: if the GPU is the culprit, we would have created the mapping once, and the page in question is never unmapped afterwards (IIRC), so a stray write to it would not fault.

To confirm whether it is really the GPU writing those bytes, I would add a trace point to amdgpu_ttm_tt_populate() to see which pages the GPU was assigned.

If the page with the corruption is not in that list, it is unlikely (but not impossible) that the GPU is the one doing the corruption.
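
The comparison could be scripted roughly as follows. This is only a sketch: the trace line format (a `pfn=0x...` field after the function name) is hypothetical, since the trace point does not exist yet, and it assumes 4 KiB pages and that the corrupted address has already been translated to a physical address.

```python
import re

PAGE_SHIFT = 12  # assumption: 4 KiB pages

# Hypothetical trace line format; a real trace point would define its own fields.
PFN_RE = re.compile(r"amdgpu_ttm_tt_populate: .*?pfn=(0x[0-9a-fA-F]+)")


def gpu_pfns(trace_lines):
    """Collect the page frame numbers reported by the (hypothetical) trace point."""
    pfns = set()
    for line in trace_lines:
        match = PFN_RE.search(line)
        if match:
            pfns.add(int(match.group(1), 16))
    return pfns


def page_was_gpu_mapped(phys_addr, pfns):
    """True if the page containing phys_addr was ever handed to the GPU."""
    return (phys_addr >> PAGE_SHIFT) in pfns
```

A miss here only makes the GPU less likely as the culprit; it does not rule it out.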

Good luck,
Christian.

On 14.02.2017 at 14:20, Nicolai Hähnle wrote:
Hi all,

on an amd-staging-4.9 kernel with lock debugging and KASAN enabled, I am seeing a bug where I suspect that the GPU may be writing into system memory where it shouldn't.

I can reproduce errors fairly reliably by running a parallel piglit run on 8 cores with Tonga.

See exhibit1 and exhibit2 for two of the errors that were reported. As you can see, poison data of a dead object was overwritten.

If that was done by a use-after-free in kernel code, I would expect to see a KASAN error about it, but I don't. Furthermore, the pattern of overwritten values is quite unusual: single bytes, with an 8-byte stride, often all with the same value. This is the kind of pattern that could fit GPU writes to an 8-bit texture.
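
To check many reports for this pattern, something like the following could be run over the raw object bytes. It is a minimal sketch, assuming the dump covers exactly one freed object filled with the usual SLUB poison (0x6b, with a final 0xa5); the function names are made up for illustration, and the exhibits may be laid out differently.

```python
POISON_FREE = 0x6B  # SLUB fills a freed object with this byte
POISON_END = 0xA5   # and marks the last byte of the object with this one


def overwritten_offsets(dump):
    """Offsets whose byte no longer matches the expected poison pattern."""
    last = len(dump) - 1
    return [
        i for i, b in enumerate(dump)
        if b != (POISON_END if i == last else POISON_FREE)
    ]


def looks_like_strided_byte_writes(dump, stride=8):
    """True if at least two bytes were overwritten, all at the same offset modulo stride."""
    offsets = overwritten_offsets(dump)
    if len(offsets) < 2:
        return False
    return len({off % stride for off in offsets}) == 1


def overwritten_values(dump):
    """The set of distinct byte values found at the overwritten offsets."""
    return {dump[i] for i in overwritten_offsets(dump)}
```

A single distinct value in `overwritten_values()` together with a positive stride check would match the pattern described above.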

See kasan-corrupted for another type of report that I've seen. This report looks like KASAN's internal data structures were corrupted, leading to a crash.

Needless to say, while I can reproduce those crashes fairly reliably, they are totally non-deterministic.

So the question is how to figure out where the bad memory writes happen.

I noticed that the IOMMU on the system was disabled by the BIOS, so I enabled it, in the hopes that that would catch bad GPU behavior.

Well, this leads to lots of IO_PAGE_FAULT messages during the amdgpu module initialization (see dmesg-iommu). When running piglit, however, I get the same type of random memory corruption errors / crashes as before, and no IOMMU errors.

Any ideas on (a) what kind of tools could be helpful in tracking this problem down (if any...), and (b) where in the code the problem lies?

I suspect something's wrong with GART mappings when buffers are moved, but that's pretty vague...

Thanks,
Nicolai


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

