On 13.07.19 at 22:24, Felix Kuehling wrote:
On 2019-04-30 at 1:03 p.m., Koenig, Christian wrote:

The only real solution I can see is to be able to reliably kill shaders
in an OOM situation.


Well, we can in fact preempt our compute shaders with low latency.
Killing a KFD process will do exactly that.


I've taken a look at that thing as well and to be honest it is not even
remotely sufficient.

We need something which stops the hardware *immediately* from accessing
system memory, without waiting for the SQ to kill all waves, flush caches
etc...

One possibility I've been playing around with for a while is to replace the
root PD for the VMIDs in question on the fly, e.g. we just let it point
to some dummy which redirects everything into nirvana.

But implementing this is easier said than done...
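
To make the idea concrete, here is a minimal toy model of the root PD swap in
plain C. It is only a sketch under stated assumptions: struct toy_vmid_table,
toy_redirect_vmids() and the table size are hypothetical names for
illustration and are not part of amdgpu or any hardware interface. It only
shows the bookkeeping, not how the hardware would be made to pick up the new
pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_VMIDS 16  /* arbitrary size chosen for this sketch */

/* Hypothetical per-VMID table of root page directory addresses. */
struct toy_vmid_table {
    uint64_t root_pd[TOY_NUM_VMIDS];
};

/*
 * Point every VMID selected in vmid_mask at dummy_pd, so that any
 * further translation through those VMIDs ends up in the dummy.
 * Returns the number of VMIDs that were redirected.
 */
unsigned int toy_redirect_vmids(struct toy_vmid_table *t, uint16_t vmid_mask,
                                uint64_t dummy_pd)
{
    unsigned int redirected = 0;

    assert(t != NULL);

    for (unsigned int i = 0; i < TOY_NUM_VMIDS; i++) {
        if (!(vmid_mask & (1u << i)))
            continue;               /* VMID not affected, leave it alone */
        t->root_pd[i] = dummy_pd;   /* swap the root PD on the fly */
        redirected++;
    }
    return redirected;
}
```

The point of the mask is that only the VMIDs of the process being killed are
touched, while every other VMID keeps its real page tables.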

Warming up this thread, since I just fixed another bug that was enabled by 
artificial memory pressure due to the GTT limit.

I think disabling the PD for the VMIDs is a good idea. A problem is that HWS 
firmware updates PD pointers in the background for its VMIDs. So this would 
require a reliable and fast way to kill the HWS first.

Well, we don't necessarily need to completely kill the HWS. What we need is to 
suspend it, kill a specific process and resume it later on.

As far as I can see the concept with the HWS interaction was to use a ring 
buffer with async feedback when something is done.

That is really convenient for performant and reliable operation, but 
unfortunately not if you need to kill off some processing immediately.

So something like setting a bit in a register to suspend the HWS, kill the 
VMIDs, set a flag in the HWS runlist to stop it from scheduling a specific 
process once more and then resume the HWS is what is needed here.
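
The sequence described above can be sketched as a small toy model in plain C.
Everything here is an assumption for illustration: struct toy_hws, struct
toy_runlist_entry, the no_schedule flag and toy_kill_process() are
hypothetical names and do not reflect the real HWS firmware interface or
runlist layout. The model is single threaded, so the suspend and resume steps
only mark where the real scheduler would have to be stopped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PROCS 4  /* arbitrary size chosen for this sketch */

/* Hypothetical runlist entry, not the real HWS runlist format. */
struct toy_runlist_entry {
    int  pasid;        /* process identifier */
    int  vmid;         /* VMID currently assigned, or -1 if none */
    bool no_schedule;  /* tells the scheduler to skip this process */
};

struct toy_hws {
    bool suspended;    /* stands in for a suspend bit in a register */
    struct toy_runlist_entry runlist[TOY_MAX_PROCS];
};

/*
 * Kill one process while the scheduler is paused.
 * Returns 0 on success, -1 if pasid is not in the runlist.
 */
int toy_kill_process(struct toy_hws *hws, int pasid)
{
    int ret = -1;

    assert(hws != NULL);

    hws->suspended = true;              /* 1. suspend the HWS */

    for (size_t i = 0; i < TOY_MAX_PROCS; i++) {
        struct toy_runlist_entry *e = &hws->runlist[i];

        if (e->pasid != pasid)
            continue;
        e->vmid = -1;                   /* 2. kill the VMID */
        e->no_schedule = true;          /* 3. never schedule it again */
        ret = 0;
    }

    hws->suspended = false;             /* 4. resume the HWS */
    return ret;
}
```

The ordering is what matters here: the flag is set before the resume, so the
scheduler never gets a chance to hand the killed process a fresh VMID.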


An alternative I thought about is disabling bus access at the BIF level if 
that's possible somehow. Basically we would instantaneously kill all GPU system 
memory access, signal all fences or just remove all fences from all BO 
reservations (reservation_object_add_excl_fence(resv, NULL)) to allow memory to 
be freed, let the OOM killer do its thing, and when the dust settles, reset the 
GPU.

Yeah, thought about that as well. The problem with this approach is that it is 
rather invasive.

E.g. stopping the BIF means stopping it for everybody, not just the process 
which is currently being killed, and when we reset the GPU it is actually quite 
likely that we lose the content of VRAM.

Regards,
Christian.


Regards,
  Felix

Regards,
Christian.


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
