Since R570 firmware got enabled, a number of GPUs have not been suspending/resuming successfully. Lyude showed it on RTX6000, which I reproduced, and then I wasted a lot of time going down various rabbit holes.
There are two required fixes. The first adds proper sequence numbers to the RPC messages, which fixes a bunch of NOCAT asserts. The second passes the runtime vs non-runtime suspend state down to the GSP fbsr code. Unfortunately this requires replacing a bool with an enum, which means refactoring quite a few interfaces, but it was the cleanest way to do it. The final patch hooks up the interface so normal suspend doesn't set the GCOFF flags.

Dave.
