On Wed, May 3, 2023, 14:53 André Almeida <andrealm...@igalia.com> wrote:
> Em 03/05/2023 14:08, Marek Olšák escreveu:
> > GPU hangs are pretty common post-bringup. They are not common per user,
> > but if we gather all hangs from all users, we can have lots and lots of
> > them.
> >
> > GPU hangs are indeed not very debuggable. There are however some things
> > we can do:
> > - Identify the hanging IB by its VA (the kernel should know it)
>
> How can the kernel tell which VA range is being executed? I only found
> that information in the mmCP_IB1_BASE_ regs, but as stated in this
> thread by Christian, these are not reliable to read.
>

The kernel receives the VA and the size via the CS ioctl. When user
queues are enabled, the kernel will no longer receive them.

> > - Read and parse the IB to detect memory corruption.
> > - Print active waves with shader disassembly if the SQ isn't hung
> > (often it's not).
> >
> > Determining which packet the CP is stuck on is tricky. The CP has 2
> > engines (one frontend and one backend) that work on the same command
> > buffer. The frontend engine runs ahead, executes some packets and
> > forwards others to the backend engine. Only the frontend engine has
> > the command buffer VA somewhere. The backend engine only receives
> > packets from the frontend engine via a FIFO, so it might not be
> > possible to tell where it's stuck if it's stuck.
>
> Do they run asynchronously, or does the frontend wait for the backend
> to execute?
>

They run asynchronously and should run asynchronously for performance,
but they can be synchronized using a special packet (PFP_SYNC_ME).

Marek

> >
> > When the gfx pipeline hangs outside of shaders, making a scandump
> > seems to be the only way to have a chance at finding out what's going
> > wrong, and only AMD-internal versions of hw can be scanned.
> >
> > Marek
> >
> > On Wed, May 3, 2023 at 11:23 AM Christian König
> > <ckoenig.leichtzumer...@gmail.com> wrote:
> >
> > Am 03.05.23 um 17:08 schrieb Felix Kuehling:
> > > Am 2023-05-03 um 03:59 schrieb Christian König:
> > >> Am 02.05.23 um 20:41 schrieb Alex Deucher:
> > >>> On Tue, May 2, 2023 at 11:22 AM Timur Kristóf
> > >>> <timur.kris...@gmail.com> wrote:
> > >>>> [SNIP]
> > >>>>>>>> In my opinion, the correct solution to those problems would
> > >>>>>>>> be if the kernel could give userspace the necessary
> > >>>>>>>> information about a GPU hang before a GPU reset.
> > >>>>>>>>
> > >>>>>>> The fundamental problem here is that the kernel doesn't have
> > >>>>>>> that information either. We know which IB timed out and can
> > >>>>>>> potentially do a devcoredump when that happens, but that's it.
> > >>>>>>
> > >>>>>> Is it really not possible to know such a fundamental thing as
> > >>>>>> what the GPU was doing when it hung? How are we supposed to do
> > >>>>>> any kind of debugging without knowing that?
> > >>
> > >> Yes, that's indeed something at least I have tried to figure out
> > >> for years as well.
> > >>
> > >> Basically there are two major problems:
> > >> 1. When the ASIC is hung you can't talk to the firmware engines any
> > >> more, and most state is not exposed directly, but just through some
> > >> fw/hw interface.
> > >> Just take a look at how umr reads the shader state from the SQ.
> > >> When that block is hung you can't do that any more and basically
> > >> have no chance at all to figure out why it's hung.
> > >>
> > >> Same for other engines. I remember once spending a week figuring
> > >> out why the UVD block was hung during suspend. It turned out to be
> > >> a debugging nightmare, because any time you touched any register of
> > >> that block the whole system would hang.
> > >>
> > >> 2. There are tons of things going on in a pipelined fashion or even
> > >> completely in parallel. For example, the CP is just the beginning
> > >> of a rather long pipeline which at the end produces a bunch of
> > >> pixels.
> > >> In almost all cases I've seen, you run into a problem somewhere
> > >> deep in the pipeline and only very rarely at the beginning.
> > >>
> > >>>>>>
> > >>>>>> I wonder what AMD's Windows driver team is doing with this
> > >>>>>> problem, surely they must have better tools to deal with GPU
> > >>>>>> hangs?
> > >>>>> For better or worse, most teams internally rely on scan dumps
> > >>>>> via JTAG, which sort of limits the usefulness outside of AMD,
> > >>>>> but also gives you the exact state of the hardware when it's
> > >>>>> hung, so the hardware teams prefer it.
> > >>>>>
> > >>>> How does this approach scale? It's not something we can ask users
> > >>>> to do, and even if all of us in the radv team had a JTAG device,
> > >>>> we wouldn't be able to play every game that users experience
> > >>>> random hangs with.
> > >>> It doesn't scale or lend itself particularly well to external
> > >>> development, but that's the current state of affairs.
> > >>
> > >> The usual approach seems to be to reproduce a problem in a lab and
> > >> have a JTAG attached to give the hw guys a scan dump, and they can
> > >> then tell you why something didn't work as expected.
> > >
> > > That's the worst-case scenario, where you're debugging HW or FW
> > > issues. Those should be pretty rare post-bringup. But are there
> > > hangs caused by user mode driver or application bugs that are easier
> > > to debug and probably don't even require a GPU reset? For example,
> > > most VM faults can be handled without hanging the GPU. Similarly, a
> > > shader in an endless loop should not require a full GPU reset.
> > > In the KFD compute case, that's still preemptible and the offending
> > > process can be killed with Ctrl-C or debugged with rocm-gdb.
> >
> > We also have infinite-loop-in-shader abort for gfx, and page faults
> > are pretty rare with OpenGL (a bit more often with Vulkan) and can be
> > handled gracefully on modern hw (they just spam the logs).
> >
> > The majority of the problems is unfortunately that we really get hard
> > hangs because of some hw issues. That can be caused by unlucky timing,
> > power management or doing things in an order the hw doesn't expect.
> >
> > Regards,
> > Christian.
> >
> > >
> > > It's more complicated for graphics because of the more complex
> > > pipeline and the lack of CWSR. But it should still be possible to do
> > > some debugging without JTAG if the problem is in SW and not HW or
> > > FW. It's probably worth improving that debuggability without getting
> > > hung up on the worst case.
> > >
> > > Maybe user mode graphics queues will offer a better way of
> > > recovering from these kinds of bugs, if the graphics pipeline can be
> > > unstuck without a GPU reset, just by killing the offending user mode
> > > queue.
> > >
> > > Regards,
> > > Felix
> > >
> > >
> > >>
> > >> And yes that absolutely doesn't scale.
> > >>
> > >> Christian.
> > >>
> > >>>
> > >>> Alex
> > >>
> > >
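For readers following the "read and parse the IB to detect memory corruption" idea above: the kernel learns each IB's GPU virtual address and size from the CS ioctl (the `va_start` and `ib_bytes` fields of `struct drm_amdgpu_cs_chunk_ib` in the amdgpu UAPI), and the IB itself is a stream of PM4 packets whose header dword encodes a type (bits 31:30), a dword count (bits 29:16), and, for type-3 packets, an opcode (bits 15:8). Below is a minimal, illustrative sketch of walking such a stream; the function names are made up for this example and real tools such as umr handle far more cases:

```python
PKT3_NOP = 0x10  # canonical type-3 NOP opcode

def decode_pm4_header(header):
    """Decode one PM4 packet header dword.

    Returns (packet_type, opcode, total_dwords), where total_dwords
    includes the header itself. The opcode field is only meaningful
    for type-3 packets.
    """
    pkt_type = (header >> 30) & 0x3
    count = (header >> 16) & 0x3FFF   # payload dwords minus one
    opcode = (header >> 8) & 0xFF
    return pkt_type, opcode, count + 2

def walk_ib(dwords):
    """Yield (offset, type, opcode) per packet; stop on likely garbage."""
    i = 0
    while i < len(dwords):
        pkt_type, opcode, ndw = decode_pm4_header(dwords[i])
        if pkt_type not in (0, 2, 3):     # type 1 is reserved on modern hw
            break                         # likely corruption: bail out
        yield i, pkt_type, opcode
        # type 2 is a single-dword filler; others carry a payload
        i += 1 if pkt_type == 2 else ndw

# A type-3 NOP with one payload dword: header 0xC0001000,
# followed by an arbitrary payload dword that is never decoded
# as a header.
packets = list(walk_ib([0xC0001000, 0xDEADBEEF]))
```

This kind of offline walk is only a plausibility check: it can flag an IB whose packet stream no longer parses (a strong hint of memory corruption), but it cannot tell you which packet the CP was actually stuck on, for the frontend/backend-engine reasons Marek describes above.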