On Sat, 2012-04-28 at 16:49 +0200, Marcin Slusarz wrote:
> On Thu, Apr 26, 2012 at 05:32:29PM +1000, Ben Skeggs wrote:
> > On Wed, 2012-04-25 at 23:20 +0200, Marcin Slusarz wrote:
> > > Overall idea:
> > > Detect lockups by watching for timeouts (vm flush / fence), return
> > > -EIOs, handle them at the ioctl level, reset the GPU, and repeat the
> > > last ioctl.
> > >
> > > The GPU reset is done with a suspend / resume cycle plus a few tweaks:
> > > - CPU-only bo eviction
> > > - ignoring vm flush / fence timeouts
> > > - shortening waits
> > Okay.  I've thought about this a bit for a couple of days and think I'll
> > be able to coherently share my thoughts on this issue now :)
> >
> > Firstly, while I agree that we need to become more resilient to errors,
> > I don't think that following in the radeon/intel footsteps with
> > something (imo, hackish) like this is necessarily the right choice for
> > us.
> This is not only the radeon/intel way.  Windows has done the same since
> Vista SP1 - see http://msdn.microsoft.com/en-us/windows/hardware/gg487368.
> It's funny how similar it is to this patch (I hadn't seen this page
> earlier).
Yes, I am aware of this feature in Windows.  And I'm not arguing that
something like it isn't necessary.
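
For reference, the retry-at-ioctl-level flow being discussed boils down
to something like the sketch below.  Every identifier here is invented
for illustration; it is not the actual patch or the nouveau API.

#include <linux/errno.h>
#include <linux/types.h>

struct my_device;	/* stand-in for the driver's device structure */

/* hypothetical helpers: the "real" ioctl body, and the reset path */
long my_do_ioctl(struct my_device *dev, unsigned int cmd, void *data);
int my_gpu_reset(struct my_device *dev);	/* suspend/resume + tweaks */

long my_ioctl(struct my_device *dev, unsigned int cmd, void *data)
{
	long ret;

	for (;;) {
		ret = my_do_ioctl(dev, cmd, data);
		if (ret != -EIO)
			return ret;	/* success, or an unrelated error */

		/* -EIO means a vm flush / fence timeout flagged a lockup
		 * below us: reset the GPU (CPU-only bo eviction, ignore
		 * further timeouts, shortened waits), then replay the
		 * same ioctl */
		if (my_gpu_reset(dev))
			return -EIO;	/* the reset itself failed, give up */
	}
}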

>
> If you fear people will stop reporting bugs - don't.  GPU reset is
> painfully slow and can take up to 50 seconds (bo eviction is the most
> time-consuming part), so people will be annoyed enough to report them.
> Currently, GPU lockups make users so angry that they frequently switch
> to the blob without even thinking about reporting anything.
I'm not so concerned about the lost bug reports; I expect the same
people that are actually willing to report bugs now will continue to do
so :)

> > The *vast* majority of "lockups" we have are a result of us badly
> > mishandling exceptions reported to us by the GPU.  There are a couple
> > of exceptions; however, they're very rare.
> >
> > A very common example is where people gain DMA_PUSHERs for whatever
> > reason, and things go haywire eventually.
> Nope, I had tens of lockups during testing, and only once did I see a
> DMA_PUSHER before detecting the GPU lockup.
Out of curiosity, what were the lockup situations you were triggering
exactly?

> > To handle a DMA_PUSHER sanely, you generally have to drop all pending
> > commands for the channel (set GET=PUT, etc) and continue on.  However,
> > this leaves us with fences and semaphores unsignalled, causing issues
> > further up the stack, with perfectly good channels hanging on attempts
> > to sync with the crashed channel.
> >
> > The next most common example I can think of is nv4x hardware getting a
> > LIMIT_COLOR/ZETA exception from PGRAPH, and then a hang.  The solution
> > is simple: learn how to handle the exception, log it, and PGRAPH
> > survives.
> >
> > I strongly believe that if we focused our efforts on dealing with what
> > the GPU reports to us a lot better, we'd find we really don't need
> > such "lockup recovery".
> While I agree we need to improve our error handling to make "lockup
> recovery" unnecessary, the reality is we can't predict everything, and
> the driver needs to cope with its own bugs.
Right, again, I don't disagree :)  I think we can improve a lot on the
big-hammer-suspend-the-gpu solution though, and instead reset only the
faulting engine.  It's (in theory) almost possible for us to do now, but
I have a couple of reworks to areas related to this pending (basically,
making the various driver subsystems more independent), which should be
ready soon.  This'll go a long way towards making it very easy to reset
a single engine, and will likely result in *far* faster recovery from
hangs.  Rough sketches of both the DMA_PUSHER handling and the
per-engine reset idea are at the end of this mail.

> > I am, however, considering pulling in the vm flush timeout error
> > propagation and the break-out-of-waits-on-signals that builds on it,
> > as we really do need to become better at having killable processes if
> > things go wrong :)
> Good :)
>
> Marcin
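
To make the DMA_PUSHER point concrete, here is roughly what "set
GET=PUT and unblock everyone" means.  All names and register offsets
below are placeholders, not the real driver's.

#include <linux/errno.h>
#include <linux/list.h>
#include <linux/types.h>

#define MY_DMA_PUT 0x40	/* placeholder offsets, not real hardware */
#define MY_DMA_GET 0x44

struct my_fence {
	struct list_head head;
};

struct my_channel {
	struct list_head fences;	/* fences emitted on this channel */
};

u32 chan_rd32(struct my_channel *chan, u32 reg);
void chan_wr32(struct my_channel *chan, u32 reg, u32 val);
void my_fence_signal(struct my_fence *fence, int error);

static void my_channel_kill(struct my_channel *chan)
{
	struct my_fence *fence, *tmp;

	/* drop everything the channel still had queued: GET = PUT */
	chan_wr32(chan, MY_DMA_GET, chan_rd32(chan, MY_DMA_PUT));

	/* the easy-to-forget part: fences on the dead channel will never
	 * signal by themselves, so force them through with an error, or
	 * perfectly good channels hang trying to sync with this one */
	list_for_each_entry_safe(fence, tmp, &chan->fences, head)
		my_fence_signal(fence, -EIO);
}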
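
And per-engine reset could look something like this, assuming the
pending rework gives each engine independent init/fini hooks (again,
hypothetical names, not an existing interface):

struct my_engine {
	int (*init)(struct my_engine *);
	int (*fini)(struct my_engine *);
};

struct my_device {
	struct my_engine *engine[16];	/* PFIFO, PGRAPH, ... */
};

static int my_engine_reset(struct my_device *dev, int engnr)
{
	struct my_engine *eng = dev->engine[engnr];
	int ret;

	/* stop only the faulting engine... */
	ret = eng->fini(eng);
	if (ret)
		return ret;

	/* ...and bring it straight back up.  No system-wide bo eviction,
	 * so this should be far faster than a full suspend/resume cycle */
	return eng->init(eng);
}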