Hi all,
I have a problem that can be split into pieces of different sizes.
Essentially, the larger the size is, the more efficiently it runs.
However, on Windows (and I understand similar limits exist on Linux) a
single GPU kernel launch on a display device cannot run for more than
about 5 seconds on XP or 2 seconds on Vista/Win7 before the driver's
watchdog (called Timeout Detection and Recovery, or TDR, on Vista and
later) terminates it and raises an error (also causing the screen to
flash). My problem is that I want to run my kernels for as long as
possible for maximum efficiency, but I don't know in advance how long a
launch will take as a function of problem size. I could profile my
kernels and derive a heuristic that would probably work,
but this is for a software package that will be used by third parties,
and I'd like it to be handled automatically (and preferably without the
screen flashes, which will disturb users).
Has anyone worked out a good way of dealing with this?
One option is to increase the TDR window as detailed in:
http://www.microsoft.com/whdc/device/display/wddm_timeout.mspx
This might have adverse effects though, and I'm not sure all users of my
package would be happy changing these values (it's also not automatic).
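For reference, as far as I can tell from that page the change boils down
to a couple of registry values under the GraphicsDrivers key (I'm going
from my reading of the document, so please double-check the names and
defaults before relying on this):

```reg
Windows Registry Editor Version 5.00

; Raise the TDR timeout from the default 2 seconds to 10 seconds
; (Vista/Win7 WDDM drivers; a reboot is required to take effect).
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
"TdrDelay"=dword:0000000a
```

Of course this only postpones the problem, and asking users to edit the
registry is exactly the kind of manual step I'd like to avoid.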
Another option is to have two GPUs, one of which is not attached to a
monitor and only used in compute mode (as discussed at
http://forums.nvidia.com/index.php?showtopic=171630). Again, that's
fine for me (I have two), but not so good for users, many of whom will
presumably have only one.
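On that note, the driver does at least let you detect which situation
you're in: devices subject to the watchdog report a KERNEL_EXEC_TIMEOUT
attribute of 1, which PyCUDA exposes. A rough sketch of picking a
"safe" device where one exists (untested here, and it obviously only
helps users who do have a second, non-display GPU):

```python
def pick_compute_device():
    """Prefer a CUDA device not subject to the run-time watchdog
    (i.e. one not driving a display); fall back to device 0."""
    import pycuda.driver as cuda  # deferred so the sketch stands alone
    cuda.init()
    devices = [cuda.Device(i) for i in range(cuda.Device.count())]
    # KERNEL_EXEC_TIMEOUT is 1 when the watchdog applies to the device
    free = [d for d in devices
            if not d.get_attribute(
                cuda.device_attribute.KERNEL_EXEC_TIMEOUT)]
    return (free or devices)[0]
```

At least that way the package could warn the user up front rather than
fail mid-run.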
A final option I thought of is to check for a launch-timeout failure
after each kernel launch and, if one occurs, halve the problem size and
retry, repeating until the launches succeed. The trouble with this
approach is that I'll get multiple failures and screen flashes before it
settles on a size that works, wasting a little time but, more
importantly, being quite alarming to the user. It also doesn't feel very
elegant... ;-)
Any other ideas or experiences dealing with this problem?
Dan
_______________________________________________
PyCUDA mailing list
[email protected]
http://lists.tiker.net/listinfo/pycuda