Gabor Gombas wrote on 23.07.2009 at 11:56:
> On Thu, Jul 23, 2009 at 10:18:33AM +0200, Oliver Bock wrote:
>
>> But that's not all: have a look in client/app_control.cpp at
>> ACTIVE_TASK_SET::exit_tasks() and ACTIVE_TASK::kill_task(). The app is
>> killed five seconds after a normal shutdown was initiated. If the app
>> fails to shut itself down, the client then "kills" the app the hard way
>> and the described problem might still occur...
>>     
>
> The real solution is to bug Nvidia to fix the CUDA framework so a
> crashing/disappearing host application can't cause the GPU to crash/lock
> up.

I'm not sure that this would be feasible. The only way I can imagine to
ensure a proper cleanup would be to manage GPU device memory as part of
the operating system's process management, and I'm not sure that is
possible from within, e.g., a device driver.

> In the meantime, write the controlling application carefully so it
> can react to SIGTERM etc. in a timely manner.
>   

Sure.
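
To make concrete what "react in a timely manner" could look like on the
App side, here is a minimal sketch for POSIX systems. All names
(g_quit_requested, install_quit_handlers, run_until_quit) are made up for
illustration and are not part of the BOINC API. It assumes the compute
loop can be split into short steps, and that the cleanup callback is
where GPU resources would be released.

```cpp
#include <signal.h>

// Set by the signal handler, polled by the compute loop.
static volatile sig_atomic_t g_quit_requested = 0;

// Async-signal-safe handler: it only sets a flag.
extern "C" void on_quit_signal(int) { g_quit_requested = 1; }

// Install the handler for the catchable termination signals.
void install_quit_handlers() {
    struct sigaction sa {};
    sa.sa_handler = on_quit_signal;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGHUP, &sa, nullptr);
    sigaction(SIGTERM, &sa, nullptr);
    sigaction(SIGQUIT, &sa, nullptr);
}

// Hypothetical compute loop: step() does one short unit of work and
// returns false when the work is finished; cleanup() is where device
// memory would be freed. Returns 1 if a signal ended the loop, else 0.
template <class Step, class Cleanup>
int run_until_quit(Step step, Cleanup cleanup) {
    while (!g_quit_requested && step()) {
    }
    cleanup();
    return g_quit_requested ? 1 : 0;
}
```

The handler deliberately does nothing but set a flag, so how quickly the
App reacts depends on how long a single step takes.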

But the problem I see with the current BOINC Client is that an App may
not react to a quit message immediately (i.e. within 5 seconds) for a
number of reasons (including bugs in the BOINC code), and in particular
may be less responsive when it is nice'd. Sending a non-catchable
signal 5 seconds after the message is too harsh and dangerous for
GPU Apps under current conditions.

I'd rather stretch the escalation for GPU Applications, e.g.

- send quit message
- wait 10 seconds
- send a catchable signal (HUP, TERM, QUIT) that can be dealt with in a 
signal handler
- wait 20 seconds
- send a SIGKILL to free at least the CPU
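
The steps above could be sketched roughly as follows for POSIX systems.
This is not the existing exit_tasks() code; stop_gpu_app, wait_for_exit
and ExitStage are hypothetical names. The sketch assumes the quit message
has already been sent by the caller through the Client's usual channel,
that the App is a direct child of the calling process, and it uses
SIGTERM as the catchable signal.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <csignal>
#include <thread>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Which stage of the escalation ended the process.
enum class ExitStage { AfterQuitMessage, AfterSigterm, AfterSigkill };

// Poll for up to `timeout`; returns true once the child is gone.
static bool wait_for_exit(pid_t pid, std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    for (;;) {
        pid_t r = waitpid(pid, nullptr, WNOHANG);
        if (r == pid) return true;                     // exited and reaped
        if (r == -1 && errno != EINTR) return true;    // e.g. ECHILD: already gone
        if (std::chrono::steady_clock::now() >= deadline) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}

// Assumes the quit message was sent just before this call.
ExitStage stop_gpu_app(pid_t pid,
                       std::chrono::milliseconds quit_wait,   // e.g. 10 s
                       std::chrono::milliseconds term_wait) { // e.g. 20 s
    assert(pid > 0);
    if (wait_for_exit(pid, quit_wait)) return ExitStage::AfterQuitMessage;

    kill(pid, SIGTERM);  // catchable: the App's handler can still clean up
    if (wait_for_exit(pid, term_wait)) return ExitStage::AfterSigterm;

    kill(pid, SIGKILL);  // last resort, frees at least the CPU
    // Blocking wait; a process stuck in uninterruptible sleep would hang here.
    while (waitpid(pid, nullptr, 0) == -1 && errno == EINTR) {
    }
    return ExitStage::AfterSigkill;
}
```

With the timings proposed above this would be called as
stop_gpu_app(pid, std::chrono::seconds(10), std::chrono::seconds(20)).
The return value tells the caller which stage ended the process, so an
AfterSigkill result can be treated as an error condition.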

If the SIGKILL actually kills a process that's still there, I'd notify
either the project or the user, because that means that something is
wrong with the App or the machine; maybe mark the current task as Client
Error (because a note on the screen is useless if the display is
corrupted).

In any case I think the current behavior of exit_tasks() (SIGKILL 5 secs 
after a message) is too 'dangerous' for systems running current CUDA Apps.

Best,
Bernd     
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
