Since this is my first post here, I guess I should introduce myself. 
I'm a volunteer developer for PrimeGrid.  I've managed to make a couple 
of native BOINC applications for them, but one of them, a CUDA app, is 
doing strange things in some cases.

Here's the main problem I'm asking about, as it appears in the stderr output:

<core_client_version>6.10.17</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
<stderr_txt>
Sieve started: 420825000000000 <= p < 420826000000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.

Thread 0 completed
Waiting for threads to exit
Sieve complete: 420825000000000 <= p < 420826000000000
count=29694269,sum=0x6a4d4aa8c5ece825
Elapsed time: 1690.46 sec. (0.05 init + 1690.41 sieve) at 591620 p/sec.
Processor time: 10.50 sec. (0.06 init + 10.44 sieve) at 95793042 p/sec.
Average processor utilization: 1.24 (init), 0.01 (sieve)
Sieve started: 420825000000000 <= p < 420826000000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.

Thread 0 completed
Waiting for threads to exit
Sieve complete: 420825000000000 <= p < 420826000000000
count=29694269,sum=0x6a4d4aa8c5ece825
Elapsed time: 1690.48 sec. (0.05 init + 1690.43 sieve) at 591612 p/sec.
Processor time: 10.50 sec. (0.05 init + 10.45 sieve) at 95701374 p/sec.
Average processor utilization: 1.03 (init), 0.01 (sieve)
Sieve started: 420825000000000 <= p < 420826000000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.

Thread 0 completed
Waiting for threads to exit
Sieve complete: 420825000000000 <= p < 420826000000000
count=29694269,sum=0x6a4d4aa8c5ece825
Elapsed time: 1690.48 sec. (0.05 init + 1690.43 sieve) at 591612 p/sec.
Processor time: 10.51 sec. (0.05 init + 10.46 sieve) at 95609881 p/sec.
Average processor utilization: 1.03 (init), 0.01 (sieve)
Sieve started: 420825000000000 <= p < 420826000000000
Thread 0 starting
Detected GPU 0: GeForce 9800 GT
Detected compute capability: 1.1
Detected 14 multiprocessors.

Thread 0 completed
Waiting for threads to exit
Sieve complete: 420825000000000 <= p < 420826000000000
count=29694269,sum=0x6a4d4aa8c5ece825
Elapsed time: 1690.47 sec. (0.05 init + 1690.42 sieve) at 591614 p/sec.
Processor time: 10.49 sec. (0.05 init + 10.44 sieve) at 95793042 p/sec.
Average processor utilization: 1.03 (init), 0.01 (sieve)

</stderr_txt>
]]>

The app completes, apparently successfully, but then for some reason 
BOINC restarts it, again and again. Also, although each of the four runs 
used about 10 CPU-seconds (that's normal for CUDA apps), and the result 
shows 6,724.62 seconds of runtime in total, BOINC recorded 0 seconds of 
CPU time.

If you're interested, the code is v0.1.5, from here: 
http://github.com/Ken-g6/PSieve-CUDA

After that last fprintf in main.c, I do only two things. First, I 
re-raise any SIGINT, SIGTERM, or SIGHUP that the process may have 
received. Second, I call boinc_finish with the argument EXIT_SUCCESS, 
as evidenced by another line in that result.
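
To make that shutdown sequence concrete, here is a rough sketch of the 
pattern. It is a simplified illustration under assumptions, not the actual 
PSieve-CUDA code: the names pending_signal, install_handlers, 
take_pending_signal, shutdown_app, and fake_finish are all hypothetical, 
and fake_finish only stands in for boinc_finish so the sketch compiles 
without the BOINC headers. It also assumes a POSIX system where SIGHUP 
exists.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdlib.h>

/* Last termination signal seen by the handler, or 0 if none arrived. */
static volatile sig_atomic_t pending_signal = 0;

/* Status passed to the stand-in finish function; -1 means "not called". */
int last_finish_status = -1;

/* Handler: only record the signal so the sieve can wind down cleanly. */
static void record_signal(int signum)
{
    pending_signal = signum;
}

/* Trap the three signals mentioned in the post (POSIX assumed). */
void install_handlers(void)
{
    signal(SIGINT, record_signal);
    signal(SIGTERM, record_signal);
    signal(SIGHUP, record_signal);
}

/* Return the recorded signal (0 if none) and clear the record. */
int take_pending_signal(void)
{
    int signum = (int)pending_signal;
    pending_signal = 0;
    return signum;
}

/* Hypothetical stand-in for boinc_finish(). The real function does not
 * return; this one just remembers the status it was given. */
void fake_finish(int status)
{
    last_finish_status = status;
}

/* End-of-run sequence: re-raise a trapped signal with its default action,
 * otherwise report success through the supplied finish function. */
void shutdown_app(void (*finish)(int))
{
    int signum;

    assert(finish != NULL);
    signum = take_pending_signal();
    if (signum != 0) {
        signal(signum, SIG_DFL); /* restore the default disposition */
        raise(signum);           /* the default action normally ends the process here */
    }
    finish(EXIT_SUCCESS);
}
```

In a real app the call would be shutdown_app(boinc_finish) or an 
equivalent direct call. Note that when a signal was trapped, the process 
ends through the re-raised signal and never reaches the finish call, which 
is one reason I'm asking below whether re-raising is a good idea.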

I should add that there are also cases where the app restarts on 
failure. Here's one:

<core_client_version>6.10.58</core_client_version>
<![CDATA[
<message>
Unzulässige Funktion. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
Sieve started: 456189000000000 <= p < 456190000000000
Thread 0 starting
Detected GPU 0: Device Emulation (CPU)
Detected compute capability: 9999.9999
Detected 16 multiprocessors.
Insufficient available memory on GPU 0.
16:42:43 (6472): called boinc_finish
Sieve started: 456189000000000 <= p < 456190000000000
Thread 0 starting
Detected GPU 0: Device Emulation (CPU)
Detected compute capability: 9999.9999
Detected 16 multiprocessors.
Insufficient available memory on GPU 0.
16:45:26 (8132): called boinc_finish

</stderr_txt>
]]>

The reason for the failure isn't relevant here (the app was trying to run 
on the CUDA emulator). What matters is that BOINC ran it again after it 
had already called boinc_finish(1). Plus, no signals are trapped on that 
path.

So why did BOINC restart my app? And why didn't it count any of the 
runtime? Should I not re-raise those signals? I built the app with the 
development files that came with Ubuntu 9.04: version 6.2.18-3ubuntu1. 
Is that too old? Any other ideas?

Thanks!

Ken