Would it be possible to 'harden' the BOINC server operations console, please? I'm thinking about adding prominent warnings against common - but far-reaching - errors, and perhaps even some sanity checking on inputs.
Originally, BOINC was developed as a cheap, lightweight platform which would allow under-funded - even unfunded - scientific researchers to tap into the supercomputer-like power of what we now know as cloud or distributed computing. In those days, the concept of a single individual acting as 'Project Scientist' - defining the research, coding the science application, and administering the server - was feasible. For small, simple projects - perhaps with a single application, running on a limited subset of the major platforms - that might still be the case. But larger projects have needed to diversify, with application development no longer in the hands of server administration staff: and in some projects, server administration - which has developed into a specialism in its own right - might not even take place on the same continent as application development.

I'm prompted to write by yet another example of a (potential) disaster in the making. It involves a Beta app. It always seems to involve a Beta app, and the BOINC server software supports the concept of declaring an app to be Beta on a live production server. This may be a mistake.

An application developer in the late Beta stages may want to iron the last few bugs out of a misbehaving application quickly. This can require quick incremental upgrades of the application itself (maybe he or she wants to add some extra debug output), quick turnaround of artificially shortened test workunits, and so on. The temptation is for the server administrator to hand over the keys to the application developer, and allow him or her to deploy Beta application versions unaided - especially outside standard administration hours in the country where the server is located. But development and administration are separate and distinct skills, and should not be confused.

The case which first started me thinking along these lines - now thankfully solved - was the project which deployed several versions of a CPU-only, multi-threaded application under an ATI-GPU plan class. [I think the original administrator had moved on to pastures new, the stand-in wasn't fully up to speed, and the developer didn't know how much that bit of administration mattered...]

The current problem - and I apologise in advance, but I'm going to have to name names for the explanation to make any sense at all - involves GPUGrid. GPUGrid have massive computational needs, and as such have specialised in GPU processing, specifically on NVidia cards. Their main application is described as "Long runs (8-12 hours on fastest card)" - and they really do mean hours, even on the best-performing Kepler-class cards. That adds its own complexities when testing new applications, of course: and it also puts pressure on the project to support newer generations of hardware as soon as they become available in the shops: first Fermi, then Kepler, now Titan. That pressure is a two-way street - the extra computing power is useful for the project researchers, and volunteers are anxious to bring their fastest and newest GPUs to the party.

All of which brings into play the scenario I described above: rapid deployment of Beta applications, a mixture of shortened and full-runtime testing, and deployment by an application developer not fully trained in the nuances of server administration. How well does the BOINC server software environment cope? Badly.
The problem that I've seen most often - and repeated again here - is that application developers don't appreciate how important - how critical, even - <rsc_fpops_est> is to the proper operation of http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation , and from there to the proper operation of the closely-integrated http://boinc.berkeley.edu/trac/wiki/CreditNew . Mistakes in this area tend to grow into a firestorm on the project message boards...

In this case, the project has settled on tasks with a standardised estimate of <rsc_fpops_est>5000000000000000</rsc_fpops_est> - that's five PetaFpops, if I've put the commas in the right place - and those tasks run for about 10 hours on my GTX 670. But in the current rapid Beta phase - working to get CUDA 5.5 running on Titan hardware - I've seen shortened tasks with runtimes of 100 minutes, 10 minutes, even 1 minute: but all with the same 5 PetaFpops estimate. It's a simple mistake, a common mistake, an obvious mistake, and I don't blame an application developer under stress for making it. But it's the sort of mistake BOINC reacts very badly to.

GPU task runtime estimation relies - critically - on accurate determination of the speed of the hardware. My GTX 670 is given - in units of what I call "advertising FLOPS" - a speed rating of 2915 GFLOPS peak by the BOINC client (although I don't think that number is used by the server anywhere). Instead, the initial speed estimate for a new app_version on a new host is derived from the CPU speed:

    <app_name>acemdbeta</app_name>
    <version_num>809</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.350000</avg_ncpus>
    <max_ncpus>0.666596</max_ncpus>
    <flops>294520563362.637020</flops>
    <plan_class>cuda55</plan_class>

- approaching 300 GigaFlops. But after the first 100 or so tasks have been returned - by, naturally, the fastest hosts with the fastest GPUs - the project has an app_version average which is applied to new hosts:

    <app_name>acemdbeta</app_name>
    <version_num>810</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.350000</avg_ncpus>
    <max_ncpus>0.666596</max_ncpus>
    <flops>65779930729483.820000</flops>
    <plan_class>cuda55</plan_class>

- over 65 TeraFlops.

Only after each host has 'completed' (success and validation) 10 tasks is the host's own performance assessed and used. And in cases like this, it matters critically whether those first 10 tasks are of the 1-minute or the 10-hour variety: a 1-minute task still carrying the 5 PetaFpops estimate makes the host look like an 83 TeraFlops machine (there's a worked example at the end of this message). And if the first 10 tasks all take 10 hours, we're on to the next Beta app and the cycle starts all over again.

Look at the history of 'APR' (Average Processing Rate) for my host, noting in particular the number of 'completed' tasks the average has been computed over each time: http://www.gpugrid.net/host_app_versions.php?hostid=132158 - that's the same GPU, unchanged since I built the machine.

I'll leave you to read the project message boards to discover all of the consequences. Unhappy volunteers, and tasks aborted with EXIT_TIME_LIMIT_EXCEEDED, are two of the more obvious ones.

Perhaps as a simple first step, the whole of RuntimeEstimation should be automatically disabled when an installed application is designated as Beta?
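To make the arithmetic concrete: as I understand http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation , the server works out a processing rate for each completed task as roughly rsc_fpops_est divided by elapsed time. Here's a minimal sketch of what that does with my numbers - the helper name is mine, for illustration only:

    #include <cstdio>

    // Apparent processing rate implied by one completed task:
    // claimed work divided by measured elapsed time.
    // (Relationship as per the RuntimeEstimation wiki; name is illustrative.)
    static double apparent_flops(double rsc_fpops_est, double elapsed_secs) {
        return rsc_fpops_est / elapsed_secs;
    }

    int main() {
        const double FPOPS_EST = 5e15;  // the project's standard 5 PetaFpops estimate

        // A full-length task: ~10 hours on my GTX 670.
        printf("10-hour task -> %.0f GFLOPS (plausible)\n",
               apparent_flops(FPOPS_EST, 10 * 3600.0) / 1e9);

        // Shortened Beta tasks, same unchanged estimate.
        printf("100-min task -> %.2f TFLOPS\n", apparent_flops(FPOPS_EST, 100 * 60.0) / 1e12);
        printf("10-min task  -> %.1f TFLOPS\n", apparent_flops(FPOPS_EST, 10 * 60.0) / 1e12);
        printf("1-min task   -> %.1f TFLOPS (nonsense)\n", apparent_flops(FPOPS_EST, 60.0) / 1e12);
        return 0;
    }

That prints 139 GFLOPS for the genuine 10-hour task, and 0.83, 8.3 and 83.3 TFLOPS for the shortened ones - which is exactly the territory of the 65 TeraFlops app_version average quoted above.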
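The EXIT_TIME_LIMIT_EXCEEDED aborts follow directly from the same numbers. As I understand it, the client abandons a task once it has run for roughly rsc_fpops_bound / flops seconds, so an inflated flops figure shrinks the permitted wall-clock time. Another sketch - the 10x bound multiplier here is my assumption for illustration, not GPUGrid's actual setting:

    #include <cstdio>

    int main() {
        const double FPOPS_EST   = 5e15;             // standard task estimate
        const double FPOPS_BOUND = 10 * FPOPS_EST;   // assumed bound; projects set their own

        // The app_version average poisoned by short Beta tasks (from the
        // second fragment quoted above), versus the card's real rate.
        const double POISONED_FLOPS = 6.58e13;  // ~65.8 TFLOPS
        const double REAL_FLOPS     = 1.39e11;  // ~139 GFLOPS (10 h per task)

        double allowed = FPOPS_BOUND / POISONED_FLOPS;  // seconds before the abort
        double needed  = FPOPS_EST / REAL_FLOPS;        // seconds a full task takes

        printf("allowed: %.0f s (~%.0f minutes)\n", allowed, allowed / 60);  // ~760 s
        printf("needed:  %.0f s (~%.0f hours)\n", needed, needed / 3600);    // ~10 h
        return 0;
    }

A genuine 10-hour task gets killed after about 13 minutes. Multiply that by every new host attaching to the Beta app, and the message-board firestorm writes itself.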
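As for that closing suggestion: I don't have a patch, but the shape of the change feels small - decline to trust the runtime statistics while an app is flagged Beta, and stay on the pessimistic CPU-derived figure. A hypothetical sketch only; the function and field names below are illustrative, not the real scheduler code:

    #include <cstdio>

    // Hypothetical guard, sketched against the idea rather than the
    // actual BOINC sources; all names here are illustrative.
    double estimated_flops(bool app_is_beta, int completed_jobs,
                           double host_apr,          // host's own Average Processing Rate
                           double av_average_flops,  // app_version average across hosts
                           double cpu_derived_flops) // conservative initial estimate
    {
        if (app_is_beta) {
            // Beta apps: runtimes and rsc_fpops_est can't be trusted yet,
            // so ignore both the host APR and the app_version average.
            return cpu_derived_flops;
        }
        if (completed_jobs >= 10) {
            return host_apr;           // host has proved its own rate
        }
        return av_average_flops;       // fall back to the project-wide average
    }

    int main() {
        // A new Titan host on a Beta app stays at the safe CPU-derived rate:
        printf("%.0f GFLOPS\n", estimated_flops(true, 0, 0, 6.58e13, 2.9e11) / 1e9);
        return 0;
    }

Even just a prominent warning at deployment time - along the lines of the input sanity-checking I asked for at the top of this message - would have caught both of the cases I've described.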
