Would it be possible to 'harden' the BOINC server operations console, please? I'm thinking about adding prominent warnings against common - but far-reaching - errors, and perhaps even some sanity checking on inputs.
Originally, BOINC was developed as a cheap, lightweight platform which would allow under-funded - even unfunded - scientific researchers to tap into the supercomputer-like power of what we now know as cloud or distributed computing. In those days, the concept of a single individual acting as 'Project Scientist' - defining the research, coding the science application, and administering the server - was feasible. For small, simple projects - perhaps with a single application, running on a limited subset of the major platforms - that might still be the case. But larger projects have needed to diversify, with application development no longer in the hands of server administration staff: and in some projects, server administration - which has developed into a specialism in its own right - might not even take place on the same continent as application development.

I'm prompted to write by yet another example of a (potential) disaster in the making. It involves a Beta app. It always seems to involve a Beta app, and the BOINC server software supports the concept of declaring an app to be Beta on a live production server. This may be a mistake.

An application developer in the late Beta stages may want to iron the last few bugs out of a misbehaving application quickly. This can require quick incremental upgrades of the application itself (maybe he or she wants to add some extra debug output), quick turnaround of artificially shortened test workunits, and so on. The temptation is for the server administrator to hand over the keys to the application developer, and allow him or her to deploy Beta application versions unaided - especially outside standard administration hours in the country where the server is located. But development and administration are separate and distinct skills, and should not be confused.

The case which first started me thinking along these lines - now thankfully solved - was the project which deployed several versions of a CPU-only, multi-threaded application under an ATI-GPU plan class. [I think the original administrator had moved on to pastures new, the stand-in wasn't fully up to speed, and the developer didn't know how much that bit of administration mattered...]

The current problem - and I apologise in advance, but I'm going to have to name names for the explanation to make any sense at all - involves GPUGrid. GPUGrid have massive computational needs, and as such have specialised in GPU processing, specifically on NVidia cards. Their main application is described as "Long runs (8-12 hours on fastest card)" - and they really do mean hours, even on the best-performing Kepler-class cards. That adds its own complexities when testing new applications, of course: and it also puts pressure on the project to support newer generations of hardware as soon as they become available in the shops: first Fermi, then Kepler, now Titan. That pressure is a two-way street - the extra computing power is useful for the project researchers, and volunteers are anxious to bring their fastest and newest GPUs to the party.

All of which brings into play the scenario I described above: rapid deployment of Beta applications, a mixture of shortened and full-runtime testing, and deployment by an application developer not fully trained in the nuances of server administration. How well does the BOINC server software environment cope? Badly.
The problem that I've seen most often - and repeated again here - is that application developers don't appreciate how important - how critical, even - <rsc_fpops_est> is to the proper operation of http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation , and from there to the proper operation of the closely-integrated http://boinc.berkeley.edu/trac/wiki/CreditNew . Mistakes in this area tend to grow into a firestorm on the project message boards...

In this case, the project has settled on tasks with a standardised estimate of <rsc_fpops_est>5000000000000000</rsc_fpops_est> - that's five PetaFpops, if I've put the commas in the right place - and those tasks run for about 10 hours on my GTX 670. But in the current rapid Beta phase - working to get CUDA 5.5 running on Titan hardware - I've seen shortened tasks with runtimes of 100 minutes, 10 minutes, even 1 minute: but all with the same 5 PetaFpops estimate. It's a simple mistake, a common mistake, an obvious mistake, and I don't blame an application developer under stress for making it. But it's the sort of mistake BOINC reacts very badly to.

GPU task runtime estimation relies - critically - on accurate determination of the speed of the hardware. My GTX 670 is given - in units of what I call "advertising FLOPS" - a speed rating of 2915 GFLOPS peak by the BOINC client (although I don't think that number is used by the server anywhere). Instead, the initial speed estimate for a new app_version on a new host is derived from the CPU speed:

    <app_name>acemdbeta</app_name>
    <version_num>809</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.350000</avg_ncpus>
    <max_ncpus>0.666596</max_ncpus>
    <flops>294520563362.637020</flops>
    <plan_class>cuda55</plan_class>

- approaching 300 GigaFlops. But after the first 100 or so tasks have been returned - by, naturally, the fastest hosts with the fastest GPUs - the project has an app_version average which is applied to new hosts:

    <app_name>acemdbeta</app_name>
    <version_num>810</version_num>
    <platform>windows_intelx86</platform>
    <avg_ncpus>0.350000</avg_ncpus>
    <max_ncpus>0.666596</max_ncpus>
    <flops>65779930729483.820000</flops>
    <plan_class>cuda55</plan_class>

- over 65 TeraFlops.

Only after each host has 'completed' (success and validation) 10 tasks is the host's own performance assessed and used. And in cases like this, it matters critically whether those first 10 tasks are of the 1-minute or the 10-hour variety: a 1-minute task still carrying the 5 PetaFpops estimate makes the host look like an 83 TeraFlops machine (there's a worked example at the end of this message). And if the first 10 tasks all take 10 hours, we're on to the next Beta app and the cycle starts all over again.

Look at the history of 'APR' (Average Processing Rate) for my host, noting in particular the number of 'completed' tasks the average has been computed over each time: http://www.gpugrid.net/host_app_versions.php?hostid=132158 - that's the same GPU, unchanged since I built the machine.

I'll leave you to read the project message boards to discover all of the consequences. Unhappy volunteers, and tasks aborted with EXIT_TIME_LIMIT_EXCEEDED, are two of the more obvious ones.

Perhaps as a simple first step, the whole of RuntimeEstimation should be automatically disabled when an installed application is designated as Beta?
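To make the arithmetic concrete: as I understand http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation , the server works out a processing rate for each completed task as roughly rsc_fpops_est divided by elapsed time. Here's a minimal sketch of what that does with my numbers - the helper name is mine, for illustration only:

    #include <cstdio>

    // Apparent processing rate implied by one completed task:
    // claimed work divided by measured elapsed time.
    // (Relationship as per the RuntimeEstimation wiki; name is illustrative.)
    static double apparent_flops(double rsc_fpops_est, double elapsed_secs) {
        return rsc_fpops_est / elapsed_secs;
    }

    int main() {
        const double FPOPS_EST = 5e15;  // the project's standard 5 PetaFpops estimate

        // A full-length task: ~10 hours on my GTX 670.
        printf("10-hour task -> %.0f GFLOPS (plausible)\n",
               apparent_flops(FPOPS_EST, 10 * 3600.0) / 1e9);

        // Shortened Beta tasks, same unchanged estimate.
        printf("100-min task -> %.2f TFLOPS\n", apparent_flops(FPOPS_EST, 100 * 60.0) / 1e12);
        printf("10-min task  -> %.1f TFLOPS\n", apparent_flops(FPOPS_EST, 10 * 60.0) / 1e12);
        printf("1-min task   -> %.1f TFLOPS (nonsense)\n", apparent_flops(FPOPS_EST, 60.0) / 1e12);
        return 0;
    }

That prints 139 GFLOPS for the genuine 10-hour task, and 0.83, 8.3 and 83.3 TFLOPS for the shortened ones - which is exactly the territory of the 65 TeraFlops app_version average quoted above.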
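The EXIT_TIME_LIMIT_EXCEEDED aborts follow directly from the same numbers. As I understand it, the client abandons a task once it has run for roughly rsc_fpops_bound / flops seconds, so an inflated flops figure shrinks the permitted wall-clock time. Another sketch - the 10x bound multiplier here is my assumption for illustration, not GPUGrid's actual setting:

    #include <cstdio>

    int main() {
        const double FPOPS_EST   = 5e15;             // standard task estimate
        const double FPOPS_BOUND = 10 * FPOPS_EST;   // assumed bound; projects set their own

        // The app_version average poisoned by short Beta tasks (from the
        // second fragment quoted above), versus the card's real rate.
        const double POISONED_FLOPS = 6.58e13;  // ~65.8 TFLOPS
        const double REAL_FLOPS     = 1.39e11;  // ~139 GFLOPS (10 h per task)

        double allowed = FPOPS_BOUND / POISONED_FLOPS;  // seconds before the abort
        double needed  = FPOPS_EST / REAL_FLOPS;        // seconds a full task takes

        printf("allowed: %.0f s (~%.0f minutes)\n", allowed, allowed / 60);  // ~760 s
        printf("needed:  %.0f s (~%.0f hours)\n", needed, needed / 3600);    // ~10 h
        return 0;
    }

A genuine 10-hour task gets killed after about 13 minutes. Multiply that by every new host attaching to the Beta app, and the message-board firestorm writes itself.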
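As for that closing suggestion: I don't have a patch, but the shape of the change feels small - decline to trust the runtime statistics while an app is flagged Beta, and stay on the pessimistic CPU-derived figure. A hypothetical sketch only; the function and field names below are illustrative, not the real scheduler code:

    #include <cstdio>

    // Hypothetical guard, sketched against the idea rather than the
    // actual BOINC sources; all names here are illustrative.
    double estimated_flops(bool app_is_beta, int completed_jobs,
                           double host_apr,          // host's own Average Processing Rate
                           double av_average_flops,  // app_version average across hosts
                           double cpu_derived_flops) // conservative initial estimate
    {
        if (app_is_beta) {
            // Beta apps: runtimes and rsc_fpops_est can't be trusted yet,
            // so ignore both the host APR and the app_version average.
            return cpu_derived_flops;
        }
        if (completed_jobs >= 10) {
            return host_apr;           // host has proved its own rate
        }
        return av_average_flops;       // fall back to the project-wide average
    }

    int main() {
        // A new Titan host on a Beta app stays at the safe CPU-derived rate:
        printf("%.0f GFLOPS\n", estimated_flops(true, 0, 0, 6.58e13, 2.9e11) / 1e9);
        return 0;
    }

Even just a prominent warning at deployment time - along the lines of the input sanity-checking I asked for at the top of this message - would have caught both of the cases I've described.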
