The same thing happens on non-Beta applications as well. It's a
combination of wide performance ranges and poor estimates of both the
actual FLOP count per result and GPU performance. The assumptions that
go into those FLOP calculations are uniformly bad to start with. In most
cases (on SETI@home) the problem goes away as the estimates become more
realistic over time, but I've seen runaway, diverging APR when outliers
weren't handled correctly. Some projects do essentially no result
checking, so short-circuited results are included in the estimates,
which makes the problem worse.

As an alternative, I would suggest that runtime estimation be based only
on the initial conservative estimates and the host_app_version averages,
rather than the app_version averages. Easier to implement would be
capping the FLOP/s estimates at a couple of times the peak rate
estimated from the GPU properties, as in the sketch below.
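
A minimal sketch of that cap (Python; the function name and the factor
of two are mine, purely illustrative - nothing like this exists in the
scheduler today):

    # Clamp any statistical FLOP/s estimate to a small multiple of the
    # device's advertised peak, so outliers can't run away.
    PEAK_MULTIPLE = 2.0  # "a couple times" the peak

    def capped_flops(projected_flops, gpu_peak_flops):
        """Limit an app_version or host average to PEAK_MULTIPLE * peak."""
        return min(projected_flops, PEAK_MULTIPLE * gpu_peak_flops)

    # A runaway 65.8 TFLOPS average on a card advertising ~2.9 TFLOPS
    # peak is clamped to ~5.8 TFLOPS instead of diverging:
    print(capped_flops(65.8e12, 2.915e12))  # -> 5.83e12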




On Tue, Sep 3, 2013 at 9:09 AM, Richard Haselgrove <
[email protected]> wrote:

> Would it be possible to 'harden' the BOINC server operations console,
> please? I'm thinking about adding prominent warnings against common - but
> far-reaching - errors, and perhaps even some sanity checking on inputs.
>
> Originally, BOINC was developed as a cheap, lightweight platform which
> would allow under-funded - even unfunded - scientific researchers to tap
> into the supercomputer-like power of what we now know as cloud or
> distributed computing. In those days, the concept of a single individual
> acting as 'Project Scientist' - defining the research, coding the science
> application, and administering the server - was feasible. For small,
> simple projects - perhaps with a single application, running on a limited
> subset of the major platforms - that might still be the case. But larger
> projects have needed to diversify, with application development no longer
> in the hands of server administration staff: and in some projects, server
> administration - which has developed into a specialism in its own right -
> might not even take place on the same continent as application development.
>
> I'm prompted to write by yet another example of a (potential) disaster in
> the making.
>
> It involves a Beta app. It always seems to involve a Beta app, and the
> BOINC server software supports the concept of declaring an app to be Beta
> on a live production server. This may be a mistake.
>
> An application developer in late Beta stages may want to iron the last few
> bugs out of a misbehaving application quickly. This can require quick
> incremental upgrades of the application itself (maybe he or she wants to
> add some extra debug output), some quick turnround of artificially shortened
> test workunits, and so on. The temptation is for the server administrator
> to hand over the keys to the application developer, and allow him or her to
> deploy Beta application versions by themselves - especially outside
> standard administration hours in the country where the server is located.
>
> Development and administration are separate and distinct skills, and
> should not be confused.
>
> The case which first started me thinking along these lines - now
> thankfully solved - was the project which deployed several versions of a
> CPU-only, multi-threaded application under an ATI-GPU plan class. [I think
> the original administrator had moved on to pastures new, the stand-in
> wasn't fully up to speed, and the developer didn't know how much that bit
> of administration mattered...]
>
> The current problems - and I apologise in advance, but I'm going to have
> to name names for the explanation to make any sense at all - involve
> GPUGrid.
>
> GPUGrid have massive computational needs, and as such have specialised in
> GPU processing, specifically NVidia cards. Their main application is
> described as "Long runs (8-12 hours on fastest card)" - and they really do
> mean hours, even on the best-performing Kepler class cards. That adds its
> own complexities when testing new applications, of course: and it also puts
> pressure on the project to support newer generations of hardware as soon as
> they become available in the shops: first Fermi, then Kepler, now Titan.
> That pressure is a two-way street - the extra computing power is useful for
> the project researchers, and volunteers are anxious to bring their fastest
> and newest GPUs to the party.
>
> All of which brings into play the scenario I described above: rapid
> deployment of Beta applications, a mixture of shortened and full-runtime
> testing, and deployment by an application developer not fully trained in
> the nuances of server administration. How well does the BOINC server
> software environment cope?
>
> Badly.
>
> The problem that I've seen most often - and repeated again here - is that
> application developers don't appreciate how important - how critical, even
> - <rsc_fpops_est> is to the proper operation of
> http://boinc.berkeley.edu/trac/wiki/RuntimeEstimation , and from there
> the proper operation of the closely-integrated
> http://boinc.berkeley.edu/trac/wiki/CreditNew . Mistakes in this area
> tend to grow into a firestorm on the project message boards...
>
> In this case, the project has settled on tasks with a standardised
> estimate of <rsc_fpops_est> 5000000000000000 - that's five PetaFpops, if
> I've put the commas in the right place. And those tasks run for about 10
> hours on my GTX 670.
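>
> Working backwards from those two numbers, the arithmetic behind the
> estimate is simple enough (my simplified reading of the
> RuntimeEstimation page - the real scheduler applies further scaling):
>
>     # Simplified model: estimated runtime = declared FLOPs / est. rate.
>     def estimated_runtime_s(rsc_fpops_est, flops):
>         return rsc_fpops_est / flops
>
>     # 5 PetaFpops in ~10 hours implies ~139 GFLOPS effective:
>     print(5e15 / 36000)                               # ~1.39e11 FLOP/s
>     print(estimated_runtime_s(5e15, 1.39e11) / 3600)  # ~10 hours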
>
> But in the current rapid Beta phase - working to get CUDA 5.5 running on
> Titan hardware - I've seen shortened tasks with runtimes of 100 minutes, 10
> minutes, even 1 minute: but all with the same 5 PetaFpops estimate. It's a
> simple mistake, a common mistake, an obvious mistake, and I don't blame an
> application developer under stress for making it.
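>
> Run that division backwards on the shortened tasks and you can see
> exactly what feeds into the statistics (illustrative arithmetic only):
>
>     # The same 5 PetaFpops estimate attached to shortened Beta tasks
>     # makes a host look absurdly fast:
>     print(5e15 / 6000)  # 100 minutes -> ~8.3e11 FLOP/s (~833 GFLOPS)
>     print(5e15 / 600)   # 10 minutes  -> ~8.3e12 FLOP/s (~8.3 TFLOPS)
>     print(5e15 / 60)    # 1 minute    -> ~8.3e13 FLOP/s (~83 TFLOPS)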
>
> But it's the sort of mistake BOINC reacts very badly to. GPU task runtime
> estimation relies - critically - on accurate determination of the speed of
> the hardware. My GTX 670 is given - in units of what I call "advertising
> FLOPS" - a speed rating of 2915 GFLOPS peak by the BOINC client (although I
> don't think that number is used by the server anywhere).
>
> Instead, the initial speed estimate for a new app_version on a new host is
> derived from the CPU speed:
>
>     <app_name>acemdbeta</app_name>
>     <version_num>809</version_num>
>     <platform>windows_intelx86</platform>
>     <avg_ncpus>0.350000</avg_ncpus>
>     <max_ncpus>0.666596</max_ncpus>
>     <flops>294520563362.637020</flops>
>     <plan_class>cuda55</plan_class>
>
> approaching 300 GigaFlops.
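>
> For scale: a full-length task at that initial rate would be estimated
> at about 4.7 hours - wrong against the real 10 hours, but only by a
> factor of two, which the usual safety margins should absorb:
>
>     # A full 5 PetaFpops task at the initial ~294.5 GFLOPS estimate:
>     print(5e15 / 294520563362.637 / 3600)  # ~4.7 hours estimated
>     # vs. ~10 hours real: an error of only ~2x.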
>
> But after the first 100 or so tasks have been returned - by, naturally,
> the fastest hosts with the fastest GPUs - the project has an app_version
> average which is applied to new hosts:
>
>     <app_name>acemdbeta</app_name>
>     <version_num>810</version_num>
>     <platform>windows_intelx86</platform>
>     <avg_ncpus>0.350000</avg_ncpus>
>     <max_ncpus>0.666596</max_ncpus>
>     <flops>65779930729483.820000</flops>
>     <plan_class>cuda55</plan_class>
>
> Over 65 TeraFlops.
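>
> Hand that average to a brand-new host and a genuine full-length task
> is estimated at barely a minute:
>
>     # A full 5 PetaFpops task at the inherited ~65.8 TFLOPS average:
>     print(5e15 / 65779930729483.82)  # ~76 seconds estimated
>     # vs. ~36,000 seconds real: an error of roughly 470x.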
>
> Only after each host has 'completed' (success and validation) 10 tasks is
> the host's own performance assessed and used. And in cases like this, it
> matters critically whether those first 10 tasks are of the 1-minute or the
> 10-hour variety. And if the first 10 tasks all take 10 hours, we're on to
> the next Beta app and the cycle starts all over again. Look at the history
> of 'APR' (Average Processing Rate) for my host, noting in particular the
> number of 'completed' tasks the average has been computed over each time.
> http://www.gpugrid.net/host_app_versions.php?hostid=132158 - that's the
> same GPU, unchanged since I built the machine.
>
> I'll leave you to read the project message boards to discover all of the
> consequences. Unhappy volunteers, and tasks aborted
> with EXIT_TIME_LIMIT_EXCEEDED, are two of the more obvious ones.
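>
> The aborts follow from the same arithmetic: as I understand it, the
> client enforces a wall-clock limit of roughly rsc_fpops_bound divided
> by the estimated rate. A sketch, assuming - purely for illustration -
> a bound set at 10x the estimate:
>
>     # Hypothetical: a project bound of 10x the fpops estimate.
>     rsc_fpops_bound = 10 * 5e15
>     print(rsc_fpops_bound / 65779930729483.82)  # ~760 seconds allowed
>     # A genuine ~36,000-second task hits EXIT_TIME_LIMIT_EXCEEDED
>     # long before it can finish.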
>
> Perhaps as a simple first step, the whole of RuntimeEstimation should be
> automatically disabled when an installed application is designated as Beta?