Right now it seems to happen mostly when a client is using the anonymous platform, so that's probably what we're seeing. Does this mean it needs to be fixed on the client end?
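For reference, here is a rough sketch of the check that produces the 'Maximum elapsed time exceeded' abort, as I read the debug message in the patch quoted below. It assumes the limit is simply rsc_fpops_bound divided by the flops figure the client holds for the app version; that relation is my assumption from the "(%.2fG/%.2fG)" part of the message, and the struct and function names are hypothetical, not taken from the BOINC source.

```cpp
// Sketch only: TaskLimits, max_elapsed_time() and exceeds_limit() are
// hypothetical names used for illustration, not BOINC identifiers.
struct TaskLimits {
    double rsc_fpops_bound;  // upper bound on floating-point operations for the workunit
    double flops;            // speed the client assumes for the app version, in ops/sec
};

// Assumed relation: elapsed-time limit in seconds = fpops bound / assumed flops.
constexpr double max_elapsed_time(TaskLimits t) {
    return t.rsc_fpops_bound / t.flops;
}

// Mirrors the comparison in the quoted patch: the task is aborted once
// its elapsed time is greater than the limit.
constexpr bool exceeds_limit(TaskLimits t, double elapsed_time) {
    return elapsed_time > max_elapsed_time(t);
}
```

With made-up numbers: a bound of 1e13 operations at an assumed 1e9 ops/sec gives a limit of 10000 seconds. If the bound were cut to 1e12 while the flops figure stayed the same, the limit would drop to 1000 seconds, and a task that had already run for 3000 seconds would be aborted.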

On May 2, 2011, at 9:56 AM, Richard Haselgrove wrote:

> Two things it would be worth checking on this front.
>
> 1) When new hosts get their first work, how far off 'realistic' are the
> runtime estimates?
> 2) Whenabouts in the host's lifecycle do the 'Maximum elapsed time exceeded'
> errors happen?
>
> We've been seeing a lot of -177 errors at SETI@home under CreditNew, round
> about and soon after each new host reaches the transition point after 10
> validated tasks. For SETI, task runtimes are routinely overestimated for new
> joiners with modern hardware - target DCF was set at 0.4 for Astropulse, and
> I think is even lower for MB tasks running on CPU. Then, for GPUs, the
> over-estimation is even more marked.
>
> The errors we see come - I think only with anonymous platform - when the
> <rsc_fpops_est> is reduced by the server after the 10th validation, but the
> client is still using DCF correction built up by the earlier (full-estimate)
> tasks. If you can persevere through the transition point, the errors go away
> again - but if you keep resetting the host application detail records, it'll
> keep coming back.
>
>
> ----- Original Message -----
> From: "Tom Ritter" <[email protected]>
> To: "Travis Desell" <[email protected]>
> Cc: <[email protected]>; <[email protected]>
> Sent: Monday, May 02, 2011 2:46 AM
> Subject: Re: [boinc_projects] clients not getting rsc_fpops_bound correctly?
>
>
>> I've run a small network of hosts for testing workunit-creation scripts, and
>> I've found that sometimes a host will get its fpops estimation way out of
>> whack under some circumstances (usually workunit mistakes or failures).
>>
>> It got to the point where I would add a bunch of debugging statements in
>> like these:
>>
>> Index: lib/hostinfo.cpp
>> ===================================================================
>> --- lib/hostinfo.cpp (revision 22824)
>> +++ lib/hostinfo.cpp (working copy)
>> @@ -77,6 +77,7 @@
>>              // fix foolishness that could result in negative value here
>>              //
>>              if (p_fpops < 0) p_fpops = -p_fpops;
>> +            printf("[>] Just set flops to %.2f in spot 5\n", p_fpops);
>>              continue;
>>          }
>>          else if (parse_double(buf, "<p_iops>", p_iops)) {
>> Index: client/app_control.cpp
>> ===================================================================
>> --- client/app_control.cpp (revision 22824)
>> +++ client/app_control.cpp (working copy)
>> @@ -624,11 +624,13 @@
>>          if (atp->task_state() != PROCESS_EXECUTING) continue;
>>          if (!atp->result->project->non_cpu_intensive && (atp->elapsed_time > atp->max_elapsed_time)) {
>>              msg_printf(atp->result->project, MSG_INFO,
>> -                "Aborting task %s: exceeded elapsed time limit %.2f (%.2fG/%.2fG)",
>> -                atp->result->name, atp->max_elapsed_time,
>> -                atp->result->wup->rsc_fpops_bound/1e9,
>> -                atp->result->avp->flops/1e9
>> -            );
>> +                "Aborting task %s: exceeded elapsed time limit %.2f > %.2f (%.2fG/%.2fG)",
>> +                atp->result->name,
>> +                atp->elapsed_time,
>> +                atp->max_elapsed_time,
>> +                atp->result->wup->rsc_fpops_bound/1e9,
>> +                atp->result->avp->flops/1e9
>> +            );
>>              atp->abort_task(ERR_RSC_LIMIT_EXCEEDED, "Maximum elapsed time exceeded");
>>              did_anything = true;
>>              continue;
>>
>> And wrote a script to parse it (because I kept forgetting what it meant):
>>
>> <?php
>>
>> $results = sscanf($line, "Aborting task %s exceeded elapsed time limit %f > %f (%fG/%fG)");
>>
>> $elapsed_time = $results[1];
>> $max_time = $results[2];
>> $resource_bound = $results[3];
>> $current_flops = $results[4];
>>
>> echo "This workunit was bound to run in $max_time seconds - but died after $elapsed_time seconds.\n";
>> echo "This fpops bound was {$resource_bound}G operations, and the client was operating at {$current_flops}G ops/sec\n";
>>
>>
>> To get a host back to 'normal' I'd detach, turn off boinc, delete it from the
>> database on the server, and remove all the files from the client (pretty
>> much everything in /var/lib/boinc/ like client_state.xml and so on).
>>
>> I'm sure parts of this are completely overkill.
>>
>> -tom
>> _______________________________________________
>> boinc_projects mailing list
>> [email protected]
>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_projects
>> To unsubscribe, visit the above URL and
>> (near bottom of page) enter your email address.
>

----------------------------------------------------------------------------------------------------------
Travis Desell <deselt @ cs.rpi.edu>
1-518-867-1054
Adjunct Professor & Postdoctoral Research Assistant
Rensselaer Polytechnic Institute, 110 8th Street, Troy NY 12180, USA
http://www.cs.rpi.edu/~deselt/
MilkyWay@Home ( http://milkyway.cs.rpi.edu/ )
DNA@Home ( http://dnahome.cs.rpi.edu/ )
Worldwide Computing Laboratory ( http://wcl.cs.rpi.edu/ )
----------------------------------------------------------------------------------------------------------
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
