Right now it seems to happen mostly when a client is using the anonymous
platform, so that's probably what we're seeing. Does that mean this needs to
be fixed on the client end?


On May 2, 2011, at 9:56 AM, Richard Haselgrove wrote:

> Two things it would be worth checking on this front.
> 
> 1) When new hosts get their first work, how far off 'realistic' are the 
> runtime estimates?
> 2) Whenabouts in the host's lifecycle do the 'Maximum elapsed time exceeded' 
> errors happen?
> 
> We've been seeing a lot of -177 errors at SETI@home under CreditNew, round 
> about and soon after each new host reaches the transition point after 10 
> validated tasks. For SETI, task runtimes are routinely overestimated for new 
> joiners with modern hardware - target DCF was set at 0.4 for Astropulse, and 
> I think is even lower for MB tasks running on CPU. Then, for GPUs, the 
> over-estimation is even more marked.
> 
> The errors we see come - I think only with anonymous platform - when the 
> <rsc_fpops_est> is reduced by the server after the 10th validation, but the 
> client is still using DCF correction built up by the earlier (full-estimate) 
> tasks. If you can persevere through the transition point, the errors go away 
> again - but if you keep resetting the host application detail records, it'll 
> keep coming back.
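
To make the transition concrete, here is a minimal sketch of the two calculations involved. It assumes the client derives its runtime estimate as rsc_fpops_est / flops * DCF and its abort limit as rsc_fpops_bound / flops, which is what the two figures in Tom's debug message further down suggest. The function names and every number in the note that follows are hypothetical, not taken from the BOINC source or from a real host.

```cpp
#include <cassert>

// Estimated runtime in seconds: the server's operation-count estimate divided
// by the speed the client believes the app runs at, scaled by the client's
// duration correction factor (DCF). Assumed formula, not copied from BOINC.
double estimated_runtime(double rsc_fpops_est, double flops, double dcf) {
    assert(flops > 0.0);
    return rsc_fpops_est / flops * dcf;
}

// Elapsed-time limit in seconds. Under this assumption DCF plays no part here.
double max_elapsed_time(double rsc_fpops_bound, double flops) {
    assert(flops > 0.0);
    return rsc_fpops_bound / flops;
}

// True if a task needing `actual_seconds` would be aborted with
// "Maximum elapsed time exceeded" under the assumed limit.
bool would_abort(double actual_seconds, double rsc_fpops_bound, double flops) {
    return actual_seconds > max_elapsed_time(rsc_fpops_bound, flops);
}
```

For example, with flops = 1e9 and a task that really takes 4000 s, a bound of 1e14 gives a limit of 100000 s and nothing is aborted. If the bound is then scaled down to 3e12 together with the estimate (an assumption; the message above only mentions <rsc_fpops_est>) while the client's flops figure stays the same, the limit drops to 3000 s and would_abort returns true for the same task. A DCF of 0.4 carried over from the earlier tasks only shrinks the estimate further and does nothing to the limit.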
> 
> 
> ----- Original Message ----- From: "Tom Ritter" <[email protected]>
> To: "Travis Desell" <[email protected]>
> Cc: <[email protected]>; <[email protected]>
> Sent: Monday, May 02, 2011 2:46 AM
> Subject: Re: [boinc_projects] clients not getting rsc_fpops_bound correctly?
> 
> 
>> I've run a small network of hosts for testing workunit-creation scripts, and
>> I've found that sometimes a host will get its fpops estimate way out of
>> whack under some circumstances (usually workunit mistakes or failures).
>> 
>> It got to the point where I would add a bunch of debugging statements
>> like these:
>> 
>> Index: lib/hostinfo.cpp
>> ===================================================================
>> --- lib/hostinfo.cpp    (revision 22824)
>> +++ lib/hostinfo.cpp    (working copy)
>> @@ -77,6 +77,7 @@
>>            // fix foolishness that could result in negative value here
>>            //
>>            if (p_fpops < 0) p_fpops = -p_fpops;
>> +           printf("[>] Just set flops to %.2f in spot 5\n", p_fpops);
>>            continue;
>>        }
>>        else if (parse_double(buf, "<p_iops>", p_iops)) {
>> Index: client/app_control.cpp
>> ===================================================================
>> --- client/app_control.cpp      (revision 22824)
>> +++ client/app_control.cpp      (working copy)
>> @@ -624,11 +624,13 @@
>>        if (atp->task_state() != PROCESS_EXECUTING) continue;
>>               if (!atp->result->project->non_cpu_intensive && (atp->elapsed_time > atp->max_elapsed_time)) {
>>                       msg_printf(atp->result->project, MSG_INFO,
>> -                               "Aborting task %s: exceeded elapsed time limit %.2f (%.2fG/%.2fG)",
>> -                               atp->result->name, atp->max_elapsed_time,
>> -                atp->result->wup->rsc_fpops_bound/1e9,
>> -                atp->result->avp->flops/1e9
>> -                       );
>> +                                  "Aborting task %s: exceeded elapsed time limit %.2f > %.2f (%.2fG/%.2fG)",
>> +                                  atp->result->name,
>> +                                  atp->elapsed_time,
>> +                                  atp->max_elapsed_time,
>> +                                  atp->result->wup->rsc_fpops_bound/1e9,
>> +                                  atp->result->avp->flops/1e9
>> +                                  );
>>                       atp->abort_task(ERR_RSC_LIMIT_EXCEEDED, "Maximum elapsed time exceeded");
>>                       did_anything = true;
>>                       continue;
>> 
>> And I wrote a script to parse the message (because I kept forgetting what
>> the numbers meant):
>> 
>> <?php
>> 
>> // $line holds one "Aborting task ..." message from the patched client.
>> // %s also swallows the ':' that follows the task name.
>> $results = sscanf($line, "Aborting task %s exceeded elapsed time limit %f > %f (%fG/%fG)");
>> 
>> $elapsed_time = $results[1];
>> $max_time = $results[2];
>> $resource_bound = $results[3];
>> $current_flops = $results[4];
>> 
>> echo "This workunit was bound to run in $max_time seconds - but died after $elapsed_time seconds.\n";
>> echo "The fpops bound was {$resource_bound}G operations, and the client was operating at {$current_flops}G ops/sec\n";
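
The same parsing can be sketched in C++ with sscanf, together with a quick consistency check: if the limit is simply the bound divided by flops (an assumption, though the patched message prints both figures), the two G values should reproduce the printed limit. The struct, the function names, and the sample line in the note after the code are made up for illustration.

```cpp
#include <cstdio>
#include <string>

// Fields of the patched "Aborting task" message (hypothetical struct).
struct AbortInfo {
    std::string task;
    double elapsed_time;  // seconds the task had run when it was aborted
    double max_time;      // elapsed-time limit in seconds
    double bound_gflop;   // rsc_fpops_bound / 1e9
    double gflops;        // flops / 1e9
};

// Parses one message; returns false if it does not match the patched format.
bool parse_abort_line(const char* line, AbortInfo& out) {
    char name[256];
    // %255s swallows the task name together with its trailing ':'.
    int n = std::sscanf(line,
        "Aborting task %255s exceeded elapsed time limit %lf > %lf (%lfG/%lfG)",
        name, &out.elapsed_time, &out.max_time, &out.bound_gflop, &out.gflops);
    if (n != 5) return false;
    out.task = name;
    if (!out.task.empty() && out.task.back() == ':') out.task.pop_back();
    return true;
}

// Limit implied by the two G figures, in seconds (the 1e9 factors cancel).
double implied_limit(const AbortInfo& info) {
    return info.bound_gflop / info.gflops;
}
```

Given a made-up message such as "Aborting task wu_1_0: exceeded elapsed time limit 4000.00 > 3000.00 (3000.00G/1.00G)", parse_abort_line fills in the five fields and implied_limit returns 3000, which matches the printed limit. A large mismatch between the two would point at the flops figure or the bound having changed after the task started.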
>> 
>> 
>> To get a host back 'normal' I'd detach, turn off boinc, delete it from the
>> database on the server, and remove all the files from the client (pretty
>> much everything in /var/lib/boinc/ like client_state.xml and so on).
>> 
>> I'm sure parts of this are completely overkill.
>> 
>> -tom
>> _______________________________________________
>> boinc_projects mailing list
>> [email protected]
>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_projects
>> To unsubscribe, visit the above URL and
>> (near bottom of page) enter your email address.
> 
> 

----------------------------------------------------------------------------------------------------------
Travis Desell                          <deselt @ cs.rpi.edu>                    
  1-518-867-1054
Adjunct Professor & Postdoctoral Research Assistant
Rensselaer Polytechnic Institute, 110 8th Street, Troy NY 12180, USA
http://www.cs.rpi.edu/~deselt/
MilkyWay@Home ( http://milkyway.cs.rpi.edu/ )
DNA@Home ( http://dnahome.cs.rpi.edu/ )
Worldwide Computing Laboratory ( http://wcl.cs.rpi.edu/ )
----------------------------------------------------------------------------------------------------------

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
