I am not sure how many cases there are... :) But I would like to mention a couple of other points. On MW there is a machine that has somehow managed, in spite of the task limiters on the server, to accumulate over 3,000 tasks, of which 2,000+ are awaiting timeout, which really screws things up ... a better generalized BOINC methodology for providing controls would possibly serve MW better if we can figure things out ... especially in that the project limits tasks based on the number of CPU cores, regardless of where and how the tasks are actually processed ... so a 5870 in a quad has a limit of 24 tasks, while my dual 5870s have an equally paltry 48 tasks because they are in an i7 system ... not because there are two GPUs, as would be more appropriate.
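The distinction I mean could be sketched roughly as follows. This is not BOINC's actual scheduler code; the function and the per-resource limits are purely hypothetical, just to show a cap computed per resource instance rather than per CPU core:

```python
# Hypothetical sketch: cap in-progress tasks per resource instance
# (CPU cores AND GPUs), instead of per CPU core alone.
def task_limit(ncpus, ngpus, per_cpu_limit=6, per_gpu_limit=24):
    """Return a host's in-progress task cap, counting each GPU separately."""
    return ncpus * per_cpu_limit + ngpus * per_gpu_limit

# One 5870 in a quad-core:          4*6 + 1*24 = 48 tasks
# Dual 5870s in an i7 (8 threads):  8*6 + 2*24 = 96 tasks
```

Under a scheme like this, the second GPU actually raises the limit, instead of the limit depending only on how many CPU threads happen to sit in the same box.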
Second, I know I made a long-ago generalized comment that we do not track enough about failures ... this morning Distributed Data Mining demoed a tool that shows project-wide, per-application error rates so that Nico, the project manager, can tell if the applications are running better or worse ... I have, of course, encouraged him to submit the developed script ... you can see the tool here: http://ddm.nicoschlitter.de/DistributedDataMining/forum_thread.php?id=87

I guess my point is that this is the kind of question I would think the projects should be monitoring ... but we also need something other than the occasional messages in the message log, or the injunction to search all projects every day to see if you are turning in bad tasks.

Another one of my long-ago suggestions was better logging of completions and errors ... BoincView had a nice log that showed all reported tasks, and from that you could see a history trend were you to peruse it ... I know BOINC is creating (per project) logs of some similar data in statistics_<project URL> and job_log_<project URL>, where this type of data is saved ... but we don't really do anything with the data (well, the statistics files drive the graphs) ... like a status page:

  In the last 24 hours you have:
    Returned x successfully completed and y unsuccessfully completed results for project A
  Yesterday you: ...
  Last week you: ...
  Last month you: ...
  Summary:
    Error rate for project A is x and is falling
    Error rate for project B is y and is rising

or something like that ... I know I am on the outer edge in that group of 3,000 or so that runs large numbers of projects, but that only means that we see some of the boundary issues that may or may not really bother those at the other boundary limit who run only one project ... and that may only start to surface for the middle group that runs 5-10 projects ...
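The summary logic itself is simple; a minimal sketch follows. Note the three-column input format (unix_time, project, exit_status) is an assumption for illustration only — BOINC's actual job_log files use a different field layout, so this shows the idea, not a working parser for those files:

```python
# Sketch of the proposed status summary: per-project success/error counts
# over a time window. Input lines are assumed to be "unix_time project
# exit_status" -- NOT the real job_log layout.
import time
from collections import defaultdict

def summarize(lines, now=None, window=24 * 3600):
    now = now or time.time()
    counts = defaultdict(lambda: [0, 0])  # project -> [ok, err]
    for line in lines:
        ts, project, status = line.split()
        if now - float(ts) <= window:
            counts[project][0 if int(status) == 0 else 1] += 1
    for project, (ok, err) in sorted(counts.items()):
        rate = err / (ok + err)
        print(f"Last 24h, {project}: {ok} ok, {err} errors "
              f"(error rate {rate:.0%})")
    return dict(counts)
```

Run daily, with yesterday's rates kept around for comparison, the same counts would also give the rising/falling trend line.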
Heck, I had a posting from a user who runs only one CPU and one GPU project per machine because of the problems he has seen with BOINC when he tries to run more than that ... Anyway, this is part and parcel of those issues that have been raised in the past under the generalized rubric "Ease of Use" ... it is just too darn hard to know for sure if BOINC is working as it should unless you do some intense study ... and even I, with much time on my hands, can't always manage to capture an issue in a timely manner ...

On May 25, 2010, at 1:20 PM, Richard Haselgrove wrote:

> There are indeed two issues here, but I'd categorise them differently.
>
> The SETI -9 tasks are really difficult, because the insane science
> application produces outputs which appear to the BOINC Client to be
> plausible. It's only much, much further down the line that the failure to
> validate exposes the error. I think SETI may have to wrestle with this one
> on its own.
>
> But I think Maureen is talking about other projects, where the problem is
> indeed one of crashes and abnormal (non-zero status) exits which the BOINC
> Client *does* interpret as a computation error.
>
> Part of the trouble here is that the BOINC message log (without special
> logging flags) tends only to mention 'Output file absent'. It takes quite
> sophisticated knowledge of BOINC to understand that 'Output file absent'
> almost invariably means that the science application had previously crashed.
> I've written that repeatedly on project message boards (including BOINC's
> own message board): I could have sworn I'd also written about it quite
> recently on one of these BOINC mailing lists, but none of my search tools is
> turning up the reference.
>
> I think it would help if the BOINC client message log could actually use the
> word 'error', as in
>
> "Computation for task xxxxxxxxxxxxxxx finished with an error"
> "Exit status 3 (0x3) for task xxxxxxxxxxxxxxx"
>
> BoincView can log them - why not BOINC itself?
>
> ----- Original Message -----
> From: "Lynn W. Taylor" <[email protected]>
> To: <[email protected]>
> Sent: Tuesday, May 25, 2010 8:24 PM
> Subject: Re: [boinc_dev] host punishment mechanism revisited
>
>> There are two big issues here.
>>
>> First, we aren't really talking about work that has "crashed" -- we may
>> be able to tell that the work unit finished early, but work like the
>> SETI -9 results didn't "crash" -- they just had a lot of signals.
>>
>> Run time isn't necessarily an indicator of quality.
>>
>> What we're talking about is work that does not validate -- where the
>> result is compared to work done on another machine, and the results
>> don't match.
>>
>> Notice has to get from the validator, travel by some means to the
>> eyeballs of someone who cares about that machine, and register with
>> their Mark I mod 0 brain.
>>
>> The two issues are:
>>
>> What if the back-channel to the user is not available (old e-mail
>> address, or not running BOINCMGR)?
>>
>> What if the user is (obscure reference to DNA, since this is Towel Day)
>> missing, presumed fed?
>>
>> There is probably an argument that BOINC should shut down gracefully if
>> the machine owner doesn't verify his continued existence periodically.
>>
>> On 5/25/2010 12:07 PM, Maureen Vilar wrote:
>>
>>> When computation finishes prematurely would it be possible to add to the
>>> messages something like: 'This task crashed'? And even 'Ask for advice on
>>> the project forum'?
>> _______________________________________________
>> boinc_dev mailing list
>> [email protected]
>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> To unsubscribe, visit the above URL and
>> (near bottom of page) enter your email address.
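For what it's worth, the explicit error lines Richard suggests are trivial to format; a minimal sketch (the task name and exit status here are illustrative, and this is not BOINC's actual logging code):

```python
def error_log_lines(task_name, exit_status):
    """Format the two explicit error lines suggested above, with the
    exit status shown in both decimal and hex."""
    return [
        f"Computation for task {task_name} finished with an error",
        f"Exit status {exit_status} ({exit_status:#x}) for task {task_name}",
    ]

for line in error_log_lines("wu_12345_0", 3):
    print(line)
```

So the cost of saying 'error' plainly in the message log would be a formatting change, not new plumbing.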
