While I was trying to fall asleep I reviewed this post and realized I had not made one of the critical points...
If, on the server side, the error rates of the various applications are
tracked (as I note Nico is doing), then each individual host's rate can be
checked against the average, and a threshold could then be used to determine
whether the host should be restricted... however, the other main point is
still valid: there needs to be a better place or mechanism to notify the
participant that there may be an issue...

On May 25, 2010, at 8:43 PM, Paul D. Buck wrote:

> I am not sure how many cases there are... :)
>
> But I would like to mention a couple of other points. On MW there is a
> machine that has somehow managed, in spite of the task limiters in the
> server, to accumulate over 3,000 tasks, of which 2,000+ are awaiting
> timeout, which really screws things up... a better generalized BOINC
> methodology for providing controls would possibly serve MW better if we
> can figure things out... especially in that the project limits tasks
> based on the number of CPU cores, regardless of where and how the tasks
> are actually processed... so a 5870 in a quad has a limit of 24 tasks,
> while my dual 5870s have an equally paltry 48 tasks because they are in
> an i7 system... not because there are two GPUs, as would be more
> appropriate.
>
> Second, I know I made a generalized comment long ago that we do not
> track enough about failures... this morning Distributed Data Mining
> demoed a tool that shows project-wide, by-application error rates so
> that Nico the project manager can tell if the applications are running
> better or worse... I have, of course, encouraged him to submit the
> script he developed... you can see the tool here:
> http://ddm.nicoschlitter.de/DistributedDataMining/forum_thread.php?id=87
>
> I guess my point is that this is the kind of question I would think the
> projects should be monitoring...
> but we also need something other than the occasional messages in the
> message log, or the injunction to search all projects every day to see
> whether you are turning in bad tasks...
>
> Another of my long-ago suggestions was better logging of completions
> and errors... BOINC View had a nice log that showed all reported tasks,
> and from that you could see a history trend were you to peruse it... I
> know BOINC is creating (by project) logs of some similar data in
> "Statistics.<project URL>" and "job_log.<project URL>", where this type
> of data is saved... but we don't really do anything with the data
> (well, Statistics drives the graphs)... like a status page...
>
> In the last 24 hours you have:
> Returned x successfully completed and y unsuccessfully completed
> results for project A
>
> Yesterday you:
> ...
>
> Last week you:
> ...
>
> Last month you:
> ...
>
> Summary:
> Error rate for project A is x and is falling
> Error rate for project B is y and is rising
>
> or something like that...
>
> I know I am on the outer edge in that group of 3,000 or so that runs
> large numbers of projects, but that only means that we see some of the
> boundary issues that may or may not really bother those at the other
> boundary limit who run only one project... and they may only start to
> surface for the middle group that runs 5-10 projects... heck, I had a
> posting from a user who runs only one CPU and one GPU project per
> machine because of the problems that person has seen with BOINC when he
> tries to run more than that...
>
> Anyway, this is part and parcel of those issues that have been raised
> in the past under the generalized rubric "Ease of Use"... it is just
> too darn hard to know for sure if BOINC is working as it should unless
> you do some intense study... and even I, with much time on my hands,
> cannot always manage to catch an issue in a timely manner...
>
> On May 25, 2010, at 1:20 PM, Richard Haselgrove wrote:
>
>> There are indeed two issues here, but I'd categorise them differently.
>>
>> The SETI -9 tasks are really difficult, because the insane science
>> application produces outputs which appear to the BOINC Client to be
>> plausible. It's only much, much further down the line that the failure
>> to validate exposes the error. I think SETI may have to wrestle with
>> this one on its own.
>>
>> But I think Maureen is talking about other projects, where the problem
>> is indeed one of crashes and abnormal (non-zero status) exits which
>> the BOINC Client *does* interpret as a computation error.
>>
>> Part of the trouble here is that the BOINC message log (without
>> special logging flags) tends only to mention 'Output file absent'. It
>> takes quite sophisticated knowledge of BOINC to understand that
>> 'Output file absent' almost invariably means that the science
>> application had previously crashed. I've written that repeatedly on
>> project message boards (including BOINC's own message board): I could
>> have sworn I'd also written about it quite recently on one of these
>> BOINC mailing lists, but none of my search tools is turning up the
>> reference.
>>
>> I think it would help if the BOINC client message log could actually
>> use the word 'error', as in
>>
>> "Computation for task xxxxxxxxxxxxxxx finished with an error"
>> "Exit status 3 (0x3) for task xxxxxxxxxxxxxxx"
>>
>> BoincView can log them - why not BOINC itself?
>>
>> ----- Original Message -----
>> From: "Lynn W. Taylor" <[email protected]>
>> To: <[email protected]>
>> Sent: Tuesday, May 25, 2010 8:24 PM
>> Subject: Re: [boinc_dev] host punishment mechanism revisited
>>
>>> There are two big issues here.
>>>
>>> First, we aren't really talking about work that has "crashed" -- we
>>> may be able to tell that the work unit finished early, but work like
>>> the SETI -9 results didn't "crash"; they just had a lot of signals.
>>>
>>> Run time isn't necessarily an indicator of quality.
>>>
>>> What we're talking about is work that does not validate -- where the
>>> result is compared to work done on another machine, and the results
>>> don't match.
>>>
>>> Notice has to get from the validator, travel by some means to the
>>> eyeballs of someone who cares about that machine, and register with
>>> their Mark I mod 0 brain.
>>>
>>> The two issues are:
>>>
>>> What if the back-channel to the user is not available (old e-mail
>>> address, or not running BOINCMGR)?
>>>
>>> What if the user is (obscure reference to DNA, since this is Towel
>>> Day) missing, presumed fed?
>>>
>>> There is probably an argument that BOINC should shut down gracefully
>>> if the machine owner doesn't verify his continued existence
>>> periodically.
>>>
>>> On 5/25/2010 12:07 PM, Maureen Vilar wrote:
>>>
>>>> When computation finishes prematurely, would it be possible to add
>>>> to the messages something like: 'This task crashed'? And even 'Ask
>>>> for advice on the project forum'?
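The server-side check described at the top of this message could be sketched
roughly as follows. This is only an illustrative sketch, not anything in the
BOINC codebase: the function name, the tuple layout, and the `ratio` and
`min_results` parameters are all invented here, and a real implementation
would work against the project's result table rather than an in-memory list.

```python
# Hypothetical sketch: compare each host's per-application error rate
# against the application-wide average, and flag hosts that exceed a
# threshold so the scheduler could restrict them.

from collections import defaultdict

def flag_suspect_hosts(results, ratio=3.0, min_results=20):
    """results: iterable of (host_id, app_id, ok) tuples, one per
    returned result; ok is True when the result validated."""
    per_host = defaultdict(lambda: [0, 0])  # (host, app) -> [errors, total]
    per_app = defaultdict(lambda: [0, 0])   # app -> [errors, total]
    for host_id, app_id, ok in results:
        per_host[(host_id, app_id)][1] += 1
        per_app[app_id][1] += 1
        if not ok:
            per_host[(host_id, app_id)][0] += 1
            per_app[app_id][0] += 1

    suspects = []
    for (host_id, app_id), (errs, total) in per_host.items():
        if total < min_results:
            continue  # too few results to judge this host fairly
        app_errs, app_total = per_app[app_id]
        app_rate = app_errs / app_total
        host_rate = errs / total
        # flag hosts well above the project-wide rate for this application;
        # the 0.1 floor avoids flagging hosts when the app rate is near zero
        if host_rate > max(app_rate * ratio, 0.1):
            suspects.append((host_id, app_id, host_rate, app_rate))
    return suspects
```

The minimum-sample guard matters: a host with one failed result out of two
should not be restricted on that evidence alone.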
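The per-project status page mocked up in Paul's message could be sketched
like this. Everything here is an assumption for illustration: it presumes
some helper can supply one record per reported task as
(unix_time, project, ok), and deliberately leaves open how such records
would be extracted from the client's per-project job_log files, since that
format has varied between client versions.

```python
# Hypothetical sketch of the "status page" summary: per-project counts for
# the last 24 hours, plus a rising/falling error-rate trend computed by
# comparing the last 24 hours against the preceding week.

import time

DAY = 86400

def summarize(records, now=None):
    """records: iterable of (unix_time, project, ok) tuples."""
    now = time.time() if now is None else now
    windows = {"last 24 hours": (now - DAY, now),
               "previous week": (now - 8 * DAY, now - DAY)}
    stats = {}  # project -> window name -> [ok_count, err_count]
    for t, project, ok in records:
        for name, (lo, hi) in windows.items():
            if lo <= t < hi:
                s = stats.setdefault(project, {w: [0, 0] for w in windows})
                s[name][0 if ok else 1] += 1

    lines = []
    for project, s in sorted(stats.items()):
        ok24, err24 = s["last 24 hours"]
        okw, errw = s["previous week"]
        lines.append("In the last 24 hours you returned %d successful and "
                     "%d unsuccessful results for %s" % (ok24, err24, project))
        rate24 = err24 / max(ok24 + err24, 1)
        ratew = errw / max(okw + errw, 1)
        trend = "rising" if rate24 > ratew else "falling or steady"
        lines.append("Error rate for %s is %.1f%% and is %s"
                     % (project, 100 * rate24, trend))
    return lines
```

A summary like this would surface a failing host without requiring the user
to trawl every project's results page by hand.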
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
