While I was trying to fall asleep I reviewed this post and realized I had not made one of the critical points...
If, on the server side, the error rates of the various applications are
tracked (as I note Nico is doing), then each individual host's rate can be
checked against the average, and a threshold could then be used to determine
whether the host should be restricted... however, the other main point is
still valid: there needs to be a better place or mechanism to notify the
participant that there may be an issue...

On May 25, 2010, at 8:43 PM, Paul D. Buck wrote:

> I am not sure how many cases there are... :)
>
> But I would like to mention a couple of other points. On MW there is a
> machine that has somehow managed, in spite of the task limiters in the
> server, to accumulate over 3,000 tasks, of which 2,000+ are awaiting
> timeout, which really screws things up... a better generalized BOINC
> methodology for providing controls would possibly serve MW better if we
> can figure things out... especially in that the project limits tasks
> based on the number of CPU cores, regardless of where and how the tasks
> are actually processed... so a 5870 in a quad has a limit of 24 tasks,
> while my dual 5870s have an equally paltry 48 tasks because they are in
> an i7 system... not because there are two GPUs, as would be more
> appropriate.
>
> Second, I know I made a generalized comment long ago that we do not
> track enough about failures... this morning Distributed Data Mining
> demoed a tool that shows project-wide, by-application error rates so
> that Nico the project manager can tell if the applications are running
> better or worse... I have, of course, encouraged him to submit the
> script he developed... you can see the tool here:
> http://ddm.nicoschlitter.de/DistributedDataMining/forum_thread.php?id=87
>
> I guess my point is that this is the kind of question I would think the
> projects should be monitoring...
> but we also need something other than the occasional messages in the
> message log, or the injunction to search all projects every day to see
> whether you are turning in bad tasks...
>
> Another of my long-ago suggestions was better logging of completions
> and errors... BOINC View had a nice log that showed all reported tasks,
> and from that you could see a history trend were you to peruse it... I
> know BOINC is creating (by project) logs of some similar data in
> "Statistics.<project URL>" and "job_log.<project URL>", where this type
> of data is saved... but we don't really do anything with the data
> (well, Statistics drives the graphs)... like a status page...
>
> In the last 24 hours you have:
> Returned x successfully completed and y unsuccessfully completed
> results for project A
>
> Yesterday you:
> ...
>
> Last week you:
> ...
>
> Last month you:
> ...
>
> Summary:
> Error rate for project A is x and is falling
> Error rate for project B is y and is rising
>
> or something like that...
>
> I know I am on the outer edge in that group of 3,000 or so that runs
> large numbers of projects, but that only means that we see some of the
> boundary issues that may or may not really bother those at the other
> boundary limit who run only one project... and they may only start to
> surface for the middle group that runs 5-10 projects... heck, I had a
> posting from a user who runs only one CPU and one GPU project per
> machine because of the problems that person has seen with BOINC when he
> tries to run more than that...
>
> Anyway, this is part and parcel of those issues that have been raised
> in the past under the generalized rubric "Ease of Use"... it is just
> too darn hard to know for sure if BOINC is working as it should unless
> you do some intense study... and even I, with much time on my hands,
> cannot always manage to catch an issue in a timely manner...
>
> On May 25, 2010, at 1:20 PM, Richard Haselgrove wrote:
>
>> There are indeed two issues here, but I'd categorise them differently.
>>
>> The SETI -9 tasks are really difficult, because the insane science
>> application produces outputs which appear to the BOINC Client to be
>> plausible. It's only much, much further down the line that the failure
>> to validate exposes the error. I think SETI may have to wrestle with
>> this one on its own.
>>
>> But I think Maureen is talking about other projects, where the problem
>> is indeed one of crashes and abnormal (non-zero status) exits which
>> the BOINC Client *does* interpret as a computation error.
>>
>> Part of the trouble here is that the BOINC message log (without
>> special logging flags) tends only to mention 'Output file absent'. It
>> takes quite sophisticated knowledge of BOINC to understand that
>> 'Output file absent' almost invariably means that the science
>> application had previously crashed. I've written that repeatedly on
>> project message boards (including BOINC's own message board): I could
>> have sworn I'd also written about it quite recently on one of these
>> BOINC mailing lists, but none of my search tools is turning up the
>> reference.
>>
>> I think it would help if the BOINC client message log could actually
>> use the word 'error', as in
>>
>> "Computation for task xxxxxxxxxxxxxxx finished with an error"
>> "Exit status 3 (0x3) for task xxxxxxxxxxxxxxx"
>>
>> BoincView can log them - why not BOINC itself?
>>
>> ----- Original Message -----
>> From: "Lynn W. Taylor" <[email protected]>
>> To: <[email protected]>
>> Sent: Tuesday, May 25, 2010 8:24 PM
>> Subject: Re: [boinc_dev] host punishment mechanism revisited
>>
>>> There are two big issues here.
>>>
>>> First, we aren't really talking about work that has "crashed" -- we
>>> may be able to tell that the work unit finished early, but work like
>>> the SETI -9 results didn't "crash"; they just had a lot of signals.
>>>
>>> Run time isn't necessarily an indicator of quality.
>>>
>>> What we're talking about is work that does not validate -- where the
>>> result is compared to work done on another machine, and the results
>>> don't match.
>>>
>>> Notice has to get from the validator, travel by some means to the
>>> eyeballs of someone who cares about that machine, and register with
>>> their Mark I mod 0 brain.
>>>
>>> The two issues are:
>>>
>>> What if the back-channel to the user is not available (old e-mail
>>> address, or not running BOINCMGR)?
>>>
>>> What if the user is (obscure reference to DNA, since this is Towel
>>> Day) missing, presumed fed?
>>>
>>> There is probably an argument that BOINC should shut down gracefully
>>> if the machine owner doesn't verify his continued existence
>>> periodically.
>>>
>>> On 5/25/2010 12:07 PM, Maureen Vilar wrote:
>>>
>>>> When computation finishes prematurely, would it be possible to add
>>>> to the messages something like: 'This task crashed'? And even 'Ask
>>>> for advice on the project forum'?
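The server-side check described at the top of this message could be sketched
roughly as follows. This is only an illustrative sketch, not anything in the
BOINC codebase: the function name, the tuple layout, and the `ratio` and
`min_results` parameters are all invented here, and a real implementation
would work against the project's result table rather than an in-memory list.

```python
# Hypothetical sketch: compare each host's per-application error rate
# against the application-wide average, and flag hosts that exceed a
# threshold so the scheduler could restrict them.

from collections import defaultdict

def flag_suspect_hosts(results, ratio=3.0, min_results=20):
    """results: iterable of (host_id, app_id, ok) tuples, one per
    returned result; ok is True when the result validated."""
    per_host = defaultdict(lambda: [0, 0])  # (host, app) -> [errors, total]
    per_app = defaultdict(lambda: [0, 0])   # app -> [errors, total]
    for host_id, app_id, ok in results:
        per_host[(host_id, app_id)][1] += 1
        per_app[app_id][1] += 1
        if not ok:
            per_host[(host_id, app_id)][0] += 1
            per_app[app_id][0] += 1

    suspects = []
    for (host_id, app_id), (errs, total) in per_host.items():
        if total < min_results:
            continue  # too few results to judge this host fairly
        app_errs, app_total = per_app[app_id]
        app_rate = app_errs / app_total
        host_rate = errs / total
        # flag hosts well above the project-wide rate for this application;
        # the 0.1 floor avoids flagging hosts when the app rate is near zero
        if host_rate > max(app_rate * ratio, 0.1):
            suspects.append((host_id, app_id, host_rate, app_rate))
    return suspects
```

The minimum-sample guard matters: a host with one failed result out of two
should not be restricted on that evidence alone.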
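The per-project status page mocked up in Paul's message could be sketched
like this. Everything here is an assumption for illustration: it presumes
some helper can supply one record per reported task as
(unix_time, project, ok), and deliberately leaves open how such records
would be extracted from the client's per-project job_log files, since that
format has varied between client versions.

```python
# Hypothetical sketch of the "status page" summary: per-project counts for
# the last 24 hours, plus a rising/falling error-rate trend computed by
# comparing the last 24 hours against the preceding week.

import time

DAY = 86400

def summarize(records, now=None):
    """records: iterable of (unix_time, project, ok) tuples."""
    now = time.time() if now is None else now
    windows = {"last 24 hours": (now - DAY, now),
               "previous week": (now - 8 * DAY, now - DAY)}
    stats = {}  # project -> window name -> [ok_count, err_count]
    for t, project, ok in records:
        for name, (lo, hi) in windows.items():
            if lo <= t < hi:
                s = stats.setdefault(project, {w: [0, 0] for w in windows})
                s[name][0 if ok else 1] += 1

    lines = []
    for project, s in sorted(stats.items()):
        ok24, err24 = s["last 24 hours"]
        okw, errw = s["previous week"]
        lines.append("In the last 24 hours you returned %d successful and "
                     "%d unsuccessful results for %s" % (ok24, err24, project))
        rate24 = err24 / max(ok24 + err24, 1)
        ratew = errw / max(okw + errw, 1)
        trend = "rising" if rate24 > ratew else "falling or steady"
        lines.append("Error rate for %s is %.1f%% and is %s"
                     % (project, 100 * rate24, trend))
    return lines
```

A summary like this would surface a failing host without requiring the user
to trawl every project's results page by hand.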
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
