I am not sure how many cases there are... :) But I would like to mention a couple of other points. On MW there is a machine that has somehow managed, in spite of the task limiters on the server, to accumulate over 3,000 tasks, of which 2,000+ are awaiting timeout, which really screws things up ... a better generalized BOINC methodology for providing controls would possibly serve MW better if we can figure things out ... especially in that the project limits tasks based on the number of CPU cores, regardless of where and how the tasks are actually processed ... so a 5870 in a quad has a limit of 24 tasks, while my dual 5870s have an equally paltry 48 tasks because they are in an i7 system ... not because there are two GPUs, as would be more appropriate.
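The distinction I mean could be sketched roughly as follows. This is not BOINC's actual scheduler code; the function and the per-resource limits are purely hypothetical, just to show a cap computed per resource instance rather than per CPU core:

```python
# Hypothetical sketch: cap in-progress tasks per resource instance
# (CPU cores AND GPUs), instead of per CPU core alone.
def task_limit(ncpus, ngpus, per_cpu_limit=6, per_gpu_limit=24):
    """Return a host's in-progress task cap, counting each GPU separately."""
    return ncpus * per_cpu_limit + ngpus * per_gpu_limit

# One 5870 in a quad-core:          4*6 + 1*24 = 48 tasks
# Dual 5870s in an i7 (8 threads):  8*6 + 2*24 = 96 tasks
```

Under a scheme like this, the second GPU actually raises the limit, instead of the limit depending only on how many CPU threads happen to sit in the same box.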
Second, I know I made a long-ago generalized comment that we do not track enough about failures ... this morning Distributed Data Mining demoed a tool that shows project-wide, per-application error rates so that Nico, the project manager, can tell if the applications are running better or worse ... I have, of course, encouraged him to submit the developed script ... you can see the tool here: http://ddm.nicoschlitter.de/DistributedDataMining/forum_thread.php?id=87

I guess my point is that this is the kind of question I would think the projects should be monitoring ... but we also need something other than the occasional messages in the message log, or the injunction to search all projects every day to see if you are turning in bad tasks.

Another one of my long-ago suggestions was better logging of completions and errors ... BoincView had a nice log that showed all reported tasks, and from that you could see a history trend were you to peruse it ... I know BOINC is creating (per project) logs of some similar data in statistics_<project URL> and job_log_<project URL>, where this type of data is saved ... but we don't really do anything with the data (well, the statistics files drive the graphs) ... like a status page:

  In the last 24 hours you have:
    Returned x successfully completed and y unsuccessfully completed results for project A
  Yesterday you: ...
  Last week you: ...
  Last month you: ...
  Summary:
    Error rate for project A is x and is falling
    Error rate for project B is y and is rising

or something like that ... I know I am on the outer edge in that group of 3,000 or so that runs large numbers of projects, but that only means that we see some of the boundary issues that may or may not really bother those at the other boundary limit who run only one project ... and that may only start to surface for the middle group that runs 5-10 projects ...
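The summary logic itself is simple; a minimal sketch follows. Note the three-column input format (unix_time, project, exit_status) is an assumption for illustration only — BOINC's actual job_log files use a different field layout, so this shows the idea, not a working parser for those files:

```python
# Sketch of the proposed status summary: per-project success/error counts
# over a time window. Input lines are assumed to be "unix_time project
# exit_status" -- NOT the real job_log layout.
import time
from collections import defaultdict

def summarize(lines, now=None, window=24 * 3600):
    now = now or time.time()
    counts = defaultdict(lambda: [0, 0])  # project -> [ok, err]
    for line in lines:
        ts, project, status = line.split()
        if now - float(ts) <= window:
            counts[project][0 if int(status) == 0 else 1] += 1
    for project, (ok, err) in sorted(counts.items()):
        rate = err / (ok + err)
        print(f"Last 24h, {project}: {ok} ok, {err} errors "
              f"(error rate {rate:.0%})")
    return dict(counts)
```

Run daily, with yesterday's rates kept around for comparison, the same counts would also give the rising/falling trend line.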
Heck, I had a posting from a user who runs only one CPU and one GPU project per machine because of the problems he has seen with BOINC when he tries to run more than that ... Anyway, this is part and parcel of those issues that have been raised in the past under the generalized rubric "Ease of Use" ... it is just too darn hard to know for sure if BOINC is working as it should unless you do some intense study ... and even I, with much time on my hands, can't always manage to capture an issue in a timely manner ...

On May 25, 2010, at 1:20 PM, Richard Haselgrove wrote:

> There are indeed two issues here, but I'd categorise them differently.
>
> The SETI -9 tasks are really difficult, because the insane science
> application produces outputs which appear to the BOINC Client to be
> plausible. It's only much, much further down the line that the failure to
> validate exposes the error. I think SETI may have to wrestle with this one
> on its own.
>
> But I think Maureen is talking about other projects, where the problem is
> indeed one of crashes and abnormal (non-zero status) exits which the BOINC
> Client *does* interpret as a computation error.
>
> Part of the trouble here is that the BOINC message log (without special
> logging flags) tends only to mention 'Output file absent'. It takes quite
> sophisticated knowledge of BOINC to understand that 'Output file absent'
> almost invariably means that the science application had previously crashed.
> I've written that repeatedly on project message boards (including BOINC's
> own message board): I could have sworn I'd also written about it quite
> recently on one of these BOINC mailing lists, but none of my search tools is
> turning up the reference.
>
> I think it would help if the BOINC client message log could actually use the
> word 'error', as in
>
> "Computation for task xxxxxxxxxxxxxxx finished with an error"
> "Exit status 3 (0x3) for task xxxxxxxxxxxxxxx"
>
> BoincView can log them - why not BOINC itself?
>
> ----- Original Message -----
> From: "Lynn W. Taylor" <[email protected]>
> To: <[email protected]>
> Sent: Tuesday, May 25, 2010 8:24 PM
> Subject: Re: [boinc_dev] host punishment mechanism revisited
>
>> There are two big issues here.
>>
>> First, we aren't really talking about work that has "crashed" -- we may
>> be able to tell that the work unit finished early, but work like the
>> SETI -9 results didn't "crash" -- they just had a lot of signals.
>>
>> Run time isn't necessarily an indicator of quality.
>>
>> What we're talking about is work that does not validate -- where the
>> result is compared to work done on another machine, and the results
>> don't match.
>>
>> Notice has to get from the validator, travel by some means to the
>> eyeballs of someone who cares about that machine, and register with
>> their Mark I mod 0 brain.
>>
>> The two issues are:
>>
>> What if the back-channel to the user is not available (old e-mail
>> address, or not running BOINCMGR)?
>>
>> What if the user is (obscure reference to DNA, since this is Towel Day)
>> missing, presumed fed?
>>
>> There is probably an argument that BOINC should shut down gracefully if
>> the machine owner doesn't verify his continued existence periodically.
>>
>> On 5/25/2010 12:07 PM, Maureen Vilar wrote:
>>
>>> When computation finishes prematurely would it be possible to add to the
>>> messages something like: 'This task crashed'? And even 'Ask for advice on
>>> the project forum'?
>> _______________________________________________
>> boinc_dev mailing list
>> [email protected]
>> http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
>> To unsubscribe, visit the above URL and
>> (near bottom of page) enter your email address.
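For what it's worth, the explicit error lines Richard suggests are trivial to format; a minimal sketch (the task name and exit status here are illustrative, and this is not BOINC's actual logging code):

```python
def error_log_lines(task_name, exit_status):
    """Format the two explicit error lines suggested above, with the
    exit status shown in both decimal and hex."""
    return [
        f"Computation for task {task_name} finished with an error",
        f"Exit status {exit_status} ({exit_status:#x}) for task {task_name}",
    ]

for line in error_log_lines("wu_12345_0", 3):
    print(line)
```

So the cost of saying 'error' plainly in the message log would be a formatting change, not new plumbing.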
