OK, I was misreading the problem. I thought you were getting one task sent out when you were expecting 0. Getting 2 tasks sent (#s 1 and 2, where the original was #0) when you have a replication of 1 looks like a bug.
-----Original Message-----
From: Slawomir Rzeznicki
Sent: Wednesday, March 13, 2013 10:50 AM
To: McLeod, John; BOINC_dev
Subject: Re: [boinc_dev] Quorum 1, multiple results spawned

So what's the correct way to mark the result as invalid when quorum = 1? The only other way I found is to mark it with a validate error.

I use the 'validate error' option to mark results which had any kind of corruption in the output file (missing file, damaged file, damaged structure, unexpected data, etc.). This tells the other daemons that the result file is known to be bad, so they just ignore the result. Note: results with 'validate error' did not spawn more than 1 new result.

For me, 'invalid' means that the output file is readable and its data structure is fine and can be parsed by the validator, but something is wrong with the output. Here at Enigma@Home it usually happens when the client runs a broken app or the host has serious hardware problems (faulty RAM frequently results in pretty much random data being written to the output).

Here is an example of outcome = 5, validate_state = 2 when running the 'fixed' validator. Without increasing target_nresults, only 1 more result was spawned and then correctly validated on return. This is what I'd expect.
http://www.enigmaathome.net/static/3202xb55id/invalidgood.png

And here is the same thing, outcome = 5, validate_state = 2, when running the stock validator.
http://www.enigmaathome.net/static/3201xdqtwd/invalidbad.png

This one can end in one of 3 possible ways:
http://www.enigmaathome.net/static/3203xd8zgu/invalidend1.png
http://www.enigmaathome.net/static/3204xguazq/invalidend2.png
In the third case, the first result is validated and the second is marked as 'didn't need', because the server does not wait forever once the quorum is met (I think it waits for 6 hours, after which it's too late).

Btw, I had exactly the same problem at radioactive@home.
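The policy described in this message (missing or corrupt output -> 'validate error'; readable, parseable but implausible output -> 'invalid') can be sketched as below. This is a minimal illustration only: `Verdict`, `parse_output`, `classify` and the one-integer-per-line output format are all hypothetical, and nothing here is the BOINC validator API.

```cpp
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical verdicts. Real BOINC records these in the result's
// outcome / validate_state fields, which this sketch does not model.
enum class Verdict { Valid, Invalid, ValidateError };

// Hypothetical output format: one integer per line.
// Returns std::nullopt if the text cannot be parsed (damaged structure).
std::optional<std::vector<long>> parse_output(const std::string& text) {
    std::istringstream in(text);
    std::vector<long> values;
    std::string line;
    while (std::getline(in, line)) {
        if (line.empty()) continue;
        try {
            std::size_t used = 0;
            long v = std::stol(line, &used);
            if (used != line.size()) return std::nullopt;  // trailing junk
            values.push_back(v);
        } catch (...) {
            return std::nullopt;  // not a number, or out of range
        }
    }
    if (values.empty()) return std::nullopt;  // no data at all
    return values;
}

// file_contents is std::nullopt when the output file is missing.
Verdict classify(const std::optional<std::string>& file_contents) {
    if (!file_contents) return Verdict::ValidateError;  // missing file
    auto parsed = parse_output(*file_contents);
    if (!parsed) return Verdict::ValidateError;         // corrupt file
    for (long v : *parsed) {
        // Hypothetical plausibility rule: negative values are "wrong data",
        // e.g. produced by a broken app or faulty RAM.
        if (v < 0) return Verdict::Invalid;
    }
    return Verdict::Valid;
}
```

The split keeps "we could not read it" separate from "we read it and it is wrong", which is exactly the distinction the thread turns on.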
Radioactive@home only collects data, so there is no need for any resends if the WU fails; basically the workunit is just a name with a number assigned, with no input files of any kind. At first I marked bad results with a validate error, and the server then generated a resend (which is unnecessary, but OK). I had to fall back to 'invalid' after I received a couple of questions, because people actually thought it was the validator that was failing with 'validate error'. So eventually I ended up with exactly the same problem: multiple results spawned for a single workunit, which was very bad. The workaround I used there was setting 'max error results' to 1, so that after the first error the entire workunit is marked as bad.

Regards,
Slawomir Rzeznicki
http://www.enigmaathome.net

----- Original Message -----
From: "McLeod, John"
To: "Slawomir Rzeznicki"; "BOINC_dev"
Sent: Wednesday, March 13, 2013 2:55 PM
Subject: RE: [boinc_dev] Quorum 1, multiple results spawned

The problem is that you are marking the result as invalid. This is a signal to the system to try again. Invalid does not mean a negative result; it means the result failed to compute correctly.

-----Original Message-----
From: boinc_dev On Behalf Of Slawomir Rzeznicki
Sent: Tuesday, March 12, 2013 7:56 AM
To: BOINC_dev
Subject: [boinc_dev] Quorum 1, multiple results spawned

Hello,

Recently I ran into a problem at Enigma@Home. The project uses quorum = 1; the other result settings are:
minimum quorum 1
initial replication 1
max # of error/total/success tasks 3, 10, 6

Everything is fine until the server receives a successful result which is then marked as invalid by the validator. The validator leaves the outcome at RESULT_OUTCOME_SUCCESS and sets validate_state to VALIDATE_STATE_INVALID. The result is then marked as 'Completed, marked as invalid', just as I want.
The problem is that the validator bumps target_nresults by 1, which results in two results being spawned instead of one. In theory this isn't a problem: here at Enigma@Home each result runs on its own, and it doesn't really matter whether I use 1 workunit -> 1 result or 1 workunit -> many results (I use 1:1 mainly because it makes validation easier). However, due to random host turnaround times, some of the 'bonus work' will be cancelled by the server (if one of the hosts is much faster, the other host's task is cancelled when it contacts the server) or, worse, wasted if the other host does not contact the server for a long time.

The source of the problem is in validator.cpp, line 586:

    // if #success results >= target_nresults,
    // we need more results, so bump target_nresults
    // NOTE: nsuccess_results should never be > target_nresults,
    // but accommodate that if it should happen
    //
    if (nsuccess_results >= wu.target_nresults) {
        wu.target_nresults = nsuccess_results+1;
        transition_time = IMMEDIATE;
    }

Just as a test I commented out these lines, and the problem went away. Does this count as a bug, at least when the project uses quorum = 1?

Regards,
Slawomir Rzeznicki
http://www.enigmaathome.net

_______________________________________________
boinc_dev mailing list
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
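To see how one invalid result can turn into two new tasks, here is a toy model of the interaction. The validator half mirrors the validator.cpp branch quoted in the thread, assuming nsuccess_results counts every returned successful result, including one just marked invalid. The transitioner half is an ASSUMPTION about the stock transitioner (new tasks = target_nresults minus results that are unsent, in progress, or successful and not invalid), not checked against its source. All type and function names are hypothetical. Under these assumptions the model reproduces what was reported: 2 new tasks with the stock validator, 1 with the bump commented out.

```cpp
#include <vector>

// Toy state: collapses BOINC's server_state and outcome into one field.
enum class State { Unsent, InProgress, Success };
enum class ValidateState { Init, Valid, Invalid };

struct Result {
    State state;
    ValidateState validate_state;
};

struct Workunit {
    int target_nresults;
    std::vector<Result> results;
};

// Validator side: mirrors the quoted branch. Assumes nsuccess_results
// counts every successful result, including invalid ones.
void validator_after_invalid(Workunit& wu, bool bump_enabled) {
    int nsuccess_results = 0;
    for (const Result& r : wu.results)
        if (r.state == State::Success) ++nsuccess_results;
    if (bump_enabled && nsuccess_results >= wu.target_nresults)
        wu.target_nresults = nsuccess_results + 1;
}

// Transitioner side (ASSUMPTION): invalid successes are not counted,
// so the whole shortfall is created as new tasks.
int transitioner_new_tasks(const Workunit& wu) {
    int counted = 0;
    for (const Result& r : wu.results) {
        bool is_invalid_success = r.state == State::Success &&
                                  r.validate_state == ValidateState::Invalid;
        if (!is_invalid_success) ++counted;
    }
    int n = wu.target_nresults - counted;
    return n > 0 ? n : 0;
}

// Quorum-1 scenario from the thread: one returned result, marked invalid.
int spawned(bool bump_enabled) {
    Workunit wu{1, {{State::Success, ValidateState::Invalid}}};
    validator_after_invalid(wu, bump_enabled);
    return transitioner_new_tasks(wu);
}
```

In this model the extra task comes from the two sides disagreeing about whether an invalid result counts as a success; the model itself does not say which side the real fix belongs on.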
