So what's the correct way to mark the result as invalid when quorum = 1?
The only other way I found is to mark it with a validate error.
I use the 'validate error' option to mark results which had any kind of
corruption in the output file (missing file, damaged file, damaged
structure, unexpected data etc.). This tells other daemons that the result
file is known to be bad, so they just ignore the result.
Note: results with 'validate error' did not spawn more than 1 new result.
For me 'invalid' means that the output file is readable, the data structure
is fine and can be parsed by the validator, but something is wrong with the
output. Here at Enigma@Home it usually happens when the client runs a broken
app or the host has serious hardware problems (faulty RAM frequently results
in pretty much random data written to the output).
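
The distinction between the two markings can be sketched roughly as
follows. This is only an illustration: the names Verdict, OutputCheck and
classify are made up for this sketch and are not part of the BOINC
validator API.

```cpp
#include <cassert>

// Illustrative stand-ins for the two ways a bad result can be marked.
// These names are local to this sketch, not BOINC server constants.
enum class Verdict {
    Valid,         // output parsed and the data is correct
    Invalid,       // output parsed, but the data is wrong
    ValidateError  // output missing, damaged, or unparseable
};

// Hypothetical summary of what a validator learned about one output file.
struct OutputCheck {
    bool file_present;  // the output file exists
    bool parsed_ok;     // structure is intact and the parser accepted it
    bool data_correct;  // the parsed data passed the project's checks
};

Verdict classify(const OutputCheck& c) {
    // Sanity check: a missing file cannot have been parsed successfully.
    assert(c.file_present || !c.parsed_ok);

    // Corruption of any kind: the file is known to be bad.
    if (!c.file_present || !c.parsed_ok) return Verdict::ValidateError;

    // Readable and well-formed, but the content is wrong
    // (broken app, faulty RAM, and so on).
    if (!c.data_correct) return Verdict::Invalid;

    return Verdict::Valid;
}
```

For example, a parseable file with wrong data maps to Verdict::Invalid,
while a missing file maps to Verdict::ValidateError.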
Here is an example of outcome = 5, validate_state = 2 when running the
'fixed' validator. Without increasing target_nresults, only 1 more result
was spawned, and it was then correctly validated on return.
This is what I'd expect.
http://www.enigmaathome.net/static/3202xb55id/invalidgood.png
And here is the same thing, outcome = 5, validate_state = 2, when running
the stock validator.
http://www.enigmaathome.net/static/3201xdqtwd/invalidbad.png
This one ends in one of 3 possible cases:
http://www.enigmaathome.net/static/3203xd8zgu/invalidend1.png
http://www.enigmaathome.net/static/3204xguazq/invalidend2.png
The third case is that the first result is validated and the second is
marked as 'didn't need', because the server does not wait forever once the
quorum is met (I think it waits for 6 hours, after which it's too late).
Btw, I had exactly the same problem at Radioactive@home.
Radioactive@home only collects data, so there is no need for any resends if
the WU fails; basically the workunit is just a name with a number assigned,
with no input files of any kind.
At first I marked bad results with validate error; the server then generated
a resend (which is unnecessary, but ok).
I had to fall back to 'Invalid' after I received a couple of questions,
because people actually thought it was the validator that was failing with
'validate error'.
So I ended up with exactly the same problem: multiple results spawned for a
single workunit, which was very bad. The workaround I used there was setting
'max error results' to 1, so that after the first error the entire workunit
is marked as bad.
Regards,
Slawomir Rzeznicki
http://www.enigmaathome.net
----- Original Message -----
From: "McLeod, John" <[email protected]>
To: "Slawomir Rzeznicki" <[email protected]>; "BOINC_dev"
<[email protected]>
Sent: Wednesday, March 13, 2013 2:55 PM
Subject: RE: [boinc_dev] Quorum 1, multiple results spawned
The problem is that you are marking the result as invalid. This is a signal
to the system to try again. Invalid does not mean a negative result, it
means the result failed to compute correctly.
-----Original Message-----
From: boinc_dev [mailto:[email protected]] On Behalf Of
Slawomir Rzeznicki
Sent: Tuesday, March 12, 2013 7:56 AM
To: BOINC_dev
Subject: [boinc_dev] Quorum 1, multiple results spawned
Hello,
Recently I've run into a problem at Enigma@Home.
The project uses quorum = 1, the other result settings are:
minimum quorum 1
initial replication 1
max # of error/total/success tasks 3, 10, 6
Everything is fine until the server receives a success result which is then
marked as invalid by the validator.
The validator leaves the outcome at RESULT_OUTCOME_SUCCESS and sets
validate_state to VALIDATE_STATE_INVALID.
The result is then marked as 'Completed, marked as invalid', just like I
want.
The problem is that the validator bumps target_nresults by 1, which results
in spawning two results instead of one.
Now theoretically it's not a problem: here at Enigma@Home each result runs
on its own, and it doesn't really matter if I use 1 workunit -> 1 result or
1 workunit -> many results (I use 1:1 mainly because it makes validation
easier).
However, due to random host turnaround times, some of the 'bonus work' will
be cancelled by the server (if one of the hosts is way faster and the other
one contacts the server) or, what's worse, it will be wasted if the other
host does not contact the server for a long period of time.
The source of the problem is in validator.cpp line 586:
    // if #success results >= target_nresults,
    // we need more results, so bump target_nresults
    // NOTE: nsuccess_results should never be > target_nresults,
    // but accommodate that if it should happen
    //
    if (nsuccess_results >= wu.target_nresults) {
        wu.target_nresults = nsuccess_results+1;
        transition_time = IMMEDIATE;
    }
Just as a test I commented out these lines and the problem is gone.
Does it count as a bug, at least when the project uses quorum = 1?
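
The effect of the bump with quorum = 1 can be modelled in a few lines. This
is a simplified sketch, not the real server code: WorkunitModel,
after_invalid_result and results_to_spawn are hypothetical names, and the
spawning rule is an assumption inferred from the behaviour described above
(a result marked invalid no longer counts toward target_nresults).

```cpp
#include <algorithm>
#include <cassert>

// Simplified stand-in for the workunit fields involved;
// this is not the real BOINC workunit struct.
struct WorkunitModel {
    int target_nresults;  // how many results the server aims to have
};

// Mirrors the quoted validator snippet. Passing bump_enabled = false
// corresponds to commenting those lines out.
void after_invalid_result(WorkunitModel& wu, int nsuccess_results,
                          bool bump_enabled) {
    assert(nsuccess_results >= 0);
    if (bump_enabled && nsuccess_results >= wu.target_nresults) {
        wu.target_nresults = nsuccess_results + 1;
    }
}

// ASSUMPTION: results marked invalid do not count toward the target,
// so the server creates (target_nresults - usable_results) new results.
int results_to_spawn(const WorkunitModel& wu, int usable_results) {
    return std::max(0, wu.target_nresults - usable_results);
}
```

Under this model, starting from target_nresults = 1 with one success result
marked invalid (so zero usable results), the stock path raises the target to
2 and two results are spawned, while the path without the bump keeps the
target at 1 and only one is spawned.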
Regards,
Slawomir Rzeznicki
http://www.enigmaathome.net
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.