Adaptive replication should track a machine's validation and error history.
Machines that have high error rates (and the machine you are describing has
a high error rate) will have a very low chance of running without
validation. On the other hand, machines that never have validation errors
will have a very high chance of running solo.
The way I would do it is to store a success fraction per computer, F = 1 -
(errors + aborts + invalids) / total tasks. The decision of whether to
actually issue another task after this one would be based on the quantity
(R - (N + 1)) * (1 - F) * C, treated as the probability of issuing another
replica (capped at 1), where R is the replication level requested by the
project (one-based), N is the replication number of this replica
(zero-based), (1 - F) is the failure fraction for this project on this
computer, and C is some constant to prevent computers that have regular
errors from ever running solo. Since (R - (N + 1)) is 0 for the last
requested replica, no others will be issued unless there is an error or a
late task. If C is 10, then only hosts with better than a 90% success rate
will EVER run solo in a 2-replica system: the first replica spawns a second
with probability (2 - 1) * (1 - F) * 10, which reaches 1 as soon as the
failure fraction is 0.1 or more. C could be a project setting, but it
should never be allowed to be set to less than 1. Arguably, 10 is about
right.
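
To make that concrete, here is a minimal C++ sketch of the decision
(HostStats, its field names, and the random draw are illustrative
assumptions, not actual BOINC scheduler code):

#include <algorithm>
#include <cstdlib>

// Hypothetical per-host, per-project record of the counters described above.
struct HostStats {
    int total_tasks = 0; // tasks issued to this host for this project
    int errors = 0;      // compute errors
    int aborts = 0;      // aborted tasks
    int invalids = 0;    // results that failed validation

    // Success fraction F = 1 - (errors + aborts + invalids) / total tasks.
    double success_fraction() const {
        if (total_tasks == 0) return 0.0; // no history yet: treat as untrusted
        return 1.0 - double(errors + aborts + invalids) / total_tasks;
    }
};

// Decide whether to issue another replica after replica N (zero-based) of a
// workunit whose requested replication level is R (one-based). C is the
// project constant (>= 1) that keeps error-prone hosts from running solo.
bool issue_another_replica(const HostStats& h, int R, int N, double C) {
    double failure_fraction = 1.0 - h.success_fraction();
    // Probability of spawning one more replica, capped at 1. The factor
    // (R - (N + 1)) is 0 for the last requested replica, so nothing beyond
    // R is ever issued unless a result errors out or comes in late.
    double p = std::min(1.0, (R - (N + 1)) * failure_fraction * C);
    return double(rand()) / RAND_MAX < p;
}

With C = 10 and R = 2, a host with a 5% failure fraction draws a second
replica with probability 0.5; at 10% failures or more it is always
replicated and never runs solo.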
jm7
Raistmer <[email protected]>
Sent by: <[email protected]>
To: "David Anderson" <[email protected]>, "John Keck" <[email protected]>
Cc: BOINC Developers Mailing List <[email protected]>
Date: 11/08/2009 03:44 AM
Subject: Re: [boinc_dev] [boinc_projects] new credit design
> The following 3 mechanisms work together:
>
> 1) adaptive replication (to reject wrong results with high probability)
> Possibly with an added app-specific consistency check as John suggests.
> Hopefully they'll become the defaults.
>
> - DPA
>
I truly hope this doesn't become the default.
Adaptive replication completely ignores random events; it is just too
clumsy to react to them. For example, I have a GPU that, after many hours
of work, can start to produce overflows on SETI. Not a single error is
reported in stderr; it just starts to process junk and to find many signals
in it.
After a reboot it behaves as a good device again. What will end up in the
science database from such devices without redundancy of at least 2?
All that junk will go directly into the database, because the host also
returns many results from its other devices, so it has had enough time to
gain the server's trust (only to rudely betray it). This is a situation
where _many_ (!) invalid results can go into the database without
validation.
Of course there are also invalid results from borderline overclocked
systems; 1-2 invalids per week are quite possible.
If you make the trust-gain conditions too hard, adaptive replication will
not bring much performance benefit while keeping all the same defects,
because even a totally trusted host can start to produce invalid results at
some point in time (trivially: dust in a fan ;) ). A sanity check will not
help at all, because if you process wrong data by correct rules there is
very little chance it will be noticed without a complete comparison against
another returned result. And if you make the trust-gain conditions easy,
projects' databases will be overflowed with invalid results.
Because in my case, for example, it takes ~26 seconds to generate an
invalid overflow, while it takes ~100 minutes to process a task properly.
I hope this gives some impression of what will happen if some high-end GPU
goes mad... (mine is low-end, of course).
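
A quick back-of-the-envelope check of that flood rate, using the ~26 s and
~100 min figures from my host above (a sketch, not project-wide numbers):

#include <cstdio>

int main() {
    const double valid_task_s  = 100 * 60; // ~100 minutes per valid result
    const double junk_result_s = 26;       // ~26 seconds per invalid overflow
    // Junk results a misbehaving host can return while one valid task runs.
    printf("%.0f junk results per valid task\n", valid_task_s / junk_result_s);
    return 0;
}

That is roughly 230 invalid results flooding the database for every slot
where one good result was expected.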
And again, to say "this overflow is not a valid one" you need to check it
against another result; an overflow by itself (-9) is completely legal on
SETI MB.
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.