How much benefit can be derived from faster detection of problem hosts
is certainly an important question, and I don't have enough data to make
any judgement. Any potential change, of course, ought to be subject to a
cost/benefit analysis.
My guess is that the idea of a pre-validation check of each returned
result would definitely be worthwhile for projects which can detect a
significant fraction of problems that way. It could be just an additional
validate_state value; the transition would be from VALIDATE_STATE_INIT to
a new prechecked state for each result received before a quorum can be
formed. For s...@h it's unlikely to be useful, so the pre-check code would
trivially change that status and return.
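
To make the pre-check transition concrete, here is a minimal sketch. It is
written in Python for brevity rather than as BOINC validator code, and
everything in it other than the idea of moving a result out of
VALIDATE_STATE_INIT is an assumption: the PRECHECKED and PRECHECK_FAILED
names, the Result class, and the looks_sane callback are hypothetical, and
what should happen to a result that fails the pre-check is not settled above.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable


class ValidateState(Enum):
    # INIT stands in for VALIDATE_STATE_INIT. The other two members are
    # hypothetical; the values are symbolic, not BOINC's numeric constants.
    INIT = auto()
    PRECHECKED = auto()
    PRECHECK_FAILED = auto()


@dataclass
class Result:
    host_id: int
    output: str
    validate_state: ValidateState = ValidateState.INIT


def precheck(result: Result, looks_sane: Callable[[str], bool]) -> ValidateState:
    """Apply a project-supplied sanity test to one result before a quorum exists."""
    if result.validate_state is not ValidateState.INIT:
        # Already pre-checked (or otherwise handled); leave it untouched.
        return result.validate_state
    if looks_sane(result.output):
        result.validate_state = ValidateState.PRECHECKED
    else:
        result.validate_state = ValidateState.PRECHECK_FAILED
    return result.validate_state


def trivial_precheck(result: Result) -> ValidateState:
    """Pre-check for a project that cannot detect anything this way: accept all."""
    return precheck(result, lambda _output: True)
```

A project that can spot bad results early would pass its own looks_sane
test; one that cannot would use something like trivial_precheck, which only
changes the status and returns.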
My idea of becoming suspicious about a host when its performance differs
significantly from earlier stats may or may not be useful even at s...@h. For
the transition to twice as many chirp steps a few months ago there was no
corresponding threshold change, and the rate of result_overflow tasks just
about doubled. One might think that implies that the incidence of overflows
due to the CUDA runaway state is such a small fraction it is negligible, but
AFAIK that rate is based on the assimilated results. Since even two CUDA
hosts in runaway state don't produce results which are "strongly similar",
that statistic doesn't apply.
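
One possible shape for the "performance differs significantly from earlier
stats" test is sketched below, again in Python and purely as an
illustration. It assumes the server keeps per-host counts of result_overflow
tasks and total tasks for an earlier baseline period and for a recent
window; the function name, its parameters, and the 3-sigma threshold are all
hypothetical. It uses a normal approximation to the binomial, which is only
rough for small windows.

```python
import math


def overflow_rate_is_suspicious(
    baseline_overflows: int,
    baseline_tasks: int,
    recent_overflows: int,
    recent_tasks: int,
    z_threshold: float = 3.0,  # hypothetical cutoff, not a tuned value
) -> bool:
    """Flag a host whose recent overflow count is far above its own history."""
    if baseline_tasks <= 0 or recent_tasks <= 0:
        return False  # no history or no recent work: nothing to compare
    p = baseline_overflows / baseline_tasks      # host's historical overflow rate
    expected = p * recent_tasks                  # overflows expected in the window
    variance = recent_tasks * p * (1.0 - p)      # binomial variance
    if variance == 0.0:
        # History is all-overflow or none; any excess over it is a change.
        return recent_overflows > expected
    z = (recent_overflows - expected) / math.sqrt(variance)
    return z > z_threshold  # one-sided: only an increase is suspicious
```

Because the comparison is against the host's own earlier record, it does not
depend on two results being "strongly similar". A project-wide change like
the doubling of chirp steps would shift every host at once, so the baseline
would presumably need to be restarted after such a change.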
What really concerns me is that David is enthusiastic about adaptive
replication. If a CUDA host had established a good record, had several
thousand unreplicated tasks, and then went runaway, all those tasks would
be assimilated as canonical results. As David has pointed out, the false
positives can be eliminated by NTPCKR and subsequent examination, but the
record shows the subband and time period of each of those tasks as having
been checked when it actually hasn't.
Forgive me for my s...@h centric post. I try to consider how changes will be
useful for all projects but have insufficient knowledge to do so effectively.
In any case, BOINC's evolution ought to be toward faster, smarter, and more
efficient rather than the opposite.
--
Joe
On 27 May 2010 at 19:32, Lynn wrote:
> Seems to me that the really fast methods of catching a broken host
> aren't very good, and the good methods aren't very fast.
>
> Does BOINC need fast?
>
> On 5/27/2010 7:19 PM, Josef W. Segur wrote:
> > I certainly agree, and even were it a perfect method of finding problem
> > hosts the best method of minimizing their damage to the project is
> > uncertain. Your inputs along that line have seemed very good to me.