On Thu, Jul 24, 2003 at 11:43:02AM -0700, Rudi Cilibrasi wrote:
> I've just added a graph to illustrate this point in more detail: look
> at the bottom part of:
> 
> http://homepages.cwi.nl/~cilibrar/ngrouting/
> 
> Here, you can see in the northwest corner the problem with the BDA-style
> algorithms: They see an aberration, and have reapplied that guess in the
> next 4 green crosses (or the next blue square), even though that point
> didn't "make sense" with regards to the rest of the data.

This graph is very interesting.  Presumably the reason this occurred in
this case is that, if I understand your code correctly, the current BDA
implementation simply goes with the nearest bucket average for which
data exists, rather than interpolating between the two nearest - and,
unfortunately, for documents between 0 and 5 in size, that nearest
bucket is an outlier.  Making the algorithm interpolate (as the
RoutingTimeEstimator does) would probably have produced much more
favourable results for BDA1.
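To make the distinction concrete, here is a minimal sketch of the two
estimation rules as I understand them - the function names, the
(size, average) bucket representation, and the sample numbers are all my
own invention for illustration, not the actual BDA or
RoutingTimeEstimator code:

```python
import bisect

def nearest_bucket_estimate(buckets, size):
    # What I understand the current BDA implementation to do: return
    # the average of whichever populated bucket is closest in size.
    # `buckets` is a sorted list of (size, average_time) pairs.
    nearest = min(buckets, key=lambda b: abs(b[0] - size))
    return nearest[1]

def interpolated_estimate(buckets, size):
    # What the RoutingTimeEstimator is said to do: linearly
    # interpolate between the two nearest populated buckets,
    # clamping at either end of the range.
    sizes = [b[0] for b in buckets]
    i = bisect.bisect_left(sizes, size)
    if i == 0:
        return buckets[0][1]
    if i == len(buckets):
        return buckets[-1][1]
    (s0, t0), (s1, t1) = buckets[i - 1], buckets[i]
    frac = (size - s0) / (s1 - s0)
    return t0 + frac * (t1 - t0)

# Hypothetical data: the smallest bucket holds an outlier (4000),
# the rest of the data sits in the 500-2500 range.
buckets = [(2, 4000), (10, 600), (20, 700)]

# The nearest-bucket rule reapplies the outlier for every small
# document, while interpolation pulls estimates back toward the
# rest of the data as size grows.
print(nearest_bucket_estimate(buckets, 5))   # outlier dominates
print(interpolated_estimate(buckets, 5))     # partly corrected
```

Under the nearest-bucket rule the outlying first bucket dominates every
nearby estimate, which looks like exactly the behaviour in the northwest
corner of the graph; interpolation dilutes its influence in proportion
to distance.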

I also think this explains BDA2's strangely superior performance, which
is basically down to the blind luck that its default value of 500
happens to be much closer to the 2nd and 3rd data points.

Additionally, it isn't very clear that any of these algorithms shows
much evidence of usefully generalising from the data: most of the data
points seem to be spread between 500 and 2500, and if anything the SVM
seems to be following the actual data *too* closely.  I suspect that a
version of BDA1 with interpolation might actually do a better job of
ignoring random fluctuations in the sample data.

>  This is why
> it doesn't make sense to only use SVM's when a certain crucial data
> threshold is reached -- the counter-intuitive truth is that seemingly
> simpler and "more reliable" methods like exponential decay can wind
> up getting confused early and staying confused more, because they cannot
> differentiate model from noise.

I am afraid I don't think this data supports that hypothesis - rather,
I think the problem here is caused by not interpolating between the
bucket averages.

>  You can see this same problem again
> in sample 12 or so, and again near sample 16, and again at sample
> 55 or so.  In reality, these algorithms are less reliable.

I think it is clear that these algorithms are definitely less reliable
without interpolation between the bucket averages, but any reasonable
BDA implementation would employ interpolation - so I am not sure that we
can accept, yet, that SVMs are superior to a simpler approach.

Ian.

-- 
Ian Clarke                                                  [EMAIL PROTECTED]
Coordinator, The Freenet Project              http://freenetproject.org/
Founder, Locutus                                        http://locut.us/
Personal Homepage                                   http://locut.us/ian/
