On Thu, Jul 24, 2003 at 11:43:02AM -0700, Rudi Cilibrasi wrote:
> I've just added a graph to illustrate this point in more detail: look
> at the bottom part of:
>
> http://homepages.cwi.nl/~cilibrar/ngrouting/
>
> Here, you can see in the northwest corner the problem with the BDA-style
> algorithms: They see an aberration, and have reapplied that guess in the
> next 4 green crosses (or the next blue square), even though that point
> didn't "make sense" with regards to the rest of the data.
This graph is very interesting. Presumably the reason this occurred in
this case is that, if I understand your code correctly, rather than
interpolating between the two nearest bucket averages for which data
exists, the current BDA implementation just goes with the nearest bucket -
which, unfortunately, for documents between 0 and 5 in size, is an
outlier. Making the algorithm interpolate (as the RoutingTimeEstimator
does) would probably have produced much more favourable results for BDA1.

I also think this explains BDA2's strangely superior performance - which
is basically down to the blind luck that its default value of 500 happens
to be much closer to the 2nd and 3rd data points.

Additionally, it isn't very clear that any of these algorithms show much
evidence of usefully generalizing the data: most of the data points seem
to be spread between 500 and 2500, and if anything SVM seems to be
following the actual data *too* closely. I suspect that a version of BDA1
with interpolation might actually do a better job of ignoring random
fluctuations in the sample data.

> This is why
> it doesn't make sense to only use SVM's when a certain crucial data
> threshold is reached -- the counter-intuitive truth is that seemingly
> simpler and "more reliable" methods like exponential decay can wind
> up getting confused early and staying confused more, because they cannot
> differentiate model from noise.

I am afraid that I don't think this data supports that hypothesis -
rather, I think the problem here is caused by not interpolating between
the bucket averages.

> You can see this same problem again
> in sample 12 or so, and again near sample 16, and again at sample
> 55 or so. In reality, these algorithms are less reliable.
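To make concrete what I mean by interpolating between the bucket averages,
here is a rough sketch of the two policies. All names and numbers are
illustrative only - this is not the actual Freenet code, just the general
idea:

```python
def nearest_estimate(buckets, size):
    """Return the average of the bucket whose key is closest to `size`."""
    key = min(buckets, key=lambda k: abs(k - size))
    return buckets[key]

def interpolated_estimate(buckets, size):
    """Linearly interpolate between the two nearest bucket averages."""
    keys = sorted(buckets)
    if size <= keys[0]:
        return buckets[keys[0]]
    if size >= keys[-1]:
        return buckets[keys[-1]]
    for lo, hi in zip(keys, keys[1:]):
        if lo <= size <= hi:
            frac = (size - lo) / (hi - lo)
            return buckets[lo] + frac * (buckets[hi] - buckets[lo])

# An outlier average in the smallest bucket dominates nearest-bucket
# estimates for small sizes; interpolation pulls the estimate back
# toward the neighbouring buckets.
buckets = {2: 4000, 10: 800, 20: 900}    # size -> average time (made up)
print(nearest_estimate(buckets, 5))       # 4000: the outlier wins outright
print(interpolated_estimate(buckets, 5))  # 2800.0: blended with next bucket
```

With nearest-bucket lookup the outlier is reapplied wholesale to every
nearby query, which is exactly the behaviour visible in the northwest
corner of the graph; interpolation dilutes it in proportion to distance.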
I think it is clear that these algorithms are definitely less reliable
without interpolation between the bucket averages, but any reasonable BDA
implementation would employ interpolation - so I am not sure that we can
accept, yet, that SVMs are superior to a simpler approach.

Ian.

-- 
Ian Clarke                                      [EMAIL PROTECTED]
Coordinator, The Freenet Project        http://freenetproject.org/
Founder, Locutus                                  http://locut.us/
Personal Homepage                             http://locut.us/ian/
