On Thu, Jul 24, 2003 at 01:47:29PM -0700, Rudi Cilibrasi wrote:
> Though it does improve things, it still doesn't compete with SVM. I've
> updated my webpage to include a third candidate according to your
> specification, called BDA3. As you can see, it's still got a big error
> as the graph shows, even for bins it knows about.
Generally speaking, I am not sure we can learn anything all that useful from comparisons of SVM with BDA algorithms over data that is far too sparse to resemble the data we can anticipate in practice. These algorithms will only really be tested when they are forced to generalize over large datasets with large amounts of noise. In this case it is clear that SVM has simply "learned" the data; it hasn't been forced to generalize. A useful test would involve a direct comparison of the RoutingTimeEstimator's performance against SVM using realistic data; I will see if I can contact Hui to get the data I collected for him several months ago.

> In this case, at least, it certainly doesn't. If I am not doing something
> right in the interpolation, I invite you to try a smarter algorithm.

Some of the points produced by the BDA3 implementation are suspect (such as the one around [7,10100]), but my core concern is that the dataset is far too small, and that none of these algorithms are being forced to generalize.

> My feeling is no simple algorithm any of us makes up will do as well
> as the most obvious application of SVM.

Well, we will find out one way or the other, but I don't think this test supports either case. The correct generalization for this experiment would probably have been a gently curving downward slope, starting at about 1600 at size 0 and asymptotically approaching about 800 as the document size increases. In this instance, SVM is almost too smart for its own good: it closely tracks the random fluctuations in the data, clearly failing to identify them as noise. Only with a much denser dataset would we be able to see whether SVM is capable of recognizing that these fluctuations really aren't significant, and in that case I can think of no reason the RTE implementation couldn't do just as good a job.
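For what it's worth, the generalization I describe above (roughly 1600 at size 0, decaying asymptotically towards 800) can be written as a simple exponential-decay model. The sketch below, which is purely illustrative and not part of the experiment, fits that model to synthetic noisy data; the decay constant, the noise level, and the size range are all assumptions I've made up for the example. The point is that a three-parameter fit smooths over the fluctuations instead of tracking them point by point, which is the behaviour we'd want from any of these estimators:

```python
import numpy as np

def routing_time(size, floor, amplitude, tau):
    # Hypothesised model: ~floor+amplitude at size 0, decaying
    # asymptotically towards floor as document size grows.
    return floor + amplitude * np.exp(-size / tau)

# Synthetic stand-in for the measured data: the hypothesised curve
# (floor 800, size-0 value 1600) plus Gaussian noise. All values
# here are illustrative assumptions, not the real dataset.
rng = np.random.default_rng(0)
sizes = np.linspace(0.0, 50000.0, 200)
measured = routing_time(sizes, 800.0, 800.0, 10000.0) \
           + rng.normal(0.0, 50.0, sizes.shape)

# Grid-search the decay constant tau; for each candidate, the best
# (floor, amplitude) is a linear least-squares fit against the
# basis [1, exp(-size/tau)].
best = None
for tau in np.linspace(1000.0, 30000.0, 291):
    basis = np.column_stack([np.ones_like(sizes), np.exp(-sizes / tau)])
    coeffs, residual, _, _ = np.linalg.lstsq(basis, measured, rcond=None)
    if best is None or residual[0] < best[0]:
        best = (residual[0], tau, coeffs)

_, tau, (floor, amplitude) = best
print(f"floor={floor:.0f}  size-0 value={floor + amplitude:.0f}  tau={tau:.0f}")
```

With a dense enough sample, the fit recovers the underlying floor and size-0 value despite the noise, which is exactly the test I'd want to see SVM pass.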
Fortunately, we don't have to rely on feelings; we can run empirical tests and find out which is better, and by how much.

> I am curious if you still believe this, in view of the new experiments I've
> just run.

I am afraid that I am still skeptical, for the following reasons (in descending order of significance):

* Realistic datasets will be much larger and noisier than this one. The fact that SVM can track the random fluctuations so closely indicates that it has simply "learned" the dataset rather than generalized to any underlying structure; it wouldn't have that luxury with a realistic dataset containing tens of thousands of samples.

* Some of the points produced by the current BDA3 implementation are a bit suspect, which makes me wonder whether there might be a bug. This, however, isn't my main concern.

* Even if BDA3 is working as it should, our RTE algorithm is still more powerful; only a head-to-head comparison between RTE and SVM will be a fair test.

I really feel that the only fair comparison will use a large, realistic dataset and the actual algorithm we have developed for this purpose, for which even BDA3 is a primitive substitute. I will try to extract such a dataset from Hui so that we can make some progress here.

Ian.

--
Ian Clarke                          [EMAIL PROTECTED]
Coordinator, The Freenet Project    http://freenetproject.org/
Founder, Locutus                    http://locut.us/
Personal Homepage                   http://locut.us/ian/
