On Thu, Jul 24, 2003 at 01:47:29PM -0700, Rudi Cilibrasi wrote:
> Though it does improve things, it still doesn't compete with SVM.  I've
> updated my webpage to include a third candidate according to your
> specification, called BDA3.  As you can see, it's still got a big error
> as the graph shows, even for bins it knows about.

Generally speaking, I am not sure we can learn anything all that useful
from comparisons of SVM with BDA algorithms over data that is far too
sparse to be representative of the kind of data we can anticipate in
practice.  These algorithms will only really be tested when they are
forced to generalize over large datasets with large amounts of noise.
In this case it is clear that SVM has just "learned" the data; it hasn't
been forced to generalize.
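
To make that concrete, the simplest check I can think of is to hold out
part of the data and compare training error against held-out error; a
learner that has merely memorized the points will show a large gap.
Here is a minimal sketch in Python - 'fit' and 'predict' are
hypothetical placeholders for whichever learner we plug in (SVM, BDA3,
RTE), not anything we have actually written:

    import random

    def generalization_gap(points, fit, predict, holdout=0.3):
        # points is a list of (size, time) pairs; fit/predict are
        # placeholders for the learner under test.
        points = list(points)
        random.shuffle(points)
        cut = int(len(points) * (1 - holdout))
        train, test = points[:cut], points[cut:]
        model = fit(train)
        mean_err = lambda pts: sum(abs(predict(model, x) - y)
                                   for x, y in pts) / len(pts)
        # A memorizing learner: tiny training error, large held-out error.
        return mean_err(train), mean_err(test)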

A useful test would involve a direct comparison of the
RoutingTimeEstimator's performance against SVM on realistic data.  I
will see if I can contact Hui to get the data I collected for him
several months ago.

> In this case, at least, it certainly doesn't.  If I am not doing something
> right in the interpolation, I invite you to try a smarter algorithm.

Some of the points produced by the BDA3 implementation are suspect (such
as the one around [7,10100]), but my core concern is that the dataset is
far too small, and that none of these algorithms are being forced to
generalize.

> My feeling is no simple algorithm any of us makes up will do as well
> as the most obvious application of SVM.

Well, we will find out one way or the other, but I don't think that this
test supports either case.  The correct generalization for this
experiment would probably have been a gently curving downward slope from
about 1600 at size 0 asymptotically approaching about 800 as the
document size increases.  In this instance SVM is almost too smart for
its own good - it closely tracks the random fluctuations in the data,
clearly failing to recognize them as noise.  Only with a much denser
dataset would we be able to see whether SVM is capable of recognizing
that these fluctuations really aren't significant - and in that case I
can think of no reason the RTE implementation couldn't do just as good
a job.
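
To put a rough shape on that, the kind of curve I would expect a
properly generalizing learner to find is a simple decay towards the
asymptote.  The figures below are just the numbers I eyeballed above,
and the time constant is a pure guess for illustration - this is not a
fitted model:

    import math

    def expected_routing_time(size, floor=800.0, start=1600.0, tau=5000.0):
        # Roughly 1600 at size 0, decaying towards roughly 800 as the
        # document size grows; 'tau' controls how gently it curves and
        # is only a guess here.
        return floor + (start - floor) * math.exp(-size / tau)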

Fortunately, we don't have to rely on feelings - we can do empirical
tests and find out which is better, and how much better it is.

> I am curious if you still believe this, in view of the new experiments I've
> just run.

I am afraid that I am still skeptical for the following reasons (in
descending order of significance):

* Realistic datasets will be much larger and noisier than this one.
  The fact that SVM can track the random fluctuations so closely
  indicates that it has just "learned" the dataset rather than
  generalized to any underlying structure; it wouldn't have that luxury
  with a realistic dataset containing tens of thousands of samples.

* Some of the points produced by the current BDA3 implementation are a
  bit suspect, making me wonder whether there might be a bug - this,
  however, isn't my main concern.

* Even if BDA3 is working as it should, our RTE algorithm is still more
  powerful; only a head-to-head comparison between RTE and SVM will be
  a fair test.

I really feel that the only fair comparison will be one that uses a
large, realistic dataset and pits SVM against the actual algorithm we
have developed for this purpose, for which even BDA3 is a primitive
substitute.  I will try to extract such a dataset from Hui so that we
can make some progress here.

Ian.

-- 
Ian Clarke                                                  [EMAIL PROTECTED]
Coordinator, The Freenet Project              http://freenetproject.org/
Founder, Locutus                                        http://locut.us/
Personal Homepage                                   http://locut.us/ian/
