> of noise.  In this case it is clear that SVM has just "learned" the 
> data, it hasn't been forced to generalize.

How can you say it isn't "generalizing" when it is, in fact, making
novel and good predictions in the first 20 datapoints, for example?
Many of those first 20 have no nearby bin information.  The algorithm is
absolutely generalizing.  "Overfitting" would only be a concern if I
weren't structuring this as 99 separate train-test cycles, so that
the point being predicted is *never* included in the corresponding
training set.  Each prediction is based only on the doc size for that sample
and all data points to the LEFT of the point to be predicted -- in
other words, there is no way to "memorize" the data the way I've set
this experiment up.  I am afraid the graph has worked against me
here: because of the way it's drawn, it looks like the typical
"noisy signal" you will see in ML textbooks, in which the "right
answer" is some line running smoothly through the middle.  In this
case, that visual intuition is misleading.

> A useful test would involve a direct comparison of the
> RoutingTimeEstimator's performance with SVM using realistic data, I will
> see if I can contact Hui to get the data I collected for him several
> months ago.

I think this is probably the next useful step also.

> > In this case, at least, it certainly doesn't.  If I am not doing something
> > right in the interpolation, I invite you to try a smarter algorithm.
> Some of the points produced by the BDA3 implementation are suspect (such
> as the one around [7,10100]), but my core concern is that the dataset is
> far too small, and that none of these algorithms are being forced to
> generalize.

If you look at the inp.dat dataset, the outlier at 1 is described as:
2395 61646 10133
If you look at the point at 7 that you are referring to:
17257 66052 1284
As you can see, the "size" given for point 7 is *higher* than the
outlier's.  So any reasonable predictor based on any sort of averaging
scheme (weighted or otherwise) must return at *least* 10133, and more
if you support extrapolation, as opposed to safer interpolation.
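
To make the averaging argument concrete, here is a minimal sketch of an
inverse-distance-weighted predictor over document sizes.  Only the
(61646, 10133) outlier comes from inp.dat; the second training point is
hypothetical, just to give the average something to pull against.  A query
at size 66052 sits much closer to the outlier than to anything else, so
the outlier's 10133 ms latency dominates the result:

```java
// Sketch: inverse-distance-weighted average over document sizes.
// The (61646, 10133) point is the inp.dat outlier; (20000, 900) is a
// hypothetical second sample for illustration.
public class IdwSketch {
    // Predict latency for querySize from (size, latency) training pairs.
    static double predict(double[] sizes, double[] latencies, double querySize) {
        double num = 0, den = 0;
        for (int i = 0; i < sizes.length; i++) {
            double d = Math.abs(sizes[i] - querySize);
            if (d == 0) return latencies[i];  // exact size match
            double w = 1.0 / d;               // closer points weigh more
            num += w * latencies[i];
            den += w;
        }
        return num / den;
    }

    public static void main(String[] args) {
        double[] sizes     = {61646, 20000};
        double[] latencies = {10133, 900};
        // Query at point 7's size: the 61646-byte outlier dominates.
        System.out.println(predict(sizes, latencies, 66052));
    }
}
```

Any pure interpolation like this is bounded above by the largest training
latency; only an extrapolating scheme could go beyond 10133.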

In machine learning, many algorithms are what we call "PAC learners", which
means that, given enough data, they will converge to the right answer.  Of
course, averaging schemes can also get the right answer if the rest of the
internet is a completely stationary source and you measure 1000000 points.
The tradeoff in learning comes when
your source is not stationary, but instead changes over time, for
instance because more people are using the internet year by year, or
a network cable gets cut somewhere and reroutes traffic, or whatever.  When
your underlying source changes faster than your learning algorithm adapts,
you have a continuous failure to track the best approximation.  In
other words, unless your algorithm can learn quickly enough, it is
not going to work well in cases that change too suddenly.  So you can
see that in real situations, we simply don't have access to more data
from the same mathematical entity: I can collect 100 or 1000 samples in
2 minutes, but the next 2 minutes may have a different network environment.
So the primary measures of a learning or control technique are its
accuracy and the amount of data that is needed to achieve that accuracy.
That is why it is crucial that your choice of learning algorithm be
able to figure stuff out with as few examples as possible.  This is
one way of looking at an answer to your question of how to choose
appropriate decay coefficients -- adaptively, based on your estimation of
how rapidly the underlying source is changing.
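
As a sketch of what "adaptive decay" could mean (this is my own toy scheme,
not the actual RoutingTimeEstimator), here is an exponentially weighted
moving average that raises its decay coefficient when recent prediction
errors spike, i.e. when the source appears to be drifting, and lowers it
again when things look stable:

```java
// Hypothetical sketch: an EWMA whose decay coefficient alpha adapts to
// the estimated rate of change of the source.  All constants here are
// illustrative assumptions, not tuned values.
public class AdaptiveEwma {
    double estimate = Double.NaN; // current latency estimate
    double smoothedError = 0;     // slow-moving average of prediction error
    double alpha = 0.1;           // current decay coefficient
    static final double MIN_ALPHA = 0.05, MAX_ALPHA = 0.5;

    // Feed one observation; returns the updated estimate.
    double update(double x) {
        if (Double.isNaN(estimate)) { estimate = x; return estimate; }
        double err = Math.abs(x - estimate);
        smoothedError = 0.9 * smoothedError + 0.1 * err;
        // A large error relative to recent history suggests the source
        // is changing: learn faster.  Otherwise, relax toward MIN_ALPHA.
        if (err > 2 * smoothedError) alpha = Math.min(MAX_ALPHA, alpha * 1.5);
        else                         alpha = Math.max(MIN_ALPHA, alpha * 0.95);
        estimate = alpha * x + (1 - alpha) * estimate;
        return estimate;
    }
}
```

On a stable source this settles to slow averaging; after a sudden shift it
ramps alpha up and converges on the new level within a few dozen samples.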

> > My feeling is no simple algorithm any of us makes up will do as well
> > as the most obvious application of SVM.
> Well, we will find out one way or the other, but I don't think that this
> test supports either case.  The correct generalization for this
> experiment would probably have been a gently curving downward slope from
> about 1600 at size 0 asymptotically approaching about 800 as the
> document size increases.  In this instance, SVM is almost too smart for
> its own good - as it closely tracks the random fluctuations in the data,
> clearly not identifying that they are noise.

I'm sorry I didn't put axes on the graph... I think you have been misreading
it, as you are characterizing the fluctuations as "noise."  
Let me clarify:  The vertical axis is actual or predicted latency.
The horizontal is just "sample index".  Since I do
not display file size in the graph, yet I chose random files (and
therefore random file sizes from a group of 30 possibilities), this
makes it look like "noise" when viewed as measured latency over
sample index.  But it's not -- because I am fetching random documents
of random sizes, I am drawing data from different distributions.
Maybe it will help to look at a few entries in inp.dat to see what I mean.
The data are well-matched by the
SVM precisely *because* it has separated the noise from the signal.
*Every* prediction for every classifier is on a novel point -- the first
prediction is based on training on the first sample point plus the size of
the second point.  The next prediction is based on the first two sample
points plus the size of the third point.  And so on for 99 predictions.
So every prediction is a generalization to new data.  The way the SVM
accurately predicts points shows that it's working correctly.  There is
no issue of overfitting whatsoever in the experiment as I have now set
it up.  The only "random fluctuations" in the data are those cases
where the SVM *doesn't* match the red target data -- in other words,
the vertical distance between the blue prediction and the red line.
I just reread your comment about starting at 1600 and moving to 800
and I am further convinced that you are thinking this graph means
something other than it does, as I cannot make sense of these numbers.
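
For reference, the train-test protocol described above can be sketched like
this (a simplified stand-in for the runLATest function in Predictor.java;
the mean-of-past predictor is just a placeholder where the SVM would be
plugged in):

```java
// Sketch of the walk-forward evaluation protocol: for each index k, the
// model sees only samples 0..k-1 (plus the doc size of sample k), so the
// point being predicted is never in its own training set.
import java.util.ArrayList;
import java.util.List;

public class WalkForward {
    // Placeholder predictor: mean latency over all past samples.
    // The real experiment plugs in the SVM (or BDA3, or RTE) here.
    static double predictFromPast(double[] latency, int k) {
        double sum = 0;
        for (int i = 0; i < k; i++) sum += latency[i];
        return sum / k;
    }

    static List<Double> run(double[] latency) {
        List<Double> predictions = new ArrayList<>();
        // Start at k = 1: the first prediction trains on sample 0 only.
        for (int k = 1; k < latency.length; k++) {
            predictions.add(predictFromPast(latency, k));
        }
        return predictions;
    }
}
```

With 100 samples this yields 99 predictions, each on data strictly to the
left of the predicted point -- memorization is structurally impossible.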

To clarify this misunderstanding further, I've introduced another graph
on the webpage showing file size vs. latency in inp.dat.  As you can see,
the data are organized along a line, and this is the pattern that the SVM is
picking up well.  It could be said that each 50k of data accounts for
one second (1000 ms) of latency, and that this line goes through the
origin.
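
That reading corresponds to a least-squares line through the origin, whose
slope has the closed form sum(x*y) / sum(x*x).  With the rough numbers
above (about 1000 ms per 50000 bytes), the slope works out near 0.02 ms
per byte.  A sketch:

```java
// Sketch: least-squares fit of latency = m * size with no intercept.
// For a zero-intercept model the optimal slope is sum(x*y) / sum(x*x).
public class OriginFit {
    static double slope(double[] size, double[] latencyMs) {
        double sxy = 0, sxx = 0;
        for (int i = 0; i < size.length; i++) {
            sxy += size[i] * latencyMs[i];
            sxx += size[i] * size[i];
        }
        return sxy / sxx;
    }
}
```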

> 
> * Realistic datasets will be much larger and more noisy than this 
>   dataset, the fact that SVM can track the random fluctuations so 
>   closely indicates that it has just "learned" the data set, it hasn't
>   generalized to any underlying structure, it wouldn't have that luxury
>   with a realistic dataset which will have tens of thousands of samples

This is an incorrect interpretation of my report.  There is no possibility
of overfitting in the way I have structured this test.  If my
explanations have not yet convinced you, it may be useful to look at
the runLATest function in the Predictor.java source for the exact
details of why overfitting is not an option here.

> 
> * Some of the points produced by the current BDA3 implementation are a 
>   bit suspect making me wonder whether there might be a bug - this, 
>   however - isn't my main concern
They seem to be working at every point I've inspected.  If there are
other points that seem wrong to you, we should find them for
completeness' sake.

>   powerful, only a head-to-head comparison between RTE and SVM will be
>   a fair test
I agree this is the best answer.

> I really feel that the only fair comparison will be with a large
> realistic dataset, and being compared against the actual algorithm we 
> have developed for this purpose, for which even BDA3 is a primitive 
> substitute.  I will try to extract such a dataset from Hui so that we 
> can make some progress here.
OK, this will be the best test.

Rudi

_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
