On Tue, Jul 22, 2003 at 12:27:33PM -0400, Zlatin Balevsky wrote:
> do you think we could use that to make global estimates across all 
> nodes, i.e. instead of estimating the response time for each node in the 
> rt and then choosing the best, could we add the routingTable contents as 
> a discrete dimension and just get a value for that from the machine?

Yes, both options are conceivable.  I think we need to do some testing
to find out which works better.  Below, let me propose one possible
architecture whose decided advantage is simplicity (to start).

> Also, what would the cpu requirements be for the current case 
> (dimensional vectors) if the algorithm was used on between 10 to 100 
> learning samples up to 500 times per second?  Is adding new samples 
> expensive?  Roughly one/tenth of the queries on freenet result in data 
> being found and we may want to adjust for that.

SVMs work in two modes: regression or classification.  In both cases,
they receive a sequence of training samples, and are then asked to
make predictions.  Prediction is fast.  Training can be slower, but
you can make it less expensive by training on batches of input data
instead of on each new sample separately.

Now, let's imagine what an SVM might learn about a single remote host:

In classification mode, the SVM is presented with a sequence of
pairs (TrainingDatum, TargetDatum) such as:
((time=17411,key=0x3a4b), target=SUCCESS)
((time=17480,key=0x3a4b), target=SUCCESS)
((time=17485,key=0x3a4c), target=SUCCESS) # net goes out, e.g.
((time=17487,key=0x3a4b), target=FAIL)
These training data could be used to build an SVM that classifies
inputs into two categories: SUCCESS or FAIL.
Then, you might ask for a prediction on
((time=17489,key=0x3a4b)), and the SVM could reply "FAIL", for example.
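To make the data shapes concrete, here is a minimal Java sketch of
what such a classification training pair might look like.  The names
ClassificationSample and Target are hypothetical, not existing FreeNet
types:

```java
// Hypothetical container for one classification training pair,
// mirroring the ((time, key), target) examples above.
class ClassificationSample {
    enum Target { SUCCESS, FAIL }

    final long time;      // timestamp of the request
    final long key;       // routing key (shortened to a long here)
    final Target target;  // observed outcome

    ClassificationSample(long time, long key, Target target) {
        this.time = time;
        this.key = key;
        this.target = target;
    }

    public static void main(String[] args) {
        // The four training pairs from the example above:
        ClassificationSample[] samples = {
            new ClassificationSample(17411, 0x3a4b, Target.SUCCESS),
            new ClassificationSample(17480, 0x3a4b, Target.SUCCESS),
            new ClassificationSample(17485, 0x3a4c, Target.SUCCESS),
            new ClassificationSample(17487, 0x3a4b, Target.FAIL),
        };
        System.out.println(samples.length + " samples, last="
                + samples[3].target);
    }
}
```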

In regression mode, the SVM is trained to learn not categories,
but instead some measurable quantity, for instance milliseconds of
latency for successful requests:
((time=17411,key=0x3a4b), target=150)
((time=17480,key=0x3a4b), target=240)
((time=17485,key=0x3a4c), target=1780)    # net outage imminent
Then, the SVM predicts how long a request might take, given that
it will succeed.
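The regression data can be sketched the same way.  Here a trivial
mean-of-targets baseline stands in for the real SVM regression, just
to show the shapes involved; RegressionSample and predictLatency are
hypothetical names:

```java
// Hypothetical container for one regression training pair, with a
// trivial mean-latency baseline standing in for the SVM regressor.
class RegressionSample {
    final long time, key;
    final double latencyMs;  // regression target: latency of a
                             // successful request, in milliseconds

    RegressionSample(long time, long key, double latencyMs) {
        this.time = time;
        this.key = key;
        this.latencyMs = latencyMs;
    }

    // Placeholder prediction: mean of the training targets.  A real
    // SVM would use the (time, key) features to do much better.
    static double predictLatency(RegressionSample[] training) {
        double sum = 0;
        for (RegressionSample s : training) sum += s.latencyMs;
        return sum / training.length;
    }

    public static void main(String[] args) {
        // The three training pairs from the example above:
        RegressionSample[] samples = {
            new RegressionSample(17411, 0x3a4b, 150),
            new RegressionSample(17480, 0x3a4b, 240),
            new RegressionSample(17485, 0x3a4c, 1780),
        };
        System.out.println("predicted latency = "
                + predictLatency(samples) + " ms");
    }
}
```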

My proposal here is to train a pair of SVMs per remote host with
only the data relevant to that host.  The first SVM will predict
whether the request will succeed or fail.  If the first indicates
success, then the second will be called upon to predict how long the
request will take.

When deciding where to route, a simple heuristic to start with may be
to sort all hosts predicted to succeed according to expected latency,
and choose the best.
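That heuristic might look roughly like the sketch below.
HostPredictor is a hypothetical wrapper around the two per-host SVMs,
stubbed out here with fixed answers just to exercise the sorting:

```java
import java.util.*;

// Sketch of the routing heuristic: keep only hosts whose first SVM
// predicts SUCCESS, then sort by the second SVM's predicted latency.
class Router {
    // Hypothetical stand-in for the pair of per-host SVMs.
    interface HostPredictor {
        boolean predictSuccess(long time, long key);
        double predictLatencyMs(long time, long key);
    }

    static List<String> rankHosts(Map<String, HostPredictor> hosts,
                                  long time, long key) {
        List<String> ok = new ArrayList<>();
        for (Map.Entry<String, HostPredictor> e : hosts.entrySet())
            if (e.getValue().predictSuccess(time, key))
                ok.add(e.getKey());
        ok.sort(Comparator.comparingDouble(
                h -> hosts.get(h).predictLatencyMs(time, key)));
        return ok;  // best host first
    }

    static HostPredictor stub(boolean success, double latency) {
        return new HostPredictor() {
            public boolean predictSuccess(long t, long k) { return success; }
            public double predictLatencyMs(long t, long k) { return latency; }
        };
    }

    public static void main(String[] args) {
        Map<String, HostPredictor> hosts = new HashMap<>();
        hosts.put("hostA", stub(true, 240));
        hosts.put("hostB", stub(false, 50));  // predicted to fail: excluded
        hosts.put("hostC", stub(true, 150));
        System.out.println(rankHosts(hosts, 17489, 0x3a4b));
    }
}
```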

As for overall integration within the FreeNet source, here is one
way of designing it:  At program startup, spawn off a single thread
called "TrainingThread".  As FreeNet runs, it queues data in a pending
queue from the main thread.  FreeNet is also calling the Predict function
rapidly to make routing decisions using the "last good SVM model".
The training thread is simply looping over 
and over: First, it looks to see if there is at least one new TrainingInput
in the pending queue.  If so, it reads in *all* of the new inputs.
Now, it begins a new training cycle with all the new data and all of
the old data it already had from previous cycles.  Maybe ten seconds
later, it is done training and now replaces the old model (which was
meanwhile being used for prediction) with the new model that includes
more data.  To keep the training set from growing without bound, we
can set some maximum number of TrainingInput samples allowed, say
1000 -- above this, training points start getting randomly removed,
slowly rotating out old points nondeterministically.  To prevent CPU
overloading, we can also add an optional Thread.sleep somewhere in
the loop to throttle the TrainingThread.
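One iteration of that loop might look like the sketch below.  Model,
TrainingInput, and the train() call are hypothetical placeholders for
the real SVM code; the point is the queue draining, the capped sample
set with random eviction, and the atomic model swap:

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of one TrainingThread cycle: drain all pending inputs,
// cap the accumulated set by random eviction, retrain, swap model.
class TrainingThreadSketch {
    static final int MAX_SAMPLES = 1000;

    interface Model { /* predict methods would live here */ }

    static class TrainingInput { /* time, key, target, ... */ }

    static Model train(List<TrainingInput> data) {
        return new Model() {};  // stand-in for a slow SVM training run
    }

    final BlockingQueue<TrainingInput> pending =
            new LinkedBlockingQueue<>();
    final AtomicReference<Model> lastGoodModel = new AtomicReference<>();
    final List<TrainingInput> accumulated = new ArrayList<>();
    final Random rng = new Random();

    void trainOnce() throws InterruptedException {
        List<TrainingInput> fresh = new ArrayList<>();
        fresh.add(pending.take());  // block until at least one new input
        pending.drainTo(fresh);     // then grab everything else pending
        accumulated.addAll(fresh);
        // Randomly rotate out old points above the cap.
        while (accumulated.size() > MAX_SAMPLES)
            accumulated.remove(rng.nextInt(accumulated.size()));
        // Main thread keeps predicting on the old model until this swap.
        lastGoodModel.set(train(accumulated));
        // An optional Thread.sleep(...) here would throttle CPU use.
    }

    public static void main(String[] args) throws InterruptedException {
        TrainingThreadSketch t = new TrainingThreadSketch();
        for (int i = 0; i < 5; i++) t.pending.offer(new TrainingInput());
        t.trainOnce();
        System.out.println("samples=" + t.accumulated.size()
                + " model=" + (t.lastGoodModel.get() != null));
    }
}
```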

If this sounds like a good path to follow, I think a good first step
would be to clearly define the TrainingInput and TrainingTarget classes.
Then, use Java object Serialization to write a number of these out to
a file for offline testing and tuning by me and whoever else is
interested in some statistical detective work.  These should be
captured via a "typical use" FreeNet session.  I can then make a better
assessment of how the learning algorithm is performing, and hopefully
we can make some comparisons using cross-validation to determine which
learning algorithms and parameters are working best.
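Capturing samples for offline analysis could be as simple as the
sketch below.  This TrainingInput is a hypothetical minimal version
of the class still to be defined:

```java
import java.io.*;
import java.util.*;

// Sketch: write TrainingInput samples to a file with Java object
// serialization, then read them back, as proposed for offline tuning.
class CaptureSketch {
    static class TrainingInput implements Serializable {
        private static final long serialVersionUID = 1L;
        final long time, key;
        final boolean success;
        TrainingInput(long time, long key, boolean success) {
            this.time = time; this.key = key; this.success = success;
        }
    }

    public static void main(String[] args) throws Exception {
        File f = File.createTempFile("training", ".ser");
        List<TrainingInput> out = new ArrayList<>(Arrays.asList(
                new TrainingInput(17411, 0x3a4b, true),
                new TrainingInput(17487, 0x3a4b, false)));
        try (ObjectOutputStream oos = new ObjectOutputStream(
                new FileOutputStream(f))) {
            oos.writeObject(out);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new FileInputStream(f))) {
            @SuppressWarnings("unchecked")
            List<TrainingInput> in =
                    (List<TrainingInput>) ois.readObject();
            System.out.println("read " + in.size()
                    + " samples, first success=" + in.get(0).success);
        }
        f.delete();
    }
}
```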

Rudi

_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
