On Tue, Jul 22, 2003 at 12:27:33PM -0400, Zlatin Balevsky wrote:
> do you think we could use that to make global estimates across all
> nodes, i.e. instead of estimating the response time for each node in the
> rt and then choosing the best, could we add the routingTable contents as
> a discrete dimension and just get a value for that from the machine?
Yes, both options are conceivable. I think we need to do some testing to find out which works better. Let me propose one possible architecture below that has the decided advantage of simplicity (to start).

> Also, what would the cpu requirements be for the current case
> (dimensional vectors) if the algorithm was used on between 10 to 100
> learning samples up to 500 times per second? Is adding new samples
> expensive? Roughly one tenth of the queries on freenet result in data
> being found and we may want to adjust for that.

SVMs work in two modes: regression or classification. In both cases, they receive a sequence of training samples and are then asked to make predictions. Prediction is fast. Training can be slower, but you can make it less expensive by training on batches of input data instead of on each new sample separately.

Now, let's imagine what an SVM might learn about a single remote host. In classification mode, the SVM is presented with a sequence of pairs (TrainingDatum, TargetDatum) such as:

  ((time=17411, key=0x3a4b), target=SUCCESS)
  ((time=17480, key=0x3a4b), target=SUCCESS)
  ((time=17485, key=0x3a4c), target=SUCCESS)
  ((time=17487, key=0x3a4b), target=FAIL)     # net goes out, e.g.

These training data could be used to build an SVM that classifies inputs into two categories: SUCCESS or FAIL. Then you might ask for a prediction:

  ((time=17489, key=0x3a4b))

and the SVM can reply "FAIL", for example.

In regression mode, the SVM is trained to learn not categories but some measurable quantity, for instance milliseconds of latency for successful requests:

  ((time=17411, key=0x3a4b), target=150)
  ((time=17480, key=0x3a4b), target=240)
  ((time=17485, key=0x3a4c), target=1780)     # net outage imminent

Then the SVM predicts how long a request might take, given that it will succeed.

My proposal here is to train a pair of SVMs per remote host, using only the data relevant to that host. The first SVM will predict whether the request will succeed or fail.
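To make the per-host pairing concrete, here is a minimal sketch of how routing could consume the two models. The SvmModel interface, HostEstimator class, and rankHosts method are all hypothetical names for illustration; a real version would wrap whatever SVM library we end up choosing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a trained SVM; a real implementation would
// delegate to an actual SVM library's predict call.
interface SvmModel {
    double predict(double time, double key);
}

/** Holds the per-host pair of models: a classifier and a regressor. */
class HostEstimator {
    final String hostId;
    final SvmModel successClassifier; // output > 0 means predicted SUCCESS
    final SvmModel latencyRegressor;  // predicted milliseconds, given success

    HostEstimator(String hostId, SvmModel c, SvmModel r) {
        this.hostId = hostId;
        this.successClassifier = c;
        this.latencyRegressor = r;
    }
}

public class Router {
    /**
     * The simple heuristic proposed above: keep only hosts predicted to
     * succeed, then sort them by predicted latency, best first.
     */
    static List<String> rankHosts(List<HostEstimator> hosts,
                                  double time, double key) {
        List<HostEstimator> ok = new ArrayList<>();
        for (HostEstimator h : hosts) {
            if (h.successClassifier.predict(time, key) > 0) {
                ok.add(h);
            }
        }
        ok.sort(Comparator.comparingDouble(
                h -> h.latencyRegressor.predict(time, key)));
        List<String> ranked = new ArrayList<>();
        for (HostEstimator h : ok) {
            ranked.add(h.hostId);
        }
        return ranked;
    }
}
```

Note that a host whose classifier predicts FAIL never reaches the latency sort at all, which matches the two-stage proposal.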
If the first SVM indicates success, the second is then called upon to predict how long the request will take. When deciding where to route, a simple heuristic to start with is to sort all predicted-successful hosts by expected latency and choose the best.

As for overall integration within the Freenet source, here is one way of designing it. At program startup, spawn a single thread called "TrainingThread". As Freenet runs, the main thread queues data into a pending queue, while also calling the Predict function rapidly to make routing decisions using the "last good SVM model". The training thread simply loops: first, it checks whether there is at least one new TrainingInput in the pending queue. If so, it reads in *all* of the new inputs, then begins a new training cycle over the new data plus all of the old data it already had from previous cycles. Perhaps ten seconds later, it finishes training and replaces the old model (which was meanwhile still being used for prediction) with the new model that includes more data.

To keep the data set from growing too large, we can set a maximum number of TrainingInput samples, say 1000; above this, training points start getting randomly removed, slowly rotating out old points nondeterministically. To prevent CPU overloading, we can also add an optional Thread.sleep somewhere in the loop to throttle the TrainingThread.

If this sounds like a good path to follow, I think a good first step would be to clearly define the TrainingInput and TrainingTarget classes. Then, use Java object serialization to write a number of these out to a file for offline testing and tuning by me and whoever else is interested in some statistical detective work. These should be captured via a "typical use" Freenet session.
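The training-thread loop described above could be sketched roughly as follows. TrainingSample, trainModel, and the 100 ms throttle value are placeholders, not decided interfaces; the sample class is marked Serializable so the same objects could be dumped to a file for the offline analysis step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Placeholder for the TrainingInput + TrainingTarget pair; Serializable
// so captured sessions can be written out for offline tuning.
class TrainingSample implements java.io.Serializable {
}

class TrainingThread extends Thread {
    static final int MAX_SAMPLES = 1000; // cap proposed in the mail

    private final BlockingQueue<TrainingSample> pending = new LinkedBlockingQueue<>();
    private final List<TrainingSample> corpus = new ArrayList<>();
    private final Random rng = new Random();
    private volatile Object lastGoodModel; // read by the routing code

    /** Main thread calls this as requests complete; never blocks long. */
    void submit(TrainingSample s) {
        pending.add(s);
    }

    /** Routing code reads the last fully trained model at any time. */
    Object currentModel() {
        return lastGoodModel;
    }

    /** Randomly remove points above the cap, rotating out old data. */
    static void capSamples(List<TrainingSample> corpus, Random rng) {
        while (corpus.size() > MAX_SAMPLES) {
            corpus.remove(rng.nextInt(corpus.size()));
        }
    }

    public void run() {
        while (true) {
            try {
                // Block until at least one new sample arrives...
                corpus.add(pending.take());
                // ...then drain everything else currently queued.
                pending.drainTo(corpus);
                capSamples(corpus, rng);
                // Retrain on old + new data; this may take seconds.
                lastGoodModel = trainModel(corpus); // atomic swap
                Thread.sleep(100); // optional throttle on CPU usage
            } catch (InterruptedException e) {
                return;
            }
        }
    }

    Object trainModel(List<TrainingSample> data) {
        return new Object(); // stand-in for real SVM training
    }
}
```

Because lastGoodModel is swapped in a single volatile write, the prediction path never sees a half-trained model and never has to wait on the trainer.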
I can then make a better assessment of how the learning algorithm is performing, and hopefully we can make some comparisons using cross-validation to determine which learning algorithms and parameters are working best.

Rudi

_______________________________________________
devl mailing list
[EMAIL PROTECTED]
http://hawk.freenetproject.org:8080/cgi-bin/mailman/listinfo/devl
