Sidney Markowitz wrote:
As part of a term project I'm about to finish, I've been looking at some aspects of the perceptron scoring we do, and I have some ideas for alternatives I would like to try. Can someone tell me how many email samples and how many rules typically go into the perceptron run, and how long it usually takes to generate the scores from them? How many iterations does it usually take to converge?
For 3.0, I used a corpus of 500k messages. This was probably excessive. Things converged after the first few epochs; the rest (<1000) were spent fine-tuning.
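
A minimal sketch of that kind of run, assuming a rule-hit matrix X (one row per message, one column per rule) and labels y in {-1, +1}; the function name and the synthetic data below are illustrative only, not the project's actual trainer:

    import numpy as np

    # Minimal perceptron trainer over a rule-hit matrix. Each epoch sweeps
    # the corpus once; with separable data it stops early once an epoch
    # produces no misclassifications, matching the "converged after the
    # first few epochs" behavior described above.
    def train_perceptron(X, y, epochs=1000, lr=0.01):
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                if yi * (xi @ w + b) <= 0:  # misclassified: nudge weights
                    w += lr * yi * xi
                    b += lr * yi
                    errors += 1
            if errors == 0:  # no mistakes this epoch: converged
                break
        return w, b

    # Stand-in data for demonstration purposes only.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 20)).astype(float)  # fake rule hits
    y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)             # fake labels
    w, b = train_perceptron(X, y)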
These next questions are aimed at Henry, or anyone else who understands them: what do you think about using a support vector machine (SVM) instead of a perceptron? Shouldn't that help a lot with overfitting and produce scores that are more robust in the face of the inevitable changes that creep into the data before the next release and rescoring?
If someone can figure out how to train the SVM while maintaining the security constraints, I'm all for it. However, we haven't seen much difference in classification accuracy between our secure perceptron and a standard linear SVM. Perhaps the real trick is in increasing dimensionality with a kernel function?
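As a rough illustration of that last idea, here is a sketch comparing a linear SVM against an RBF-kernel SVM, which implicitly maps the rule-hit vectors into a higher-dimensional feature space; the data is synthetic stand-in material, not a real corpus:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC, LinearSVC

    # Stand-in rule-hit matrix and labels for demonstration only.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(2000, 50)).astype(float)
    y = (X[:, :5].sum(axis=1) > 2).astype(int)

    linear = LinearSVC(C=1.0, max_iter=10000)
    kernel = SVC(kernel="rbf", C=1.0, gamma="scale")  # implicit high-dim map
    print("linear SVM:", cross_val_score(linear, X, y, cv=5).mean())
    print("RBF SVM:   ", cross_val_score(kernel, X, y, cv=5).mean())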
What about a technique like RFE (recursive feature elimination) along with an SVM to select rules to be eliminated, rather than only looking at individual S/O ratios? Or is that not as applicable here, because "spam" is not one homogeneous class and rules act individually to catch members of different subclasses of the general class "spam"?
I'm not familiar with RFE.
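
For reference, RFE repeatedly refits a linear model, ranks features by weight magnitude, and prunes the weakest ones, so rules are evaluated in the context of the remaining set rather than one at a time. A minimal sketch using scikit-learn's RFE on synthetic stand-in data (not a real rule corpus):

    import numpy as np
    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    # Stand-in rule-hit matrix and labels for demonstration only.
    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(1000, 40)).astype(float)
    y = (X[:, 0] + X[:, 3] + X[:, 7] > 1).astype(int)

    # Each round refits the SVM and drops the `step` lowest-|weight| rules,
    # which is the alternative to per-rule S/O screening described above.
    selector = RFE(LinearSVC(max_iter=10000), n_features_to_select=10, step=2)
    selector.fit(X, y)
    print("surviving rule indices:", np.flatnonzero(selector.support_))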
Ok, 24 hours until my term paper is due... That's enough distracting myself :-)
Best of luck! Henry
