As part of a term project I'm about to finish, I've been looking at some aspects of the perceptron scoring we do, and I have some ideas for alternatives I would like to try.
Can someone tell me how many email samples and how many rules typically go into the perceptron run, and how long it usually takes to generate the scores from them? How many iterations does it usually take to converge?

These next questions are aimed at Henry or anyone else who understands them: What do you think about using a support vector machine (SVM) instead of a perceptron? Shouldn't that help a lot with overfitting and produce scores that are more robust in the face of the inevitable changes that creep into the data before the next release and rescoring?

What about a technique like RFE (recursive feature elimination) along with an SVM to select rules to be eliminated, rather than only looking at individual S/O ratios? Or is that not as applicable here because "spam" is not one homogeneous class, and rules act individually to catch members of different subclasses of the general class "spam"?

Ok, 24 hours until my term paper is due... That's enough distracting myself :-)

-- sidney
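To make the RFE-plus-SVM idea a bit more concrete, here is a minimal, untested sketch using scikit-learn's LinearSVC wrapped in RFE. Everything in it is a placeholder assumption, not part of the actual rescoring setup: the rule-hit matrix X, the spam/ham labels y, the sizes, and the number of rules kept are all made up just so the example runs.

# Hypothetical sketch: rank rules with a linear SVM plus recursive
# feature elimination, instead of scoring/eliminating them one at a
# time by individual S/O ratio. All data below is synthetic.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
n_messages, n_rules = 1000, 50

# Stand-in inputs: a binary rule-hit matrix (messages x rules) and
# labels (1 = spam, 0 = ham). In practice these would come from the
# mass-check logs used for rescoring.
X = rng.integers(0, 2, size=(n_messages, n_rules))
y = rng.integers(0, 2, size=n_messages)

# Linear SVM: the margin maximization is what should make the fit
# less prone to overfitting than a plain perceptron.
svm = LinearSVC(C=1.0, dual=False)

# RFE refits the SVM repeatedly, dropping the lowest-weight rules in
# batches, so each rule is judged in the context of all the others.
selector = RFE(svm, n_features_to_select=25, step=5)
selector.fit(X, y)

kept_rules = np.flatnonzero(selector.support_)
print("rules kept:", kept_rules)
print("ranking (1 = kept):", selector.ranking_)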
