This is very good. Last semester I wrote a project paper comparing a single-layer perceptron, like the one we use to score rules, with a linear-kernel SVM for classifying cancer cells from microarray data. The conclusion was that single-layer perceptrons are not as bad as bioinformatics people generally assume, but linear-kernel SVMs are still a bit better. I expected that we would see similar results in rule scoring, i.e., we know that the perceptron performs OK, but we should see somewhat better scoring from a linear-kernel SVM. It was on my eventual to-do list to try it out. I'm glad Alexander was able to do it.
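
To make the comparison concrete, here is a minimal Python sketch of the two kinds of learner on toy data. It is an illustration only: it is not the SpamAssassin perceptron, not Alexander's port, and not libsvm. The SVM part is a plain batch subgradient descent on the regularized hinge loss rather than a real solver, and the two "rules", the four example messages, and all function names and parameter values are made up for the example.

```python
# Toy comparison of a perceptron and a linear SVM on rule-hit vectors.
# Everything here (data, names, hyperparameters) is hypothetical.

# Each row says which of two made-up rules hit on a message.
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
# +1 = spam, -1 = ham
Y = [1, 1, -1, -1]


def score(w, b, x):
    """Linear score: weighted sum of rule hits plus a bias."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_perceptron(X, y, max_epochs=100):
    """Classic perceptron: update only on mistakes, stop when there are none."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * score(w, b, xi) <= 0:
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Batch subgradient descent on lam/2*|w|^2 + mean hinge loss.

    A stand-in for a real SVM solver, chosen only because it is short.
    """
    n = len(X)
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Points inside the margin (or misclassified) contribute to the gradient.
            if yi * score(w, b, xi) < 1:
                grad_w = [g - yi * xj / n for g, xj in zip(grad_w, xi)]
                grad_b -= yi / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def accuracy(w, b, X, y):
    """Fraction of messages whose predicted label matches the true label."""
    hits = sum(1 for xi, yi in zip(X, y)
               if (1 if score(w, b, xi) > 0 else -1) == yi)
    return hits / len(X)


def geometric_margin(w, b, X, y):
    """Smallest distance from any training point to the decision boundary."""
    norm = sum(wj * wj for wj in w) ** 0.5
    return min(yi * score(w, b, xi) for xi, yi in zip(X, y)) / norm


if __name__ == "__main__":
    for name, train in (("perceptron", train_perceptron),
                        ("linear SVM", train_linear_svm)):
        w, b = train(X, Y)
        print(name, w, b, accuracy(w, b, X, Y), geometric_margin(w, b, X, Y))
```

The perceptron stops at the first boundary that separates the training data, while the hinge-loss objective keeps pushing points away from the boundary. Comparing `geometric_margin` for the two results is one way to see how much slack each boundary leaves.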
In theory, the primary advantage of an SVM over a perceptron should be that its rule scores give better results on the mail they are applied to after the initial training. The idea is that the results of an SVM are more tolerant of changes in the data over time, and that matters because spam is always evolving.

Questions that I have regarding Alexander's SVM:

Alexander, did you try running the SpamAssassin perceptron and comparing the results from its rule scores with the results from the SVM's scores?

How does the speed of the SVM in Perl compare with the perceptron in C? It's OK if it is much slower, but it would be good to know that it is still practical to run rule scoring when we do a release.

For anyone with an opinion: The SVM code in Alexander's program says that it is a Perl port he did of the SVM code in WEKA, which is written in Java. WEKA is licensed under the GPL. Does anyone have an informed opinion on whether porting a Java program to Perl makes it a derivative work for copyright purposes, and whether we would have licensing issues with that?

There is a CPAN module, Algorithm::SVM, which is a Perl interface to libsvm. That would provide C-speed performance. libsvm is very high performance, actively maintained, and probably the most widely used SVM code out there. It would not present the licensing questions. ( http://www.csie.ntu.edu.tw/~cjlin/libsvm/COPYRIGHT )

Alexander, what do you think of calling Algorithm::SVM instead of the code that you ported?

 -- sidney
