Alexander K. Seewald wrote:
> I have initially used linear regression
> (which similarily has no such guarantees) with good results.

I think I understand better now. What you are saying is that using the SVM instead of the perceptron is not what is giving you so much better results, although it could be expected to be somewhat better because it maximizes the margin of the hyperplane. Your much better results come from training on mail that is more representative of your users. And your major contribution has been to set up a procedure and scripts that make it easy to collect a corpus and generate scores that are specific to your users' environment.

> My intention was to provide this tool as a way for others which are
> similarily disappointed with the default score set to train their
> own score set easily. Whether or not the SA developers adopt SVMs instead
> of a repeated-run perceptron is of no concern for me. I still think
> that this warrants a link from the SA page. If you are of a
> different opinion, I will just rely on Google.

A link to your tool does not require the ASF license instead of the GPL. That is just a requirement for code that we distribute as part of SpamAssassin. We have no philosophical objection to people making their code available under the GPL or not wanting to see their code become part of closed source commercial products. That is your choice, just as it is the choice of the Apache Software Foundation to make code available under the terms of the ASF license, and the choice of the SpamAssassin developers to make our project part of the ASF.

There is no problem with the link. All such links are located on our wiki, starting at

http://wiki.apache.org/spamassassin/ThirdPartySoftware

In case you aren't familiar with a wiki, it is a web site that can be edited by anyone, so you are free to add the link yourself. The page I linked to above has several categories for open source software. Either add yours in the appropriate category, or add a new category if necessary.

It would be best if you did the edit yourself, choosing the category you think is best and describing your software the way you would like to. That's why we use a wiki for such documentation. We do appreciate contributions of links to useful tools like yours that are not part of the core SpamAssassin distribution.

> This is not interesting for me, as SA - even when trained with SVMs
> - performs very similar to a pure NaiveBayes learner (SpamBayes),
> and a single learner is preferrable for reasons of performance and
> the possibility for incremental updates, which would be hard for
> SA-Train.

Are you saying that there is no advantage in accuracy over using pure NaiveBayes, but you prefer to use SA-Train because it is simpler than ongoing incremental learning and the resulting model is smaller? How does the performance of SpamAssassin, used with SA-Train for rules and Bayes, compare with using the same training set to train SpamBayes?

> you should be able to reproduce it in your scripts quite easily:
> * train all mails from the example set via sa-learn
> * run Algorithm::SVM (lambda=1) on the set of rules, similar to the
>   perceptron. This prevents the licensing issues.
> * extract weights and optionally apply model to test set.

Thank you. I'm glad that you won't mind if someone does decide to implement your idea this way and contribute it to the project. In the meantime, as I said, we would be very happy to see you place a description and a link in the appropriate place in the SpamAssassin wiki.

-- 
Sidney Markowitz
http://www.sidney.com
