Alexander K. Seewald wrote:
> I have initially used linear regression
> (which similarly has no such guarantees) with good results.

I think I understand better now. What you are saying is that using the
SVM instead of the perceptron is not what is giving you so much better
results, although it could be expected to be somewhat better because of
maximizing the margin of the hyperplane. Your much better results come
from training on mail that is more representative of your users. And
your major contribution has been to set up a procedure and scripts that
make it easy to collect a corpus and generate scores that are specific
to your users' environment.

> My intention was to provide this tool as a way for others who are
> similarly disappointed with the default score set to train their
> own score set easily. Whether or not the SA developers adopt SVMs instead
> of a repeated-run perceptron is of no concern for me. I still think
> that this warrants a link from the SA page. If you are of a
> different opinion, I will just rely on Google.

A link to your tool does not require the ASF license instead of the GPL.
That is just a requirement for code that we distribute as part of
SpamAssassin.
We have no philosophical objection to people making their code available
under GPL or not wanting to see their code part of closed source
commercial products. That is your choice, just as it is the choice of
the Apache Software Foundation to make code available under the terms of
the ASF license, and the choice of the SpamAssassin developers to make
our project part of the ASF.

There is no problem with the link. All such links are located on our
wiki, starting at http://wiki.apache.org/spamassassin/ThirdPartySoftware

In case you aren't familiar with a wiki, it is a web site that can be
edited by anyone, so you are free to add the link yourself. The page I
linked to above has several categories for open source software. Either
add yours in the appropriate category, or add a new category if
necessary. It would be best if you did the edit yourself, choosing the
category you think is best and describing your software the way you
would like to. That's why we use a wiki for such documentation.

We do appreciate contributions of links to useful tools like yours that
are not part of the core SpamAssassin distribution.

> This is not interesting for me, as SA - even when trained with SVMs
> - performs very similarly to a pure NaiveBayes learner (SpamBayes),
> and a single learner is preferable for reasons of performance and
> the possibility for incremental updates, which would be hard for
> SA-Train.

Are you saying that there is no advantage in accuracy over using pure
NaiveBayes, but you prefer to use SA-Train because it is simpler than
ongoing incremental learning and the resulting model is smaller?

How does the performance of SpamAssassin, used with SA-Train for rules
and Bayes, compare with using the same training set to train SpamBayes?

> you should be able to reproduce it in your scripts quite easily:
> * train all mails from the example set via sa-learn
> * run Algorithm::SVM (lambda=1) on the set of rules, similar to the
>   perceptron. This prevents the licensing issues.
> * extract weights and optionally apply model to test set.

Thank you. I'm glad that you won't mind if someone does decide to
implement your idea this way and contribute it to the project. In the
meantime, as I said, we would be very happy to see you place a
description and a link in the appropriate place in the SpamAssassin wiki.
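
For anyone who wants to experiment, the second and third steps quoted
above (fit a linear SVM over rule hits, then read the weights off as
scores) might be sketched roughly as follows. This is only an
illustration under my own assumptions: it is not SA-Train and it does
not use Algorithm::SVM. It is a small pure-Python linear SVM trained by
subgradient descent on the hinge loss. The rule names, the toy hit
matrix, and the parameters lam, epochs and lr are all made up for the
example, and lam is not meant to correspond to the lambda=1 setting
mentioned above. In practice the hit vectors would come from running
SpamAssassin over the corpus after the sa-learn step.

```python
# Hypothetical sketch: derive per-rule scores from rule-hit vectors using a
# linear SVM (hinge loss + L2 penalty) trained by subgradient descent.
# Not SA-Train and not Algorithm::SVM; names and data are illustrative only.

RULES = ["RULE_A", "RULE_B", "RULE_C"]   # made-up rule names

# One row per message: 1 if the rule fired on that message, else 0.
HITS = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
LABELS = [1, 1, 1, -1, -1]               # +1 = spam, -1 = ham


def decision(w, b, x):
    """Raw SVM score for one message: w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(rows, labels, lam=0.01, epochs=200, lr=0.1):
    """Return (weights, bias) minimising hinge loss plus an L2 penalty."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            margin = y * decision(w, b, x)
            # Gradient of the L2 penalty: shrink every weight slightly.
            w = [wi - lr * lam * wi for wi in w]
            # Hinge loss is active only when the margin is below 1.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """+1 (spam) if the score is positive, otherwise -1 (ham)."""
    return 1 if decision(w, b, x) > 0 else -1


weights, bias = train_linear_svm(HITS, LABELS)

# "Extract weights": each learned weight plays the role of a rule score.
for name, weight in zip(RULES, weights):
    print(f"score {name} {weight:.3f}")
```

A real score set would also have to map the bias term onto
SpamAssassin's fixed spam threshold, which this sketch does not attempt.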

 -- Sidney Markowitz
    http://www.sidney.com