Sidney Markowitz wrote:
As part of a term project I'm about to finish I've been looking at some
aspects of the perceptron scoring we do and have some ideas for alternatives
I would like to try.

Can someone tell me how many email samples and how many rules typically go
into the perceptron run and how long it usually takes to generate the scores
from them? How many iterations does it usually take to converge?

For 3.0, I used a corpus of 500k messages.  This was probably excessive.
Things converged after the first few epochs; the rest (<1000) were spent
fine-tuning.
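For anyone who hasn't looked at the code, the basic loop is just ordinary
perceptron training over rule-hit vectors.  Here's a minimal sketch (toy data,
not our actual trainer or corpus; labels are +1 spam / -1 ham):

```python
def train_perceptron(samples, labels, lr=0.1, max_epochs=1000):
    """Train a linear perceptron; stop early once an epoch has no errors."""
    n = len(samples[0])
    w = [0.0] * n
    bias = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + bias
            if y * activation <= 0:          # misclassified: nudge weights
                for i in range(n):
                    w[i] += lr * y * x[i]
                bias += lr * y
                errors += 1
        if errors == 0:                      # converged early, as described above
            break
    return w, bias

# Toy rule-hit vectors: rule 0 fires only on spam.
X = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, -1, -1]
w, b = train_perceptron(X, y)
```

With separable data like this it converges in a handful of epochs; the "fine
tuning" epochs on a real corpus are what shave the scores toward stability.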

These next questions are aimed at Henry or anyone else who understands them:

What do you think about using a support vector machine (SVM) instead of a
perceptron? Shouldn't that help a lot with overfitting and produce scores
that are more robust in the face of the inevitable changes that creep into
the data before the next release and rescoring?

If someone can figure out how to train the SVM while maintaining the
security constraints, I'm all for it.  However, we haven't seen much
difference in classification accuracy between our secure perceptron and
a standard linear SVM.  Perhaps the real trick is in increasing
dimensionality with a kernel function?
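To make the kernel point concrete, here's a toy illustration (made-up data,
nothing to do with our actual rule features): XOR-style labels that no linear
scorer can separate become separable after an explicit quadratic feature map,
which is the explicit-map analogue of a degree-2 polynomial kernel.

```python
def train_perceptron(samples, labels, lr=0.1, max_epochs=1000):
    """Plain linear perceptron with early stopping on a clean epoch."""
    n = len(samples[0])
    w = [0.0] * n
    bias = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + bias
            if y * activation <= 0:
                for i in range(n):
                    w[i] += lr * y * x[i]
                bias += lr * y
                errors += 1
        if errors == 0:
            break
    return w, bias

# XOR-style labels: no straight line separates these in the original 2-D space.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, 1, 1, -1]

def phi(x):
    # Explicit quadratic map: (x1, x2) -> (x1, x2, x1*x2).
    # The cross term is what a degree-2 polynomial kernel supplies implicitly.
    return [x[0], x[1], x[0] * x[1]]

X_mapped = [phi(x) for x in X]
w, b = train_perceptron(X_mapped, y)
```

In the higher-dimensional space the linear model separates the classes
perfectly, which a linear perceptron or linear SVM on the raw features cannot.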

What about a technique like RFE (recursive feature elimination) along with an
SVM to select rules to be eliminated, rather than only looking at individual
S/O ratios? Or is that less applicable here because "spam" is not one
homogeneous class and rules act individually to catch members of different
subclasses of the general class "spam"?

I'm not familiar with RFE.
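For those in the same boat, the idea is simple: train a linear model, drop the
feature (rule) whose weight has the smallest magnitude, retrain on the
survivors, and repeat.  A minimal sketch with made-up rule-hit data:

```python
def train_perceptron(samples, labels, lr=0.1, max_epochs=1000):
    """Plain linear perceptron with early stopping on a clean epoch."""
    n = len(samples[0])
    w = [0.0] * n
    bias = 0.0
    for epoch in range(max_epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + bias
            if y * activation <= 0:
                for i in range(n):
                    w[i] += lr * y * x[i]
                bias += lr * y
                errors += 1
        if errors == 0:
            break
    return w, bias

def rfe(samples, labels, n_keep):
    """Recursive feature elimination: repeatedly retrain and drop the
    feature whose learned weight has the smallest magnitude."""
    active = list(range(len(samples[0])))
    while len(active) > n_keep:
        reduced = [[x[i] for i in active] for x in samples]
        w, _ = train_perceptron(reduced, labels)
        weakest = min(range(len(active)), key=lambda j: abs(w[j]))
        del active[weakest]
    return active

# Toy data: the fourth "rule" never fires, so its weight stays zero
# and RFE eliminates it first.
X = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0]]
y = [1, 1, -1, -1]
kept = rfe(X, y, 3)
```

In practice the ranking would come from an SVM's weight vector rather than a
perceptron's, but the elimination loop is the same.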

Ok, 24 hours until my term paper is due... That's enough distracting myself :-)

Best of luck!
Henry
