Hi Serge,

I've written a mailet that builds the word stats from messages mailed to a particular email address on the server.
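To give a rough idea of what "word stats" means here, below is a minimal sketch of the counting step. It is not the mailet itself: the class name `WordStats`, its methods, and the choice of token characters (letters, digits, apostrophes, dollar signs, hyphens) are assumptions for illustration, and it is written against a current JDK rather than what James ships with. One instance would be kept per corpus (SPAM samples vs. "good" samples).

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not the actual mailet code: counts word occurrences
// for one corpus (for example the SPAM mailbox or the "good" mailbox).
class WordStats {
    private final Map<String, Integer> counts = new HashMap<>();

    // Lower-case the text, split it on every character that is not a letter,
    // digit, apostrophe, dollar sign or hyphen, and bump each token's count.
    void addMessage(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9'$-]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    // Number of times the word has been seen so far (0 if never seen).
    int count(String word) {
        return counts.getOrDefault(word.toLowerCase(Locale.ROOT), 0);
    }
}
```

In a mailet, `addMessage` would presumably be fed the text of each message delivered to the sample address, and the map is what would get loaded from and saved to the backing store.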
I'll be posting the code to the dev list for comments etc. later tonight. My first pass uses a JDBC backend (a table of words/occurrences), which is loaded when James starts and saved when James is shut down. It's just a first pass, as I'd want the 'stats' updated more frequently.

IMAP just isn't there with James yet, and I wanted to make something that would be relatively flexible (it's easy to just forward a message to a specific account: one for SPAM samples, another for "good" samples).

Maybe we can put our parts together to make a whole...

-Chris

> -----Original Message-----
> From: Serge Knystautas [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, August 18, 2002 10:09 PM
> To: James Users List; [EMAIL PROTECTED]
> Subject: Re: Anti-SPAM mailet
>
>
> Chris,
>
> I came across this link as well... I'm convinced this is a far more
> effective spam blocker than any blacklist/checksum/group spam blocker.
> Looks very very promising.
>
> I went ahead and put together a bunch of the code for this... I thought
> about how you would best want to build the corpus and for my money, I
> decided I would create the corpus based on IMAP folders. I'm working on an
> ant task that could on a daily or weekly basis trove a set of IMAP folders
> to build the good and bad corpus.
>
> Anyway, but I wrote code to tokenize MimeMessages, the code that compares
> the good and bad corpus and builds the probability token set, the Bayesian
> calculator to combine the probabilities of the 15 most interesting words,
> and some other related utilities. It's still a ways from being anything
> useful, and it would be really great once James has solid IMAP support.
>
> The hard part about this approach though is you need a decent sized corpus
> to make it really usable. I think it's pretty clear you could have a
> matcher use the probability set to either mark the message as spam or not...
> but again building that corpus is the hardest.
>
> Serge Knystautas
> Loki Technologies
> http://www.lokitech.com/
>
> ----- Original Message -----
> From: "Chris Means" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, August 16, 2002 2:48 PM
> Subject: Anti-SPAM mailet
>
>
> > Would anyone be interested in developing a mailet (or whatever) to
> > implement some of the anti-spam techniques described in this article
> > mentioned on /.?
> >
> > http://www.paulgraham.com/spam.html
> >
> > I'd rather it put a flag in the email so I could filter it in my email
> > client, but it would be nice to have the option of automatically forwarding
> > it to SPAMCop etc. for reporting purposes.
> >
> > Any thoughts?
> >
> > Thanks.
> >
> > -Chris
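
For reference, the per-word probability and the "15 most interesting words" combination mentioned in the quoted message could be sketched roughly as follows. This follows my reading of the formulas in the Paul Graham article linked above (good counts doubled, probabilities clamped to 0.01..0.99, 0.4 for rarely seen words); it is not Serge's code, and the class and method names (`SpamMath`, `tokenProbability`, `combine`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the probability math; names are illustrative only.
class SpamMath {

    // Probability that a message containing a token is spam, given how often
    // the token occurs in each corpus and how many messages each corpus holds.
    static double tokenProbability(int goodCount, int badCount,
                                   int goodMessages, int badMessages) {
        if (goodMessages <= 0 || badMessages <= 0) {
            throw new IllegalArgumentException("both corpora must be non-empty");
        }
        // Good occurrences are doubled to bias against false positives.
        if (2 * goodCount + badCount < 5) {
            return 0.4; // too rare to judge: treat like an unknown word
        }
        double g = Math.min(1.0, 2.0 * goodCount / goodMessages);
        double b = Math.min(1.0, (double) badCount / badMessages);
        double p = b / (g + b);
        return Math.max(0.01, Math.min(0.99, p)); // clamp away from 0 and 1
    }

    // Combine the maxInteresting probabilities furthest from the neutral 0.5.
    // Inputs are assumed to be clamped as above, so the divisor is never zero.
    static double combine(List<Double> probabilities, int maxInteresting) {
        List<Double> sorted = new ArrayList<>(probabilities);
        sorted.sort(Comparator.comparingDouble((Double p) -> Math.abs(p - 0.5)).reversed());
        double product = 1.0;
        double inverseProduct = 1.0;
        for (double p : sorted.subList(0, Math.min(maxInteresting, sorted.size()))) {
            product *= p;
            inverseProduct *= 1.0 - p;
        }
        return product / (product + inverseProduct);
    }
}
```

A matcher would call `combine(probabilities, 15)` for the tokens of an incoming message and flag the message when the result is above some threshold; the article suggests 0.9, but that would need tuning against a real corpus.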
