Chris and Serge,
You might want to look at this:
http://research.microsoft.com/~horvitz/junkfilter.htm
Turns out this was a research project between some folks at Microsoft and
Stanford about four years ago. Good data in the paper.
--- Noel
-----Original Message-----
From: Chris Means [mailto:[EMAIL PROTECTED]]
Sent: Sunday, August 18, 2002 23:47
To: Serge Knystautas; James Users List
Subject: RE: Anti-SPAM mailet
Hi Serge,
I've written a mailet that will work to build the word stats from messages
that are mailed to a particular email address on the server.
I'll be posting the code to the dev list for comments etc. later tonight.
My first pass uses a JDBC backend (a table of words/occurrences), which is
loaded at James start, then saved when James is shut down...just a first
pass...as I'd want the 'stats' updated more frequently.
IMAP just isn't there with James yet, and I wanted to make something that
would be relatively flexible (it's easy to just forward a message to a
'specific' account...one for SPAM samples, another for "good" samples).
Maybe we can put our parts together to make a whole...
-Chris
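
For anyone following along, here is a minimal sketch of the kind of
words/occurrences table described above. It is not the code Chris mentions
posting: the class name WordStats, its methods, and the tokenizing pattern
are all assumptions made for illustration, and the JDBC load/save step is
left out so the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory words/occurrences table (illustration only). */
final class WordStats {
    private final Map<String, Integer> counts = new HashMap<>();
    private int messages = 0;

    /** Adds every token of one message body to the table. */
    void addMessage(String body) {
        messages++;
        // Assumption: a token is a run of letters, digits, '$', apostrophes
        // or dashes, compared case-insensitively.
        String lower = body.toLowerCase(Locale.ROOT);
        for (String token : lower.split("[^a-z0-9$'-]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
    }

    /** Number of times the word has been seen across all added messages. */
    int count(String word) {
        return counts.getOrDefault(word.toLowerCase(Locale.ROOT), 0);
    }

    /** Number of messages added so far. */
    int messageCount() {
        return messages;
    }
}
```

One such table would be kept for the SPAM sample account and another for the
"good" sample account; writing the map out to the database more often than
at shutdown would address the update-frequency concern.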
> -----Original Message-----
> From: Serge Knystautas [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, August 18, 2002 10:09 PM
> To: James Users List; [EMAIL PROTECTED]
> Subject: Re: Anti-SPAM mailet
>
>
> Chris,
>
> I came across this link as well... I'm convinced this is a far more
> effective spam blocker than any blacklist/checksum/group spam blocker.
> Looks very very promising.
>
> I went ahead and put together a bunch of the code for this... I thought
> about how you would best want to build the corpus and for my money, I
> decided I would create the corpus based on IMAP folders. I'm working on an
> ant task that could, on a daily or weekly basis, trawl a set of IMAP folders
> to build the good and bad corpus.
>
> Anyway, I wrote code to tokenize MimeMessages, code that compares the
> good and bad corpus and builds the probability token set, a Bayesian
> calculator to combine the probabilities of the 15 most interesting words,
> and some other related utilities. It's still a ways from being anything
> useful, and it would be really great once James has solid IMAP support.
>
> The hard part about this approach, though, is that you need a decent-sized
> corpus to make it really usable. I think it's pretty clear you could have a
> matcher use the probability set to mark the message as spam or not...
> but again, building that corpus is the hardest part.
>
> Serge Knystautas
> Loki Technologies
> http://www.lokitech.com/
> ----- Original Message -----
> From: "Chris Means" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, August 16, 2002 2:48 PM
> Subject: Anti-SPAM mailet
>
>
> > Would anyone be interested in developing a mailet (or whatever) to
> > implement some of the anti-spam techniques described in this article
> > mentioned on /.?
> >
> > http://www.paulgraham.com/spam.html
> >
> > I'd rather it put a flag in the email so I could filter it in my email
> > client, but it would be nice to have the option of automatically
> > forwarding it to SPAMCop etc. for reporting purposes.
> >
> > Any thoughts?
> >
> > Thanks.
> >
> > -Chris
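
As a rough illustration of the probability token set and the Bayesian
calculator Serge describes, the sketch below follows the scheme in the Paul
Graham article linked in this thread, as best I read it. It is not Serge's
code. The class and method names are hypothetical, and the constants
(doubling good counts, the minimum of 5 occurrences, the 0.4 default, the
0.01/0.99 clamps) are taken from that article and should be treated as
assumptions to tune.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical Graham-style calculator (illustration only). */
final class SpamCalculator {
    static final int MOST_INTERESTING = 15;

    /** Spam probability of one token, from its counts in each corpus. */
    static double tokenProbability(int goodCount, int badCount,
                                   int goodMessages, int badMessages) {
        if (goodMessages <= 0 || badMessages <= 0) {
            throw new IllegalArgumentException("both corpora must be non-empty");
        }
        double g = 2.0 * goodCount; // good hits doubled to bias against false positives
        double b = badCount;
        if (g + b < 5) {
            return 0.4;             // too rare to judge; treated like an unseen word
        }
        double goodRatio = Math.min(1.0, g / goodMessages);
        double badRatio = Math.min(1.0, b / badMessages);
        double p = badRatio / (goodRatio + badRatio);
        return Math.max(0.01, Math.min(0.99, p));
    }

    /** Combines the probabilities that lie furthest from the neutral 0.5. */
    static double combine(List<Double> tokenProbabilities) {
        List<Double> sorted = new ArrayList<>(tokenProbabilities);
        sorted.sort(Comparator
                .comparingDouble((Double p) -> Math.abs(p - 0.5))
                .reversed());
        int n = Math.min(MOST_INTERESTING, sorted.size());
        double spam = 1.0;
        double ham = 1.0;
        for (double p : sorted.subList(0, n)) {
            spam *= p;
            ham *= 1.0 - p;
        }
        // Inputs are assumed to be clamped away from 0 and 1, so the
        // denominator cannot be zero. An empty list yields 0.5.
        return spam / (spam + ham);
    }
}
```

A matcher could then treat a message as spam when the combined value exceeds
some threshold (the article uses 0.9) and set a flag on the message rather
than rejecting it, which matches the original request in this thread.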