Chris and Serge,

You might want to look at this:
http://research.microsoft.com/~horvitz/junkfilter.htm

Turns out this was a research project between some folks at Microsoft and
Stanford about four years ago.  Good data in the paper.

        --- Noel

-----Original Message-----
From: Chris Means [mailto:[EMAIL PROTECTED]]
Sent: Sunday, August 18, 2002 23:47
To: Serge Knystautas; James Users List
Subject: RE: Anti-SPAM mailet


Hi Serge,

I've written a mailet that will work to build the word stats from messages
that are mailed to a particular email address on the server.

I'll be posting the code to the dev list for comments etc. later tonight.

My first pass uses a JDBC backend (a table of words/occurrences), which is
loaded when James starts and saved when James shuts down...just a first
pass, as I'd want the 'stats' updated more frequently.
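
To make the words/occurrences idea concrete, here is a minimal sketch of
the counting side only. It is not the mailet's actual code: the class name
WordStats, the lower-casing, and the token-splitting rule are all
assumptions for illustration, and the JDBC load/save is left out (the map
would be filled from the table at startup and written back at shutdown).

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: keeps an in-memory table of token -> occurrence count.
final class WordStats {
    private final Map<String, Integer> counts = new HashMap<>();
    private int messages = 0;

    // Count every token in one message body. Assumption: letters, digits,
    // apostrophes, dollar signs and dashes belong to a token; everything
    // else separates tokens.
    void addMessage(String body) {
        messages++;
        for (String token : body.split("[^A-Za-z0-9'$-]+")) {
            if (!token.isEmpty()) {
                counts.merge(token.toLowerCase(), 1, Integer::sum);
            }
        }
    }

    // Occurrences of a token seen so far (0 if never seen).
    int count(String token) {
        return counts.getOrDefault(token.toLowerCase(), 0);
    }

    // Number of messages fed into this table.
    int messageCount() {
        return messages;
    }
}
```

One such table per sample account (one for SPAM, one for "good") would
give the two sets of counts the filter needs.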

IMAP just isn't there with James yet, and I wanted to make something that
would be relatively flexible (it's easy to just forward a message to a
'specific' account...one for SPAM samples, another for "good" samples).

Maybe we can put our parts together to make a whole...

-Chris

> -----Original Message-----
> From: Serge Knystautas [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, August 18, 2002 10:09 PM
> To: James Users List; [EMAIL PROTECTED]
> Subject: Re: Anti-SPAM mailet
>
>
> Chris,
>
> I came across this link as well... I'm convinced this is a far more
> effective spam blocker than any blacklist/checksum/group spam blocker.
> Looks very very promising.
>
> I went ahead and put together a bunch of the code for this... I thought
> about how you would best want to build the corpus and for my money, I
> decided I would create the corpus based on IMAP folders.  I'm working on an
> ant task that could, on a daily or weekly basis, trawl a set of IMAP folders
> to build the good and bad corpus.
>
> Anyway, I wrote the code to tokenize MimeMessages, the code that compares
> the good and bad corpus and builds the probability token set, the Bayesian
> calculator to combine the probabilities of the 15 most interesting words,
> and some other related utilities.  It's still a ways from being anything
> useful, and it would be really great once James has solid IMAP support.
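
As a rough illustration of the two steps described just above (building a
per-token probability, then combining the most interesting ones), here is
a minimal sketch. It follows the formulas in the Paul Graham article
linked further down in this thread as I read them; it is not Serge's code.
The class and method names are hypothetical, and the constants (doubling
the good count, the minimum of 5 occurrences, 0.4 for rarely seen tokens,
clamping to 0.01..0.99) are taken from that article rather than from this
thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of James.
final class SpamCombiner {

    // Spam probability of one token, given how often it occurs in each
    // corpus and how many messages each corpus holds.
    static double tokenProbability(int good, int bad, int nGood, int nBad) {
        double g = 2.0 * good;          // good hits count double to bias against false positives
        double b = bad;
        if (g + b < 5) {
            return 0.4;                 // too rare to say anything: mildly innocent
        }
        double gRatio = Math.min(1.0, g / nGood);
        double bRatio = Math.min(1.0, b / nBad);
        double p = bRatio / (gRatio + bRatio);
        return Math.max(0.01, Math.min(0.99, p));
    }

    // Combine token probabilities, using only the `limit` values that lie
    // farthest from the neutral 0.5 (the "most interesting" tokens).
    static double combine(List<Double> tokenProbs, int limit) {
        List<Double> sorted = new ArrayList<>(tokenProbs);
        sorted.sort(Comparator.comparingDouble((Double p) -> -Math.abs(p - 0.5)));
        double spam = 1.0;
        double ham = 1.0;
        for (Double raw : sorted.subList(0, Math.min(limit, sorted.size()))) {
            double p = Math.max(0.01, Math.min(0.99, raw)); // avoid a 0/0 result
            spam *= p;
            ham *= 1.0 - p;
        }
        return spam / (spam + ham);
    }
}
```

A matcher could then call SpamCombiner.combine(probs, 15) and treat the
message as spam above some threshold; where to put that threshold is a
tuning decision that depends on the corpus.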
>
> The hard part about this approach, though, is that you need a decent-sized
> corpus to make it really usable.  I think it's pretty clear you could have a
> matcher use the probability set to mark the message as spam or not...
> but again, building that corpus is the hardest part.
>
> Serge Knystautas
> Loki Technologies
> http://www.lokitech.com/
> ----- Original Message -----
> From: "Chris Means" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>
> Sent: Friday, August 16, 2002 2:48 PM
> Subject: Anti-SPAM mailet
>
>
> > Would anyone be interested in developing a mailet (or whatever) to
> > implement some of the anti-spam techniques described in this article
> > mentioned on /.?
> >
> > http://www.paulgraham.com/spam.html
> >
> > I'd rather it put a flag in the email so I could filter it in my email
> > client, but it would be nice to have the option of automatically forwarding
> > it to SPAMCop etc. for reporting purposes.
> >
> > Any thoughts?
> >
> > Thanks.
> >
> > -Chris


