Hi Vaishnavi,

I wrote a parser for the 12,000-message SpamAssassin public corpus
(http://spamassassin.apache.org/publiccorpus) based on SpamAssassin's
Bayes code.  If you would like to use it, you can download both the
parser and a pre-tokenized corpus from
http://stern.cs.dal.ca/publiccorpus-tokenized.tar.bz2.
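
For anyone who wants to experiment before downloading, here is a minimal
sketch of what tokenizing such a corpus might look like. It is not the
parser mentioned above and does not reproduce SpamAssassin's Bayes
tokenizer. The function names, the token pattern (3 to 15 letters or
digits), and the one-message-per-file directory layout are all
assumptions made for illustration.

```python
import email
import re
from email import policy
from pathlib import Path

# Assumed token shape: runs of 3 to 15 lowercase letters or digits.
# SpamAssassin's own Bayes tokenizer is considerably more elaborate.
TOKEN_RE = re.compile(r"[a-z0-9]{3,15}")


def tokenize_message(raw_bytes):
    """Return the set of tokens found in the Subject and text/plain parts."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    chunks = [str(msg.get("Subject", ""))]
    for part in msg.walk():
        # Multipart containers and non-text attachments are skipped.
        if part.get_content_type() == "text/plain":
            try:
                chunks.append(part.get_content())
            except (LookupError, UnicodeError):
                # Unknown or broken charset: skip this part.
                continue
    return set(TOKEN_RE.findall("\n".join(chunks).lower()))


def tokenize_corpus(directory):
    """Yield (file name, token set) for each file in a directory.

    Assumes one raw message per file, e.g. a ham or spam folder.
    """
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            yield path.name, tokenize_message(path.read_bytes())
```

Calling tokenize_corpus() once on a ham folder and once on a spam folder
would give per-message token sets that could feed a simple classifier.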

Henry

P.S.  Who is your advisor at UWash?

Vaishnavi Sannidhanam wrote:

Hi Theo,

I got a SpamAssassin corpus that had ~3500 ham and spam messages in it. I
was wondering if I could get a larger corpus, or a number of smaller
corpora that I could combine into a bigger one. Please let me know if
one is available somewhere.

Thank you very much; I really appreciate all your help,
Vaishnavi

-----Original Message-----
From: Theo Van Dinter [mailto:[EMAIL PROTECTED]
Sent: Friday, November 26, 2004 3:50 PM
To: [email protected]
Subject: Re: Spam assassin corpus


On Fri, Nov 26, 2004 at 02:06:05PM -0800, Vaishnavi Sannidhanam wrote:


I am a student at the University of Washington and I am doing a project on
classifying spam. I was wondering where I could find the SpamAssassin
corpus of ham and spam mails, and where I could also find some tools to
process these mails.



Hi.

Unfortunately, there is no single "SpamAssassin corpus".  The people
involved in development (including the folks who help out with score
generation and testing) each have their own private corpus of messages. The
tools under the "masses" directory of the tarball (specifically mass-check)
are used to generate logs from a corpus, recording the messages processed
and the results of the processing (namely, which rules hit).

That information is then used to generate the scores, determine which rules
are worth keeping during development, etc.

There is some more information available at:

http://wiki.apache.org/spamassassin/DevelopmentStuff


