That helps a bit... I'm using the second method from your example config, but I only ever get an index of 0.308, even though I have 9,000+ terms in the spam table and 3,000+ in the ham table. I also have no counts in messages, and nothing in corpus. What am I missing? I've been adding mail using the spam@ and notspam@ mailets. Have I completely missed the trick here?!
d.

> -----Original Message-----
> From: Chris Means [mailto:[EMAIL PROTECTED]]
> Sent: 28 January 2003 15:00
> To: James Users List
> Subject: RE: Spam filtering mailets wanted...[spam 0.308]
>
> Hi Danny,
>
> > I'm slightly confused as to config..
> > what is the idea with the two methods..
> >
> > Method 1:
> > Load the corpus directly.
> > This relies on outside processes to build and maintain the corpus.
> >
> > Method 2:
> > Load the ham & spam tokens/counts and rebuild the corpus.
> > This relies on outside processes to maintain the ham & spam
> > token/counts tables.
>
> The routines have two "data stages", where their state can be preserved or
> not as desired.
>
> Stage 1. Email messages (Ham & Spam) need to be analyzed to determine
> token counts.
>
> Stage 2. Those token counts are used to build the corpus.
>
> If you're keeping a repository of the raw email messages (both Ham & Spam)
> then you could rebuild the token counts from scratch each time you rebuild
> the corpus. This saves needing and maintaining the *_ham, *_spam, and
> *_messagecounts tables. It would take longer to perform the analysis, but
> it could be performed at any time, and even on a different machine than the
> mail server, and the mail server wouldn't have to spend the processing time
> to maintain the additional tables.
>
> If however, you're not keeping a repository of the raw email messages, or if
> the admin wants to be able to dynamically maintain the corpus (updating it
> daily or something), then Method 2 would be better. This would allow
> messages to be flagged by the user as SPAM and immediately update the
> corpus. This method would allow users/administrators to potentially quickly
> stop new forms of SPAM getting past the blockers, and to use a variety of
> mechanisms for adding new Spam & Ham messages.
>
> Does that help explain the two methods a little better?
>
> -Chris

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
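
For anyone trying to picture the two "data stages" from the quoted message, here is a rough sketch in Python. It is not the mailet code: the function names, the tokenizer, and the probability formula (the per-token estimate from Paul Graham's "A Plan for Spam") are all assumptions made for illustration, and nothing in this thread confirms that James computes the corpus this way. It only shows how ham/spam token counts plus ham/spam message counts could combine into a corpus.

```python
import re
from collections import Counter


def count_tokens(messages):
    """Stage 1 (hypothetical helper): count token occurrences in raw messages."""
    counts = Counter()
    for text in messages:
        # Simple illustrative tokenizer; the real one may differ.
        counts.update(re.findall(r"[A-Za-z0-9$'-]+", text.lower()))
    return counts


def build_corpus(ham_counts, spam_counts, n_ham, n_spam):
    """Stage 2 (hypothetical helper): token counts -> per-token spam probability.

    Assumes a Graham-style estimate; n_ham and n_spam are message counts.
    """
    if n_ham == 0 or n_spam == 0:
        # Assumption of this sketch: with no message counts, no rates
        # can be computed, so the corpus stays empty.
        return {}
    corpus = {}
    for token in set(ham_counts) | set(spam_counts):
        g = 2 * ham_counts.get(token, 0)   # ham occurrences, double-weighted
        b = spam_counts.get(token, 0)      # spam occurrences
        if g + b < 5:
            continue                       # too rare to be informative
        ham_rate = min(1.0, g / n_ham)
        spam_rate = min(1.0, b / n_spam)
        p = spam_rate / (ham_rate + spam_rate)
        corpus[token] = max(0.01, min(0.99, p))
    return corpus


# Usage with made-up messages.
spam_msgs = ["buy cheap pills now", "cheap pills cheap"] * 3
ham_msgs = ["meeting notes attached", "lunch meeting tomorrow"] * 3

spam_counts = count_tokens(spam_msgs)
ham_counts = count_tokens(ham_msgs)
corpus = build_corpus(ham_counts, spam_counts, len(ham_msgs), len(spam_msgs))
```

In this sketch the message counts are a required input to stage 2, so token tables on their own are not enough to produce a corpus. Whether the same holds for the mailets is something the config or source would have to confirm.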
