Once you've loaded the messages, you need to update the counts in the messages table. Depending upon which routine you're using to load the data, this may occur automatically or not.
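
To illustrate why the message counts matter, here is a minimal sketch of a corpus build in Python. It assumes a calculation in the style of Paul Graham's "A Plan for Spam", which the routines discussed here may or may not follow exactly. The function name `build_corpus`, the use of plain dicts in place of the ham/spam tables, and the thresholds are all assumptions for illustration, not the mailet's actual code.

```python
def build_corpus(ham_counts, spam_counts, ham_messages, spam_messages):
    """Return {token: spam probability} from token counts and message counts.

    Hypothetical helper: the dicts stand in for the ham/spam token tables,
    and the two integers stand in for the message counts table.
    """
    if ham_messages <= 0 or spam_messages <= 0:
        # With no message counts there is nothing to normalise the token
        # counts against, so no corpus entries can be produced.
        return {}

    corpus = {}
    for token in set(ham_counts) | set(spam_counts):
        g = 2 * ham_counts.get(token, 0)  # ham counts doubled, as in Graham's essay
        b = spam_counts.get(token, 0)
        if g + b < 5:
            continue  # token seen too rarely to be informative

        ham_freq = min(1.0, g / ham_messages)
        spam_freq = min(1.0, b / spam_messages)
        p = spam_freq / (ham_freq + spam_freq)
        corpus[token] = max(0.01, min(0.99, p))  # clamp away from 0 and 1
    return corpus


ham = {"meeting": 10, "free": 1}
spam = {"free": 20, "offer": 8, "rare": 1}

empty = build_corpus(ham, spam, 0, 0)      # no message counts: empty corpus
corpus = build_corpus(ham, spam, 100, 50)  # counts present: three tokens scored
```

In this sketch the token tables can hold thousands of terms and the corpus still comes out empty until both message counts are non-zero.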
Try manually updating the message count table; you should then be able to build the corpus.

-Chris

> -----Original Message-----
> From: Danny Angus [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, January 28, 2003 10:28 AM
> To: James Users List; [EMAIL PROTECTED]
> Subject: RE: Spam filtering mailets wanted...
>
> That helps a bit...
>
> I'm using your example configs, the second method, but I only
> ever get an index of 0.308, even though I have 9,000+ terms in
> spam and 3000+ in ham tables.
> I also have no counts in messages, and nothing in corpus.
> What am I missing?
> I've been adding mail using the spam@ notspam@ mailets.
> Have I completely missed the trick here?!
>
> d.
>
> > -----Original Message-----
> > From: Chris Means [mailto:[EMAIL PROTECTED]]
> > Sent: 28 January 2003 15:00
> > To: James Users List
> > Subject: RE: Spam filtering mailets wanted...[spam 0.308]
> >
> > Hi Danny,
> >
> > > I'm slightly confused as to config..
> > > what is the idea with the two methods..
> > >
> > > Method 1:
> > > Load the corpus directly.
> > > This relies on outside processes to build and maintain
> > > the corpus.
> > >
> > > Method 2:
> > > Load the ham & spam tokens/counts and rebuild the corpus.
> > > This relies on outside processes to maintain the ham & spam
> > > token/counts tables.
> >
> > The routines have two "data stages", where their state can be
> > preserved or not as desired.
> >
> > Stage 1. Email messages (Ham & Spam) need to be analyzed to
> > determine token counts.
> >
> > Stage 2. Those token counts are used to build the corpus.
> >
> > If you're keeping a repository of the raw email messages (both
> > Ham & Spam) then you could rebuild the token counts from scratch
> > each time you rebuild the corpus. This saves needing and
> > maintaining the *_ham, *_spam, and *_messagecounts tables. It
> > would take longer to perform the analysis, but it could be
> > performed at any time, and even on a different machine than the
> > mail server, but it wouldn't have to spend the processing time to
> > maintain the additional tables.
> >
> > If however, you're not keeping a repository of the raw email
> > messages, or if the admin wants to be able to dynamically maintain
> > the corpus (updating it daily or something), then Method 2 would
> > be better. This would allow messages to be flagged by the user as
> > SPAM and immediately update the corpus. This method would allow
> > users/administrators to potentially quickly stop new forms of SPAM
> > getting past the blockers, and to use a variety of mechanisms for
> > adding new Spam & Ham messages.
> >
> > Does that help explain the two methods a little better?
> >
> > -Chris

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
