Hi Danny,

> I'm slightly confused as to config..
> what is the idea with the two methods..
>
> Method 1:
> Load the corpus directly.
> This relies on outside processes to build and maintain the corpus.
>
> Method 2:
> Load the ham & spam tokens/counts and rebuild the corpus.
> This relies on outside processes to maintain the ham & spam
> token/counts tables.

The routines have two "data stages", and the state of each can be preserved or discarded as desired.

Stage 1. Email messages (Ham & Spam) are analyzed to determine token counts.
Stage 2. Those token counts are used to build the corpus.

If you're keeping a repository of the raw email messages (both Ham & Spam), you could rebuild the token counts from scratch each time you rebuild the corpus. This saves needing and maintaining the *_ham, *_spam, and *_messagecounts tables. The analysis would take longer, but it could be run at any time, even on a different machine than the mail server, and no processing time would be spent maintaining the additional tables.

If, however, you're not keeping a repository of the raw email messages, or if the admin wants to maintain the corpus dynamically (updating it daily or something), then Method 2 would be better. A message flagged by a user as SPAM could then update the corpus immediately. This would potentially let users/administrators quickly stop new forms of SPAM from getting past the blockers, and use a variety of mechanisms for adding new Spam & Ham messages.

Does that help explain the two methods a little better?

-Chris
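
P.S. In case it helps, here is a rough Python sketch of the two stages. It is only an illustration, not the actual routines: the function names (count_tokens, build_corpus, add_message), the tokenizer pattern, and the scoring formula (relative spam frequency clamped to 0.01..0.99) are all assumptions of mine, and plain in-memory counters stand in for the *_ham, *_spam, and *_messagecounts tables.

```python
import re
from collections import Counter

# Hypothetical tokenizer; the real token rules may differ.
TOKEN_RE = re.compile(r"[a-z0-9$'!-]+")


def count_tokens(messages):
    """Stage 1: analyze raw messages, return (token counts, message count)."""
    counts = Counter()
    for text in messages:
        counts.update(TOKEN_RE.findall(text.lower()))
    return counts, len(messages)


def build_corpus(ham_counts, spam_counts, n_ham, n_spam):
    """Stage 2: turn token counts into a corpus of per-token spam scores.

    Assumed scoring: spam frequency relative to ham frequency,
    clamped to the range 0.01..0.99.
    """
    corpus = {}
    for token in set(ham_counts) | set(spam_counts):
        ham_freq = min(1.0, ham_counts[token] / max(n_ham, 1))
        spam_freq = min(1.0, spam_counts[token] / max(n_spam, 1))
        score = spam_freq / (ham_freq + spam_freq)
        corpus[token] = min(0.99, max(0.01, score))
    return corpus


def add_message(counts, n_messages, text):
    """Method 2 style update: fold one newly flagged message into stored counts."""
    new_counts, _ = count_tokens([text])
    counts.update(new_counts)
    return n_messages + 1


# Rebuilding from a raw message repository: run both stages every time.
ham_msgs = ["meeting at noon tomorrow", "lunch at noon"]
spam_msgs = ["cheap pills now", "cheap loans now"]
ham_counts, n_ham = count_tokens(ham_msgs)
spam_counts, n_spam = count_tokens(spam_msgs)
corpus = build_corpus(ham_counts, spam_counts, n_ham, n_spam)

# Maintaining the counts instead: a user flags one message as spam,
# the stored counts are updated, and only Stage 2 is rerun.
n_spam = add_message(spam_counts, n_spam, "meeting pills")
updated_corpus = build_corpus(ham_counts, spam_counts, n_ham, n_spam)
```

The point of the split is that Stage 1 is the expensive part. If the counts are kept, a newly flagged message only costs one small update plus a rerun of Stage 2; if they are not kept, Stage 1 has to be repeated over the whole repository.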
