Hi Danny,

> I'm slightly confused as to config..
> what is the idea with the two methods..
>
>      Method 1:
>        Load the corpus directly.
>        This relies on outside processes to build and maintain the corpus.
>
>      Method 2:
>        Load the ham & spam tokens/counts and rebuild the corpus.
>        This relies on outside processes to maintain the ham & spam
>        token/counts tables.

The routines work in two "data stages", and the state at each stage can be
preserved or not, as desired.

Stage 1.  Email messages (Ham & Spam) need to be analyzed to determine
token counts.

Stage 2.  Those token counts are used to build the corpus.
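
To make the two stages concrete, here is a rough Python sketch.  The function
names, the tokenizer regex, and the probability formula (a simple ratio of
per-message token rates, clamped to 0.01..0.99) are all assumptions made for
illustration, not necessarily what the actual routines do.

```python
import re
from collections import Counter


def count_tokens(messages):
    """Stage 1: tally how often each token appears across a set of messages."""
    counts = Counter()
    for text in messages:
        # Hypothetical tokenizer: runs of letters, digits, $, ' and -
        counts.update(re.findall(r"[a-z0-9$'-]+", text.lower()))
    return counts


def build_corpus(ham_counts, spam_counts, n_ham, n_spam):
    """Stage 2: turn token counts into a token -> spam-probability map."""
    corpus = {}
    for token in set(ham_counts) | set(spam_counts):
        # Rate of the token per message in each class, capped at 1.0
        ham_rate = min(1.0, ham_counts.get(token, 0) / max(n_ham, 1))
        spam_rate = min(1.0, spam_counts.get(token, 0) / max(n_spam, 1))
        p = spam_rate / (ham_rate + spam_rate)
        # Clamp so no single token is ever treated as absolute proof
        corpus[token] = min(0.99, max(0.01, p))
    return corpus


ham = ["meeting at noon"]
spam = ["free money now", "free offer"]
corpus = build_corpus(count_tokens(ham), count_tokens(spam), len(ham), len(spam))
```

The output of count_tokens is the state you can choose to keep (the token
count tables) or throw away; the output of build_corpus is the corpus itself.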

If you're keeping a repository of the raw email messages (both Ham & Spam),
then you could rebuild the token counts from scratch each time you rebuild
the corpus.  That is the Method 1 case, and it saves you from needing and
maintaining the *_ham, *_spam, and *_messagecounts tables.  The analysis
would take longer, but it could be performed at any time, even on a
different machine than the mail server, and the mail server wouldn't have to
spend processing time maintaining the additional tables.

If, however, you're not keeping a repository of the raw email messages, or if
the admin wants to maintain the corpus dynamically (updating it daily or
something), then Method 2 would be better.  A message flagged by a user as
Spam could then update the corpus immediately.  That would let
users/administrators potentially stop new forms of Spam from getting past
the blockers quickly, and use a variety of mechanisms for adding new
Spam & Ham messages.
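
As a rough illustration of that dynamic case, the Python sketch below keeps
running token counts and folds in each newly flagged message.  TokenStore and
its methods are hypothetical names, the in-memory dictionaries merely stand in
for the *_ham, *_spam, and *_messagecounts tables, and the probability formula
and the 0.5 default for unseen tokens are assumptions.

```python
import re
from collections import Counter


class TokenStore:
    """In-memory stand-in for the persisted token count tables."""

    def __init__(self):
        self.counts = {"ham": Counter(), "spam": Counter()}
        self.messages = {"ham": 0, "spam": 0}

    def add_message(self, kind, text):
        """Fold one flagged message ("ham" or "spam") into the stored counts."""
        # Hypothetical tokenizer: runs of letters, digits, $, ' and -
        self.counts[kind].update(re.findall(r"[a-z0-9$'-]+", text.lower()))
        self.messages[kind] += 1

    def spam_probability(self, token):
        """Recompute one corpus entry from the current counts."""
        ham_rate = min(1.0, self.counts["ham"][token] / max(self.messages["ham"], 1))
        spam_rate = min(1.0, self.counts["spam"][token] / max(self.messages["spam"], 1))
        if ham_rate + spam_rate == 0:
            return 0.5  # assumed neutral value for a token never seen before
        return min(0.99, max(0.01, spam_rate / (ham_rate + spam_rate)))


store = TokenStore()
store.add_message("ham", "lunch at noon")
before = store.spam_probability("viagra")   # token not seen yet
store.add_message("spam", "cheap viagra")   # a user flags a new Spam message
after = store.spam_probability("viagra")    # reflects the flag right away
```

No raw messages need to be kept here: once a message has been counted it can
be discarded, which is the trade-off against rebuilding from a repository.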

Does that help explain the two methods a little better?

-Chris


