On Fri, 1 Jul 2005, Robert Menschel yowled:
> Since I wasn't mass-checking Bayes, all I did was one mass-check run
> specifying only my ham corpus, and then a second mass-check run
> specifying only my spam corpus. I then combined them for the
> frequency analysis.
>
> It should be feasible to modify the rescoring mass-check instructions
> so you do something like:
> a) initialize the mass-check (including remove any prior Bayes
> database)
> b) split your ham corpus (1-2 years) into 10 equal parts. Split your
> spam corpus (2-6 months) into 10 equal parts.
> c) Cycle through your 20 corpus files, running mass-check on each:
> oldest ham, oldest spam, next oldest ham, next oldest spam, etc.
> d) Combine all ham logs into one, combine all spam logs into one.
>
> It's not optimal, in that Bayes will be trained on emails out of time
> sequence, but it should shuffle them enough to get useful results out
> of it, IMO.
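For reference, steps (b) and (c) of the quoted scheme boil down to a
split-and-interleave ordering. A minimal Python sketch, assuming each
corpus is just a list of message file names already sorted oldest
first. The names split_equal and interleave are hypothetical, and
nothing here invokes mass-check itself:

```python
from typing import List, Sequence, Tuple


def split_equal(items: Sequence[str], parts: int) -> List[List[str]]:
    """Split items (sorted oldest first) into `parts` contiguous chunks
    whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks: List[List[str]] = []
    start = 0
    for i in range(parts):
        # The first `extra` chunks take one additional item each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(items[start:start + size]))
        start += size
    return chunks


def interleave(ham: Sequence[str], spam: Sequence[str],
               parts: int = 10) -> List[Tuple[str, List[str]]]:
    """Return the run order: oldest ham chunk, oldest spam chunk,
    next oldest ham chunk, next oldest spam chunk, and so on."""
    order: List[Tuple[str, List[str]]] = []
    for h, s in zip(split_equal(ham, parts), split_equal(spam, parts)):
        order.append(("ham", h))
        order.append(("spam", s))
    return order
```

Each (label, chunk) pair would then be one mass-check run, with the
ham and spam logs concatenated afterwards as in step (d).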
This is far more elaborate than needed, I think. Limiting the age of
your spam corpus (which I do anyway) and using mass-check normally will
do the trick, as mass-check runs through mails in temporal order. The
only `error' will be that ham of age [now - a couple of years] will
cohabit in the Bayes DB with spam of age [now - six months]. If this
caused a problem Bayes would be nearly useless anyway :)

If expiry runs, it ditches the ancient email first in any case.
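To make the simpler approach concrete, here is a toy Python model of
"limit the spam age, then process everything in temporal order". It is
only a sketch of the idea, not mass-check's own code: the function name
limit_and_order, the (timestamp, message id) tuples and the six-month
cutoff are all assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Message = Tuple[datetime, str]  # (timestamp, message id), hypothetical


def limit_and_order(ham: List[Message], spam: List[Message],
                    now: datetime, max_spam_age: timedelta
                    ) -> List[Tuple[datetime, str, str]]:
    """Drop spam older than max_spam_age, keep all ham, and return
    (timestamp, label, message id) tuples sorted oldest first."""
    cutoff = now - max_spam_age
    recent_spam = [(ts, "spam", mid) for ts, mid in spam if ts >= cutoff]
    all_ham = [(ts, "ham", mid) for ts, mid in ham]
    # Tuples sort by timestamp first, giving a single temporal sequence.
    return sorted(all_ham + recent_spam)


if __name__ == "__main__":
    now = datetime(2005, 7, 1)
    ham = [(datetime(2003, 8, 1), "h1"), (datetime(2005, 6, 1), "h2")]
    spam = [(datetime(2004, 6, 1), "s_old"), (datetime(2005, 3, 1), "s_new")]
    for ts, label, mid in limit_and_order(ham, spam, now, timedelta(days=183)):
        print(ts.date(), label, mid)
```

In the resulting sequence, old ham simply sits ahead of the first
retained spam, which is the only `error' described above.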
I think I'll do a few local perceptron runs on mass-checks with
different --limits after the rescoring mass-check is completed, and
see just what effect varying the limit on ham actually has. I'm
blithering in the absence of data right now.
--
`I lost interest in "blade servers" when I found they didn't throw knives
at people who weren't supposed to be in your machine room.'
--- Anthony de Boer