Re: NOTICE: 3.1.0 rescoring mass-checks

Nix Sat, 02 Jul 2005 02:47:09 -0700

On Fri, 1 Jul 2005, Robert Menschel yowled:
> Since I wasn't mass-checking Bayes, all I did was one mass-check run
> specifying only my ham corpus, and then a second mass-check run
> specifying only my spam corpus.  I then combined them for the
> frequency analysis.
> 
> It should be feasible to modify the rescoring mass-check instructions
> so you do something like:
> a) initialize the mass-check (including remove any prior Bayes
> database)
> b) split your ham corpus (1-2 years) into 10 equal parts. Split your
> spam corpus (2-6 months) into 10 equal parts.
> c) Cycle through your 20 corpus files, running mass-check on each:
> oldest ham, oldest spam, next oldest ham, next oldest spam, etc.
> d) Combine all ham logs into one, combine all spam logs into one.
> 
> It's not optimal, in that Bayes will be trained on emails out of time
> sequence, but it should shuffle them enough to get useful results out
> of it, IMO.
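
The cycle in steps b) to d) of the quoted proposal can be sketched roughly
as follows. This is only an illustration of the ordering, not part of
mass-check: the function names are hypothetical, each corpus is assumed to
be a list of message identifiers already sorted oldest first, and running
mass-check on each part is left out.

```python
from typing import List, Sequence, Tuple


def split_into_parts(items: Sequence[str], n_parts: int) -> List[List[str]]:
    """Split items (assumed sorted oldest first) into n_parts nearly equal chunks."""
    size, extra = divmod(len(items), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        parts.append(list(items[start:end]))
        start = end
    return parts


def interleave_schedule(
    ham: Sequence[str], spam: Sequence[str], n_parts: int = 10
) -> List[Tuple[str, List[str]]]:
    """Return the run order: oldest ham, oldest spam, next oldest ham, ..."""
    schedule: List[Tuple[str, List[str]]] = []
    ham_parts = split_into_parts(ham, n_parts)
    spam_parts = split_into_parts(spam, n_parts)
    for ham_part, spam_part in zip(ham_parts, spam_parts):
        schedule.append(("ham", ham_part))
        schedule.append(("spam", spam_part))
    return schedule
```

Each entry in the returned schedule would correspond to one mass-check run;
the ham logs and the spam logs are then concatenated separately, as in
step d).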


This is far more elaborate than needed, I think. Limiting the age of
your spam corpus (which I do anyway) and using mass-check normally will
do the trick, as mass-check runs through mails in temporal order.  The
only `error' will be that ham of age [now - a couple of years] will
cohabit in the Bayes DB with spam of age [now - six months]. If this
caused a problem Bayes would be nearly useless anyway :)

If expiry runs, it ditches the most ancient email first in any case.
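
To make the point concrete, here is a small sketch of what "limit the age
of each corpus, then go through everything in temporal order" amounts to.
It is not how mass-check is implemented; the tuple layout, the function
name and the age limits are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple

# (received time, "ham" or "spam", identifier); a hypothetical layout.
Message = Tuple[datetime, str, str]


def select_for_run(
    messages: Iterable[Message],
    now: datetime,
    max_ham_age: timedelta,
    max_spam_age: timedelta,
) -> List[Message]:
    """Drop mail older than its class limit, then sort oldest first."""
    limits = {"ham": max_ham_age, "spam": max_spam_age}
    kept = [m for m in messages if now - m[0] <= limits[m[1]]]
    # Ham and spam end up interleaved purely by timestamp.
    return sorted(kept, key=lambda m: m[0])
```

With a two-year limit on ham and a six-month limit on spam, the oldest
surviving ham simply precedes all of the spam in the resulting order, which
is the only `error' described above.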


I think I'll do a few local perceptron runs with mass-checks with
different --limits after the rescoring mass-check is completed, and
see just what effect varying the limit on ham actually has. I'm
blithering in the absence of data right now.

-- 
`I lost interest in "blade servers" when I found they didn't throw knives
 at people who weren't supposed to be in your machine room.'
    --- Anthony de Boer
