Sidney Markowitz <[EMAIL PROTECTED]> writes:

> BTW, at least one spam learning filter I've seen reduces its memory 
> requirements by using a small hash size (like 32 bits) for representing 
> tokens. Such systems will show poorer results for learn-everything 
> compared to learn-on-error simply because of collision effects once they 
> learn too many tokens.
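
A toy illustration of the technique being described (hypothetical, not
any particular filter's code): store per-token statistics under a
truncated hash instead of the full token string, accepting that with
fewer bits, distinct tokens are more likely to collide into one slot:

  import hashlib

  def token_id(token, bits=32):
      # Truncate a strong hash to `bits` bits to save memory; distinct
      # tokens may then collide and share a single statistics slot.
      digest = hashlib.sha1(token.encode("utf-8")).digest()
      return int.from_bytes(digest, "big") % (2 ** bits)

  # e.g. token_id("viagra") -> one 32-bit slot in the token database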

Big difference between 32 bits and 40 bits, assuming a uniform (random)
hash distribution. The "birthday" figure is roughly how many tokens you
can hash before a collision becomes more likely than not:

  birthday for 2**40:
  1234605

  birthday for 2**32:
  77164
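
Those figures come from the usual birthday-bound approximation
sqrt(2 * N * ln 2), the point at which the probability of at least one
collision among uniformly distributed hash values reaches about 50%. A
quick Python sketch (rounding may make it differ by one from the
numbers above):

  import math

  def birthday_bound(bits):
      # Approximate count of uniform random hash values at which the
      # chance of at least one collision reaches 50%.
      return round(math.sqrt(2 * 2**bits * math.log(2)))

  for bits in (40, 32):
      print("birthday for 2**%d: %d" % (bits, birthday_bound(bits)))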
 
I'm less interested in train-on-error vs. train-on-everything and in
expiration, and more interested in how we can improve autolearning by
auto-adjusting the thresholds and balancing the spam/ham training
volume better.
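
One possible shape for that (a hypothetical sketch, not existing
SpamAssassin code; the default thresholds below merely mimic
SpamAssassin-style scores): nudge the autolearn thresholds whenever the
spam and ham training volumes drift too far apart:

  class AutoLearner:
      def __init__(self, spam_threshold=12.0, ham_threshold=0.1, step=0.5):
          self.spam_threshold = spam_threshold  # autolearn as spam at/above this
          self.ham_threshold = ham_threshold    # autolearn as ham at/below this
          self.step = step
          self.spam_learned = 0
          self.ham_learned = 0

      def consider(self, score):
          """Return 'spam', 'ham', or None as the autolearn verdict."""
          verdict = None
          if score >= self.spam_threshold:
              self.spam_learned += 1
              verdict = "spam"
          elif score <= self.ham_threshold:
              self.ham_learned += 1
              verdict = "ham"
          self._rebalance()
          return verdict

      def _rebalance(self):
          # If spam training outpaces ham 2:1, raise both thresholds so
          # we autolearn less spam and more ham; do the reverse when ham
          # training dominates.
          if self.spam_learned > 2 * (self.ham_learned + 1):
              self.spam_threshold += self.step
              self.ham_threshold += self.step
          elif self.ham_learned > 2 * (self.spam_learned + 1):
              self.spam_threshold -= self.step
              self.ham_threshold -= self.step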

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/
