Sidney Markowitz <[EMAIL PROTECTED]> writes:

> BTW, at least one spam learning filter I've seen reduces its memory
> requirements by using a small hash size (like 32 bits) for representing
> tokens. Such systems will show poorer results for learn everything
> compared to learn on error simply because of collision effects once they
> learn too many tokens.
Big difference between 32 bits and 40 bits, assuming hash values are
uniformly (randomly) distributed:

  birthday for 2**40: 1234605
  birthday for 2**32: 77164
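Those are the usual 50% birthday thresholds: solve
1 - exp(-k(k-1)/(2N)) = 1/2 for k, where N = 2**bits is the number of
hash slots. A quick Python sketch (illustrative only, not code from any
actual filter) reproduces both figures:

    import math

    def birthday_bound(bits):
        """Tokens needed for ~50% collision odds in a 2**bits hash space."""
        n = 2 ** bits  # number of hash slots
        # Solving 1 - exp(-k*(k-1)/(2*n)) = 1/2 for k gives
        #   k = (1 + sqrt(1 + 8*n*ln 2)) / 2, rounded up.
        return math.ceil((1 + math.sqrt(1 + 8 * n * math.log(2))) / 2)

    for bits in (40, 32):
        print(f"birthday for 2**{bits}: {birthday_bound(bits)}")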
I'm less interested in train-on-error vs. train-on-everything and
expiration, and more interested in how we can improve autolearning by
auto-adjusting the thresholds and balancing the spam/ham training volume
better.

Daniel

--
Daniel Quinlan
http://www.pathname.com/~quinlan/