Nick Leverton said that the papers he has seen found that learn-on-error always works better than learn-everything. But I recall one that looked more carefully at longer-term results and found that learn-on-error degrades over time; its authors concluded it was best to retrain on fresh data every few months. (I don't have the reference handy.)
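
For clarity, here's a minimal sketch of the two regimes as I understand them, written in Python against a hypothetical filter object with train() and classify() methods (the names are illustrative, not any particular filter's API):

    def learn_everything(filter_, labelled_messages):
        # Train on every message, whether or not it was classified correctly.
        for msg, is_spam in labelled_messages:
            filter_.train(msg, is_spam)

    def learn_on_error(filter_, labelled_messages):
        # Train only when the filter gets a message wrong.
        for msg, is_spam in labelled_messages:
            if filter_.classify(msg) != is_spam:
                filter_.train(msg, is_spam)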

That makes sense if you consider that spam (and possibly ham) patterns change over time, all the more so because spammers actively adapt to try to beat spam filters.

BTW, at least one spam-learning filter I've seen reduces its memory requirements by using a small hash size (like 32 bits) for representing tokens. Such systems will show poorer results for learn-everything compared to learn-on-error simply because of collision effects once they learn too many tokens.
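
To put a rough number on the collision effect, here's a back-of-the-envelope birthday-problem estimate of how many tokens end up sharing a 32-bit hash slot as the database grows (the helper and the token counts are just an illustrative sketch, not measurements from any particular filter):

    def expected_collisions(n_tokens, hash_bits=32):
        # Expected number of tokens that land in a slot already occupied
        # by an earlier token: n - m * (1 - (1 - 1/m)**n).
        m = 2 ** hash_bits
        return n_tokens - m * (1 - (1 - 1 / m) ** n_tokens)

    for n in (100_000, 1_000_000, 10_000_000):
        print(f"{n:>10,} tokens -> ~{expected_collisions(n):,.0f} colliding tokens")

Since learn-everything grows the token count much faster than learn-on-error, and the collision count grows roughly with the square of the number of tokens learned, the small hash hurts learn-everything disproportionately.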

What I haven't seen discussed is the effect of token expiration as is done in SpamAssassin. Wouldn't that produce the same effect as periodic retraining, thereby allowing learn-everything to work well? Doesn't that prevent the problems of converging to a mean and slowing down the learning? How does the effect of token expiration compare to the use of back-propagation?
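
For concreteness, here is a generic sketch of age-based token expiry (my own toy illustration of the idea, not SpamAssassin's actual expiry algorithm; the class, field layout, and max_age_days default are all made up):

    import time

    class TokenStore:
        def __init__(self, max_age_days=120):
            self.max_age = max_age_days * 86400
            self.tokens = {}  # token -> (spam_count, ham_count, last_seen)

        def update(self, token, is_spam, now=None):
            now = now if now is not None else time.time()
            spam, ham, _ = self.tokens.get(token, (0, 0, now))
            self.tokens[token] = (spam + is_spam, ham + (not is_spam), now)

        def expire(self, now=None):
            # Drop tokens not seen within max_age, so stale patterns fade
            # out much as they would under periodic retraining.
            now = now if now is not None else time.time()
            self.tokens = {t: v for t, v in self.tokens.items()
                           if now - v[2] <= self.max_age}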

-- sidney
