We discussed this at ApacheCon, but since all development discussion in theory happens here, here it is!

Goal:

 - reduce the mass-check process at release time to a single cycle
 - produce slightly higher-weight Bayes scores, because autolearning
   gives the most pessimistic picture of how people use Bayes (all
   forms of manual training work better)

How to fix:

 - add sample-based "autolearning" to mass-check

   mass-check will learn by sampling ham and spam rather than using
   the autolearn algorithm, removing the need for multi-stage
   mass-check at release time (since the feedback loop prevention is
   no longer needed)

   possibly add a sampling error to simulate autolearning error as
   well as human error (probably split the difference)

 - then we can do a single run with network and Bayes turned on and
   produce all four score sets

 - related, but not required, change to autolearning: balance spam
   and ham levels to improve accuracy, and experiment with additional
   changes to autolearning (ALL_TRUSTED and such) to improve results

I looked at my spam and ham corpus to see how often autolearning got
things wrong in real time:

 - spam learned as ham: 0.06% of the time
 - ham learned as spam: never

Which is about what we expected qualitatively: spam is mistakenly
learned as ham much more often than the reverse.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/
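
For illustration only, the sample-based learning idea above could be sketched in Python roughly as follows. This is not mass-check code: the function name, its parameters, and the (message_id, label) input format are all hypothetical, and the choice of independent per-message sampling with a per-class label-flip probability is an assumption about how "sampling" and "sampling error" might be modelled.

```python
import random


def sample_training_set(messages, sample_rate,
                        spam_as_ham_rate=0.0, ham_as_spam_rate=0.0,
                        seed=None):
    """Pick a random subset of hand-labelled messages to train Bayes on.

    messages: iterable of (message_id, true_label) pairs, where
              true_label is "ham" or "spam" (hypothetical input format).
    sample_rate: probability that any given message is learned at all.
    spam_as_ham_rate / ham_as_spam_rate: probability of learning a
              sampled message with the wrong label, to simulate
              autolearning or human training error (an assumption).

    Returns a list of (message_id, learned_label) pairs.
    """
    rng = random.Random(seed)
    learned = []
    for message_id, true_label in messages:
        # Skip most messages; only a sample is used for training.
        if rng.random() >= sample_rate:
            continue
        # Choose the error rate that applies to this message's class.
        if true_label == "spam":
            flip_rate, wrong_label = spam_as_ham_rate, "ham"
        else:
            flip_rate, wrong_label = ham_as_spam_rate, "spam"
        # Occasionally learn the message under the wrong label.
        if rng.random() < flip_rate:
            learned.append((message_id, wrong_label))
        else:
            learned.append((message_id, true_label))
    return learned
```

With the error rates measured above, a call might pass spam_as_ham_rate=0.0006 and ham_as_spam_rate=0.0; the value to use after "splitting the difference" with human error is not known here and is left as a parameter. Because the learned labels come from the corpus labels rather than from the message scores, nothing in this step depends on a previous scoring pass.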
