We discussed this at ApacheCon, but since all development discussion
in theory happens here, here it is!

Goal:

 - reduce mass-check process at release time to a single cycle

 - produce slightly higher-weight Bayes scores, because autolearning
   is the most pessimistic model of how people actually use Bayes
   (all forms of manual training work better)

How to fix:

 - add sample-based "autolearning" to mass-check

   mass-check will learn by sampling ham and spam rather than using the
   autolearn algorithm, removing the need for multi-stage mass-check at
   release time (since the feedback loop prevention is no longer needed)

   possibly add a deliberate error rate to the sampling, to simulate
   autolearning error as well as human error (probably split the
   difference between the two)

 - then we can do a single run with network and Bayes turned on and
   produce all four score sets

 - related but optional change to autolearning: balance the spam and
   ham levels to improve accuracy, and experiment with additional
   changes to autolearning (ALL_TRUSTED and such) to improve results
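
To make the sampling idea above concrete, here is a rough sketch. It
is not mass-check code: the function name, the parameters, and the
(id, label) message representation are all made up for illustration,
and the default rates are placeholders. It picks a random sample of
the hand-classified corpus for Bayes training and optionally flips a
small fraction of labels to stand in for autolearning/human error.

```python
import random


def sample_for_training(messages, sample_rate=0.1,
                        spam_as_ham_rate=0.0, ham_as_spam_rate=0.0,
                        seed=None):
    """Pick messages to train Bayes on by random sampling.

    messages is an iterable of (msg_id, label) pairs, where label is
    the corpus classification, "ham" or "spam". Returns a list of
    (msg_id, learned_label) pairs. All names and rates here are
    hypothetical, not anything mass-check defines.
    """
    rng = random.Random(seed)
    selected = []
    for msg_id, label in messages:
        if label not in ("ham", "spam"):
            raise ValueError("unknown label: %r" % (label,))
        # Sample independently of any score, so there is no feedback
        # loop between the rules being scored and what Bayes learns.
        if rng.random() >= sample_rate:
            continue
        # Optionally mislearn a fraction of the sample to simulate
        # autolearning and human mistakes.
        flip_rate = spam_as_ham_rate if label == "spam" else ham_as_spam_rate
        if rng.random() < flip_rate:
            label = "ham" if label == "spam" else "spam"
        selected.append((msg_id, label))
    return selected
```

Calling it with something like sample_rate=0.1 and a small
spam_as_ham_rate would give the training set for the single run; the
right values for both rates are exactly what would need experimenting
with.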

I looked at my spam and ham corpus to see how often autolearning got
things wrong in real time:

  - spam learned as ham: 0.06% of the time
  - ham learned as spam: never

Which is about what we expected qualitatively: spam is mistakenly
learned as ham much more often than the reverse.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/
