On Sun, 18 Mar 2018 03:46:58 +0530
Saahil Sirowa wrote:

> Temporary Draft of my GSoC Proposal
> GSoC 2018 Proposal
> <https://docs.google.com/document/d/1-OCNv79sHvVViKwnrRYtlMiKWLCzz4xUW4tNOlmaTmw/edit?usp=sharing>

A few points.

You're placing too much emphasis on the lack of statistical independence between tokens. This is at most a minor problem, and Paul Graham alluded to some theoretical work suggesting it might actually be beneficial, though I've not seen that work myself. The assumption of independence can be removed by switching to the chi-squared method, which SA did many years ago. It's still called Bayes, but it's not Bayesian anymore.

You say that [Naive] Bayes doesn't scale well, but it actually does. The important thing is how classification and training scale with the number of tokens in the database, and that can be O(1) with the right database back-end.

Also note that Bayes only needs access to an email at the time of [auto]training. It doesn't require the admin to maintain corpora of spam and ham. Any statistical filter that does need such corpora will likely be less widely used than one that doesn't.
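
To make the chi-squared point concrete, here is a rough sketch in Python of chi-squared combining in the style of Fisher's method, as it is usually described for spam filters. It is not SA's actual code: the names chi2_sf_even and combine_chi2, the clamping bounds, and the choice to return 0.5 for an empty token list are all assumptions made for illustration. The inputs are per-token spam probabilities that would already have been looked up in the token database.

```python
import math

def chi2_sf_even(x, dof):
    """P(X >= x) for a chi-squared variable with an even number of
    degrees of freedom, using the closed-form series
    exp(-x/2) * sum_{i < dof/2} (x/2)^i / i!.
    Note: exp() underflows to 0.0 for very large x; a production
    version would need to guard against that loss of precision."""
    if dof <= 0 or dof % 2 != 0:
        raise ValueError("dof must be a positive even integer")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine_chi2(token_probs):
    """Combine per-token spam probabilities into one score in [0, 1].
    Scores near 1 suggest spam, near 0 suggest ham, and near 0.5 mean
    the evidence is absent or conflicting."""
    n = len(token_probs)
    if n == 0:
        return 0.5  # assumption: no evidence means "unsure"
    # Clamp so that log() never sees exactly 0 or 1 (bounds are arbitrary).
    probs = [min(max(p, 1e-6), 1.0 - 1e-6) for p in token_probs]
    # Close to 1 when the tokens look spammy, close to 0 when they look hammy.
    q_spam = chi2_sf_even(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # The mirror image, computed from the ham side.
    q_ham = chi2_sf_even(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    return (1.0 + q_spam - q_ham) / 2.0
```

Calling combine_chi2 on a handful of strongly spammy probabilities gives a score close to 1, a handful of strongly hammy ones gives a score close to 0, and evenly conflicting evidence lands at 0.5. That middle "unsure" region is the practical difference from the classic naive Bayes product.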
