On Sun, 18 Mar 2018 03:46:58 +0530
Saahil Sirowa wrote:

> Temporary Draft of my GSoC Proposal
> GSoC 2018 Proposal
> <https://docs.google.com/document/d/1-OCNv79sHvVViKwnrRYtlMiKWLCzz4xUW4tNOlmaTmw/edit?usp=sharing>

A few points.


You're placing too much emphasis on the lack of statistical
independence between tokens. This is at most a minor problem, and Paul
Graham alluded to some theoretical work suggesting it might actually
be beneficial, though I've not seen it myself. The assumption of
independence can be removed by switching to the chi-squared method,
which SA did many years ago. It's still called Bayes, but it's not
Bayesian anymore.
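
For illustration, here is a rough sketch of chi-squared combining in
the Robinson/Fisher style. It is not taken from the SA source; the
function names (chi2q, combine) are my own, and it assumes every
per-token probability lies strictly between 0 and 1.

```python
import math

def chi2q(x2, dof):
    """Upper tail of the chi-squared distribution; valid for even dof only."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.
    return min(total, 1.0)

def combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    # Fisher's statistic: -2 * sum(ln p) is chi-squared with 2n degrees
    # of freedom under the null hypothesis.
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    spamminess = 1.0 - chi2q(spam_stat, 2 * n)
    hamminess = 1.0 - chi2q(ham_stat, 2 * n)
    # Near 1 means spam, near 0 means ham, near 0.5 means unsure.
    return (spamminess - hamminess + 1.0) / 2.0
```

Tokens that agree strongly push the score towards 0 or 1, while
conflicting or uninformative tokens leave it near 0.5.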

 
You say that [Naive] Bayes doesn't scale well, but it actually does. The
important thing is how classification and training scale with the number
of tokens in the database, and that can be O(1) per token lookup or
update with the right database back-end.
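
To make the scaling point concrete, here is a minimal sketch that uses
an in-memory hash table as a stand-in for the database back-end. The
class and method names (TokenStore, train, token_prob) are hypothetical,
and the probability estimate is deliberately crude.

```python
class TokenStore:
    """Toy token database; a dict stands in for a hashed back-end."""

    def __init__(self):
        self.counts = {}  # token -> [spam_count, ham_count]
        self.nspam = 0
        self.nham = 0

    def train(self, tokens, is_spam):
        # One average O(1) update per distinct token in the message.
        idx = 0 if is_spam else 1
        for tok in set(tokens):
            self.counts.setdefault(tok, [0, 0])[idx] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def token_prob(self, token):
        # One average O(1) lookup, however many tokens are stored.
        spam, ham = self.counts.get(token, (0, 0))
        if self.nspam == 0 or self.nham == 0 or spam + ham == 0:
            return 0.5  # not enough data to say anything
        spam_ratio = spam / self.nspam
        ham_ratio = ham / self.nham
        p = spam_ratio / (spam_ratio + ham_ratio)
        # Clamp so that later log() calls never see exactly 0 or 1.
        return min(max(p, 0.01), 0.99)
```

The cost of classifying or training on a message depends on the number
of tokens in that message, not on the size of the database.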


Also note that Bayes only needs access to an email at the time of
[auto]training. It doesn't require the admin to maintain corpora of spam
and ham. Any statistical filter that does need such corpora will likely
be less widely used than one that doesn't.
