Thus said Levi Pearson on Wed, 22 May 2013 13:41:19 -0600:

> Google has a pretty big advantage here, as they've both got industry
> experts at machine learning and a gigantic corpus with which to train
> and tune their system.
They are also at a disadvantage with so much data, because they use it
incorrectly. No matter how many times I click that ``spam'' button, those
Facebook emails and various other random ``welcome to X'' subscriptions
keep hitting my inbox. Their problem is that the data is not isolated
enough: the decision I make to mark a message as spam should influence
*my* preferences only. Someone else in the thread mentioned that having
the entire company share the same evidence for their filters made it more
effective; this too runs contrary to sound statistical analysis. Filters
need to be trained as close to the individual user as possible, otherwise
there is too much noise in the data for the analysis to be useful.

For my own email, I've never used more than a simple Bayesian filter, and
it was much more accurate than SpamAssassin ever will be. It was also more
accurate than Gmail's filter (for me, anyway, which is rather the point
I'm making). The best filter I've ever used is crm114, but it's gnarly to
configure. I've also used it for non-spam classification, which was pretty
interesting work.

This is pretty old, but still relevant:
http://www.paulgraham.com/spam.html

Andy
--
TAI64 timestamp: 40000000519d998c

/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
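P.S. For anyone curious what a per-user Graham-style filter boils down to,
it's essentially naive Bayes over word occurrences: train separate token
counts from *your own* spam and ham, then score new mail against those
counts. A minimal sketch in Python; the class and function names here are
my own invention, and a real filter (crm114 especially) does far more
sophisticated tokenization and weighting:

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-alphanumerics; real filters use much
    # richer tokenizers (headers, HTML, token pairs, etc.).
    return ''.join(c if c.isalnum() else ' ' for c in text.lower()).split()

class BayesFilter:
    """Toy per-user naive Bayes spam filter (Graham-style, simplified)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # token counts per class
        self.msgs = {True: 0, False: 0}                    # messages seen per class

    def train(self, text, is_spam):
        self.msgs[is_spam] += 1
        # Count each token at most once per message, as Graham does.
        self.counts[is_spam].update(set(tokenize(text)))

    def spam_score(self, text):
        """Return P(spam | text) under naive Bayes with Laplace smoothing."""
        total = self.msgs[True] + self.msgs[False]
        logp = {}
        for cls in (True, False):
            # Smoothed class prior.
            logp[cls] = math.log((self.msgs[cls] + 1) / (total + 2))
            for tok in set(tokenize(text)):
                # Smoothed P(token appears | class).
                p = (self.counts[cls][tok] + 1) / (self.msgs[cls] + 2)
                logp[cls] += math.log(p)
        # Normalize the two log-probabilities into a posterior.
        m = max(logp.values())
        exp = {c: math.exp(v - m) for c, v in logp.items()}
        return exp[True] / (exp[True] + exp[False])
```

The point about isolation falls out of the model: the counters hold only
one user's clicks, so one person's ``welcome to X'' ham never pollutes
another's spam evidence.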
