Thus said Levi Pearson on Wed, 22 May 2013 13:41:19 -0600:

> Google has a  pretty big advantage here, as they've  both got industry
> experts at machine learning and a  gigantic corpus with which to train
> and tune their system.

They are also at a disadvantage, for all that data, because they use it
incorrectly. Despite the number of times I click that ``spam'' button,
those Facebook emails, and various other random ``welcome to X''
subscriptions, keep hitting my inbox. Their problem is that the data is
not isolated enough---the decision I make to mark a message as spam
should influence *my* preferences only.

I believe someone else mentioned in the thread that having the entire
company use the same evidence for their filters made it more
effective---this too runs contrary to sound statistical practice.
Filters need to be trained as close to the individual user as possible;
otherwise the pooled data introduces too much noise for the analysis.

For my own email, I've never used more than a simple Bayesian filter,
and it was much more accurate than SpamAssassin ever will be. It was
also more accurate than Gmail's filter (for me anyway, which is rather
the point I'm making).
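For anyone curious, the core of such a filter fits in a page. This is a
minimal sketch in the style of Paul Graham's essay (linked below), not
any particular implementation; the token counts, the ham-count doubling,
and the 15-token cutoff are Graham's suggested heuristics, and the
function names are my own:

```python
import math
from collections import Counter

def train(messages, labels):
    """Count which tokens appear in spam vs. ham messages."""
    spam, ham = Counter(), Counter()
    nspam = nham = 0
    for text, is_spam in zip(messages, labels):
        tokens = set(text.lower().split())  # presence, not frequency
        if is_spam:
            spam.update(tokens)
            nspam += 1
        else:
            ham.update(tokens)
            nham += 1
    return spam, ham, nspam, nham

def spam_probability(text, spam, ham, nspam, nham):
    """Combine per-token spam probabilities, Graham-style."""
    probs = []
    for tok in set(text.lower().split()):
        g = ham.get(tok, 0) * 2  # Graham doubles ham counts to bias
        b = spam.get(tok, 0)     # against false positives
        if g + b < 1:
            continue  # ignore tokens we've never seen
        p = (b / nspam) / (b / nspam + g / nham)
        probs.append(max(0.01, min(0.99, p)))  # clamp extremes
    # keep the 15 tokens whose probability is furthest from neutral
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    probs = probs[:15]
    if not probs:
        return 0.5
    prod = math.prod(probs)
    inv = math.prod(1 - p for p in probs)
    return prod / (prod + inv)
```

Because all the counts come from one mailbox, the filter learns *your*
notion of spam---exactly the per-user isolation Gmail's shared model
gives up.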

The best filter I've ever used was crm114, but it's gnarly to configure.
I've  also  used  it  for  non-spam  type  analysis,  which  was  pretty
interesting work.

This is pretty old, but still relevant:

http://www.paulgraham.com/spam.html

Andy
-- 
TAI64 timestamp: 40000000519d998c



/*
PLUG: http://plug.org, #utah on irc.freenode.net
Unsubscribe: http://plug.org/mailman/options/plug
Don't fear the penguin.
*/
