On 19/11/14 01:31, Robert J. Hansen wrote:
> No. Client-side, you get to inspect (fully) only your data, and you
> have to develop a statistical model of spam based on only your data.
> When Gmail filters, it inspects (fully) traffic to *millions* of users,
> and uses that to create a model no individual user can hope to match.
I agree with several other important points you raise, but this one is not a big deal. I have a highly customized mail setup. My SpamAssassin downloads rules from the internet, but trains its Bayesian filter only on the e-mail I personally receive.

Everyone who has ever sent me a non-spam mail is added to a whitelist. Mail from whitelisted people never gets automatically moved to the Spam box, and my mail client shows their messages in a different color. As soon as I receive a spam mail from such an address, I immediately (manually) delete it from the whitelist (actually, I move it to a greylist so it isn't added to the whitelist again next time).

I also have a blacklist, but it is empty. It would cause mail to be silently deleted. Somebody once had the honour of having me create it and put him on it :).

SpamAssassin throws spams in a Spam folder for me to check every few weeks. I sort them by subject line so I can quickly scan through. Mail I've confirmed as spam is still kept around for quite a while, just in case someone writes to me "I wrote you months ago and you haven't replied". Then I can go back through everything I've already written off as spam to see whether I overlooked their mail.

This setup works great for me. If I get a few false positives in a year, that is a lot; they are so scarce that I'm completely unsure what the actual number is. I do get false negatives, but it doesn't feel like more than 10 each week, plus the occasional short surge of nearly identical spams.[1]

I still think your overall point stands, and stands tall. But the spam filtering issue: from personal experience, I don't think that's a really major issue. If it were, I'm sure we could think of some way to have publicly available training data that individuals refine and feed back into the public corpus. It would need some thought: you don't want a genuinely confidential mail that got classified as spam to upload new words to the public data.
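The whitelist/greylist routing described above could be sketched roughly as follows. This is only a minimal illustration in Python, not SpamAssassin's actual interface; the addresses, the 0.9 threshold, and the `route` helper are all hypothetical:

```python
# Hypothetical sketch of the whitelist/greylist flow described in the text.
# The lists, threshold, and function are illustrative assumptions.

from email.utils import parseaddr

whitelist = {"friend@example.org"}   # everyone who has sent me ham
greylist = {"flipflop@example.org"}  # demoted senders, never re-whitelisted
blacklist = set()                    # exists, but stays empty

def route(sender_header, bayes_spam_probability):
    """Decide where an incoming message should land."""
    _, addr = parseaddr(sender_header)
    if addr in blacklist:
        return "deleted"          # silently dropped
    if addr in whitelist:
        return "inbox-highlight"  # never auto-filed as spam, shown in color
    if bayes_spam_probability > 0.9:
        return "spam"             # kept in the Spam folder for later review
    if addr not in greylist:
        whitelist.add(addr)       # first ham from a new sender
    return "inbox"
```

A greylisted sender's ham still reaches the inbox; it just no longer earns the sender a place back on the whitelist.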
So probably most individuals would only adjust existing weights, and only some setups would contribute new words. Those could come from spamtraps, and from organisations or even individuals who send in complete training mails. And perhaps none of this is even necessary: the system might be just as effective with a big corpus of data where submissions only adjust weights.

But this is all a bit beside the point. The point is that spam filtering works just fine on an individual level, for me. And if it did create problems, I'm sure we could think of things to solve that specific issue.

Peter.

PS: By the way, some mail is already denied at the mailserver and never enters the system. The most important instance of this is mail purporting to come from myself, but not originating from within my own network. Lots of spammers send you spam from your own address, be it in the envelope or in the headers. I run my own webmail server, so even if I need to send myself a message and I didn't bring my laptop, it would still originate from my own webmail server.

[1] Actually, that is a case where the distributed solution truly excels: quickly homing in on the latest mass mailing. The sheer number of identical mails alone is a big warning sign, and a lot of people will start reporting them as spam.

-- 
I use the GNU Privacy Guard (GnuPG) in combination with Enigmail.
You can send me encrypted mail if you want some privacy.
My key is available at <http://digitalbrains.com/2012/openpgp-key-peter>

_______________________________________________
Gnupg-users mailing list
[email protected]
http://lists.gnupg.org/mailman/listinfo/gnupg-users
