Re: [eug-lug]Spam, filtering, and censorship

2003-11-12 Thread Patrick R. Wade
On Tue, Nov 11, 2003 at 04:44:20PM -0800, Marc Baber wrote:

> I would say that spam, or at least the set of e-mails that one might
> want to be classified as spam, is *not* the same for everybody, in the
> case of politically motivated spam filtering.  Because the corpus
> of collected spam e-mails is used to filter e-mail for all users, one
> person's spam report can affect e-mail delivered (or not delivered) to a
> large number of people.



> The reason I talk about lost e-mails is because my account was defaulted
> into a delete spam mode when spam filtering was first introduced at
> EFN and I never saw filtered spam until I specifically contacted EFN
> personally to ask that my account be exempted from that default.  I have
> no experience of receiving flagged spam as the default action for
> EFN's spam filter.  I had to lose an airline reservation e-mail and at
> least one job-seeking related e-mail before I became suspicious and
> started asking questions to learn that my account was defaulted to drop
> spam silently.  That was very frustrating for me and has made me the
> spam-filter-unfriendly guy I am today.


Hm.  There have been three periods in the history of efn's spam filters:

2002  DNSBLs coupled with local sendmail rules
During this period, any email that was rejected by our servers
would be bounced back to the sender, thereby meeting the RFC
requirement to either deliver or account for every piece of mail.
This was an increasingly labor-intensive solution, but it did
not generate very many complaints to the volunteer postmaster.

2002  DNSBLs coupled with SpamAssassin, auto-deleting
During this period, efn ran SpamAssassin, first on our main
incoming mailserver, and later on two dedicated hosts.
Mail which was flagged as spam by SpamAssassin was automatically
deleted.  Mail blocked by DNSBLs continued to be bounced.
Reliability, both of the mailsystem as a whole, and of the spam
filter in particular, became embarrassingly bad.  If i recall
correctly, it was during this period that your missing mail
episodes happened.  I apologize for, and continue to be ashamed
about, our mail performance during this period, but there was
really nothing more i could have done to fix it than i did, and
the problem was essentially political.  You are not the only one
to be wary of SpamAssassin on the basis of such experiences; our
debacle caused UO to become very wary of any further experiments
with SpamAssassin.

2003  DNSBLs coupled with SpamAssassin, flagging
In early 2003 we experimented with bouncing mail that
SpamAssassin had flagged as spam back to the sender.
This resolved the RFC-compliance problem, but did little
to improve the reliability issue.  Since then we have
been delivering mail with SpamAssassin's flags attached
(DNSBL rejects are still bounced).  If people opt
to auto-delete flagged spam at delivery time, we do that
for them on a user-by-user basis with .procmailrc configuration.
We aim to enhance this mail system further with individual
user configurability.
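
For readers who want to picture the delivery-time choice described above, here is a minimal Python sketch. It is not efn's configuration (that lives in each user's .procmailrc); it only assumes that flagged mail carries SpamAssassin's `X-Spam-Flag: YES` header. The `USER_PREFS` table, the user names, and the `route` function are hypothetical names made up for illustration.

```python
from email import message_from_string

# Hypothetical per-user preferences; names and values are illustrative only.
USER_PREFS = {
    "alice": "deliver",   # flagged mail still reaches the inbox
    "bob": "spamcan",     # flagged mail goes to a folder that gets purged
    "carol": "delete",    # explicit opt-in to silent deletion
}


def route(raw_message: str, user: str) -> str:
    """Return 'inbox', 'spamcan', or 'delete' for one message."""
    msg = message_from_string(raw_message)
    flagged = msg.get("X-Spam-Flag", "").strip().upper() == "YES"
    if not flagged:
        return "inbox"
    # Users with no recorded preference get flagged mail delivered,
    # so nothing is dropped unless someone asked for that.
    pref = USER_PREFS.get(user, "deliver")
    if pref == "delete":
        return "delete"
    if pref == "spamcan":
        return "spamcan"
    return "inbox"
```

The point of the sketch is the default: a user who never set a preference still receives flagged mail, and deletion happens only on explicit request.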

> If there is to be a central corpus of spam for all users, I'd like to
> see some accountability and transparency:
>
> 1. Who makes the final decision whether an e-mail submitted to [EMAIL PROTECTED]
> or [EMAIL PROTECTED] is included in the corpus as such?  What are the
> relevant policies?  Is it automated or staffed?

At the moment, any email which is submitted as spam, and is recognized
by the postmasters as a sample email (rather than, say, a request for
whitelisting or tech support), is queued for eventual inclusion in
the Bayesian filter.  The sender's report is considered sufficient
evidence that the mail in question is indeed spam or tofu.  At present
none of the queue has been committed to the Bayesian filter, which so far
learns only by auto-learning from the mail it examines as it goes.
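
To make "auto-learn" concrete, here is a simplified Python sketch of threshold-based self-training: the filter trains itself only on messages whose overall score is already decisive. This is a picture of the general idea, not SpamAssassin's actual rules, which have more conditions. The threshold values, the `TokenCounts` class, and the `auto_learn` function are assumptions made up for illustration, not efn's settings.

```python
from collections import Counter

# Illustrative thresholds only (assumptions, not efn's configuration).
LEARN_AS_SPAM_AT = 12.0   # score at or above this: learn as spam
LEARN_AS_HAM_AT = 0.1     # score at or below this: learn as tofu


class TokenCounts:
    """Per-token message counts for the two classes."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(set(tokens))   # count each token once per message
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1


def auto_learn(db, tokens, score):
    """Train on a message only when its score is decisive.

    Returns 'spam', 'ham', or None when the message was not learned.
    """
    if score >= LEARN_AS_SPAM_AT:
        db.learn(tokens, True)
        return "spam"
    if score <= LEARN_AS_HAM_AT:
        db.learn(tokens, False)
        return "ham"
    return None
```

Note the feedback loop this creates: whatever the other rules score highly gets reinforced in the token table, with no human report involved at any point.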

> 2. The corpus should be in an open web directory that is searchable.

That is an interesting idea; i'm not sure how we'd implement it (MySQL?).
We are talking about millions of messages here.

> When SpamAssassin says something is spam, there should be links to
> the reference e-mails in the corpus that were correlated with the spam,
> upon request, so a user can review whether the items in the corpus are
> objective spam or subjective spam.  The individual must have a way
> of reviewing the decisions or processes that contribute to the corpus.


That looks to be Very, Very Hard to do.  I don't know that SpamAssassin
has any sort of support for audit trail in its Bayesian mechanism,
and i would expect including it to significantly increase both the
CPU cycles and the disk space needed to manage a mailstream as 
large as efn's.  It might be easier to do per-user Bayesian filters,
or perhaps to have a Spam Committee which must approve 

Re: [eug-lug]Spam, filtering, and censorship

2003-11-11 Thread Bob Miller
Larry Price wrote:

> spamassassin is a filtering solution analogous to using a pitchfork
> on a river in flood, it's effective at moving debris out of the way,
> but it's not exactly selective...

My question about all this is, does spamassassin's Bayesian
component use per-user filter tables or a single global table?
As installed at EFN, that is.

Anyone interested in Bayesian spam filtering should read these
three papers -- they pretty much defined the field last year
when they came out.

http://www.paulgraham.com/spam.html
http://www.paulgraham.com/better.html
http://www.linuxjournal.com/article.php?sid=6467

I'll also put in a plug for the Bayesian spam filter that Anne and I
use at home.  Since I installed it (3 weeks ago), I haven't seen a
single false positive on 200+ messages/day.  And yes, I've been
checking, though I won't keep checking much longer.

http://bogofilter.sourceforge.net/

-- 
Bob Miller  Kbob
kbobsoft software consulting
http://kbobsoft.com [EMAIL PROTECTED]
___
EuG-LUG mailing list
[EMAIL PROTECTED]
http://mailman.efn.org/cgi-bin/listinfo/eug-lug


Re: [eug-lug]Spam, filtering, and censorship

2003-11-11 Thread Bob Miller
Larry Price wrote:

> The current setup is not the final setup; we are currently moving
> towards per-user whitelists, but the Bayesian filter is probably most
> useful as a global weight.  The trend I notice is that spam is
> generally more similar than different; otherwise everyone's filter
> will have to learn the same patterns again.

Sure, spam is the same for everybody, but tofu is very very different.
For example, Marc sees the words Howard, Dean, campaign a whole
lot more frequently than I do, and I see chezgeek and kbob more
often.

Recognizing each user's tofu words separately significantly improves a
Bayesian filter's performance, at the cost that every user must train
his own filter.
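
A small Python sketch may help show why per-user tofu matters. It is a bare-bones naive-Bayes word scorer, not bogofilter's or SpamAssassin's actual algorithm; it assumes equal priors and add-one smoothing, and the `Table` class, the function names, and all of the training data are made up for illustration.

```python
import math
from collections import Counter


class Table:
    """Token counts for one training set (global, or one user's)."""

    def __init__(self):
        self.spam = Counter()   # word -> number of spam messages containing it
        self.tofu = Counter()   # word -> number of wanted messages containing it
        self.nspam = 0
        self.ntofu = 0

    def train(self, words, is_spam):
        bucket = self.spam if is_spam else self.tofu
        bucket.update(set(words))   # count each word once per message
        if is_spam:
            self.nspam += 1
        else:
            self.ntofu += 1


def word_prob(word, table):
    """Estimate P(spam | word) with add-one smoothing and equal priors."""
    s = (table.spam[word] + 1) / (table.nspam + 2)
    t = (table.tofu[word] + 1) / (table.ntofu + 2)
    return s / (s + t)


def spam_prob(words, table):
    """Combine the per-word estimates naive-Bayes style, in log-odds space."""
    log_odds = 0.0
    for w in set(words):
        p = word_prob(w, table)
        log_odds += math.log(p) - math.log(1.0 - p)
    log_odds = max(-50.0, min(50.0, log_odds))   # avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-log_odds))


# Made-up training data: both tables see the same spam, but only the
# per-user table has seen the campaign mail this user actually wants.
spam_samples = [["enhance", "campaign", "donate"], ["campaign", "offer", "free"]]
global_table, user_table = Table(), Table()
for sample in spam_samples:
    global_table.train(sample, is_spam=True)
    user_table.train(sample, is_spam=True)
for sample in [["lunch", "meeting"], ["kernel", "patch"]]:
    global_table.train(sample, is_spam=False)
for sample in [["dean", "campaign", "event"], ["dean", "campaign", "volunteer"]]:
    user_table.train(sample, is_spam=False)

message = ["dean", "campaign", "event"]
print(spam_prob(message, global_table))   # leans spam
print(spam_prob(message, user_table))     # leans tofu
```

With the global table, "campaign" has only ever appeared in spam, so the message scores above one half; with the user's own table the same words are known tofu and the score drops well below it. The cost is the one already named: each user has to supply that training.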

Maybe I should give a EUGLUG talk on setting up bogofilter for
personal use.  I'd need some help from someone who understands POP and
IMAP -- I still use /var/spool/mail/$USER as my mail queue.

-- 
Bob Miller  Kbob
kbobsoft software consulting
http://kbobsoft.com [EMAIL PROTECTED]


Re: [eug-lug]Spam, filtering, and censorship

2003-11-11 Thread Cory Petkovsek
On Tue, Nov 11, 2003 at 01:00:22PM -0800, Bob Miller wrote:
> Maybe I should give a EUGLUG talk on setting up bogofilter for
> personal use.  I'd need some help from someone who understands POP and
> IMAP -- I still use /var/spool/mail/$USER as my mail queue.

fetchmail

or

mozilla mail >= 1.4, which includes Bayesian filtering

-- 
Cory Petkovsek            Adapting Information
Adaptable IT Consulting   Technology to your
(541) 914-8417            business
[EMAIL PROTECTED]         www.AdaptableIT.com


Re: [eug-lug]Spam, filtering, and censorship

2003-11-11 Thread Ben Barrett
I *love* mozilla mail, esp 1.5, although I don't use it myself at this
point.  I've set it up for a number of folks, here at my day job and
elsewhere, and everyone likes it.  One thing I've noted is that you have
to go into the spam filter config in the Tools menu and tell it to move
your [proposed] spam into a Junk folder, so that you can go in there
and pull out false positives while you train it.  Without fail, a new
MozillaMail setup flags some obviously-desirable mail as spam -- the
easiest way to fix this is to start loading up your addressbook with
whitelisted addresses!  Cheers.

   Ben


On Tue, 11 Nov 2003 14:32:40 -0800
Cory Petkovsek [EMAIL PROTECTED] wrote:

| On Tue, Nov 11, 2003 at 01:00:22PM -0800, Bob Miller wrote:
|  Maybe I should give a EUGLUG talk on setting up bogofilter for
|  personal use.  I'd need some help from someone who understands POP
|  and IMAP -- I still use /var/spool/mail/$USER as my mail queue.
| 
| fetchmail
| 
| or
| 
| mozilla mail >= 1.4, which includes Bayesian filtering
| 


Re: [eug-lug]Spam, filtering, and censorship

2003-11-11 Thread Marc Baber
I would say that spam, or at least the set of e-mails that one might
want to be classified as spam, is *not* the same for everybody, in the
case of politically motivated spam filtering.  Because the corpus
of collected spam e-mails is used to filter e-mail for all users, one
person's spam report can affect e-mail delivered (or not delivered) to a
large number of people.

The reason I talk about lost e-mails is because my account was defaulted
into a delete spam mode when spam filtering was first introduced at
EFN and I never saw filtered spam until I specifically contacted EFN
personally to ask that my account be exempted from that default.  I have
no experience of receiving flagged spam as the default action for
EFN's spam filter.  I had to lose an airline reservation e-mail and at
least one job-seeking related e-mail before I became suspicious and
started asking questions to learn that my account was defaulted to drop
spam silently.  That was very frustrating for me and has made me the
spam-filter-unfriendly guy I am today.

If there is to be a central corpus of spam for all users, I'd like to 
see some accountability and transparency:

1. Who makes the final decision whether an e-mail submitted to [EMAIL PROTECTED]
or [EMAIL PROTECTED] is included in the corpus as such?  What are the
relevant policies?  Is it automated or staffed?

2. The corpus should be in an open web directory that is searchable.
When SpamAssassin says something is spam, there should be links to
the reference e-mails in the corpus that were correlated with the spam,
upon request, so a user can review whether the items in the corpus are
objective spam or subjective spam.  The individual must have a way
of reviewing the decisions or processes that contribute to the corpus.

nuf sed,

Marc

Bob Miller wrote:

> Larry Price wrote:
>
>> The current setup is not the final setup; we are currently moving
>> towards per-user whitelists, but the Bayesian filter is probably most
>> useful as a global weight.  The trend I notice is that spam is
>> generally more similar than different; otherwise everyone's filter
>> will have to learn the same patterns again.
>
> Sure, spam is the same for everybody, but tofu is very very different.
> For example, Marc sees the words Howard, Dean, campaign a whole
> lot more frequently than I do, and I see chezgeek and kbob more
> often.
>
> Recognizing each user's tofu words separately significantly improves a
> Bayesian filter's performance, at the cost that every user must train
> his own filter.
>
> Maybe I should give a EUGLUG talk on setting up bogofilter for
> personal use.  I'd need some help from someone who understands POP and
> IMAP -- I still use /var/spool/mail/$USER as my mail queue.




[eug-lug]Spam, filtering, and censorship

2003-11-10 Thread Larry Price
On Monday, November 10, 2003, at 03:02  PM, Marc Baber wrote:

> Please forgive me for singling you (Larry) out as the expert on the
> subject of [EMAIL PROTECTED], but you've been very helpful on the topic more than
> once in the past and I suspect you're our best hope of finding
> answers to a troubling problem:

Well, I can't complain about being considered an expert, however
regrettable the topic.

> I've been noticing that all e-mail coming to me from the local Dean
> campaign is flagged as [EMAIL PROTECTED] by EFN's servers.

spamassassin is a filtering solution analogous to using a pitchfork on
a river in flood, it's effective at moving debris out of the way, but
it's not exactly selective...

> Had I not specifically requested and arranged for my [EMAIL PROTECTED] to be
> delivered instead of silently deleted, the EFN system would have
> deleted this e-mail without informing me and I would not be aware of
> the event announced in this e-mail.

Being great believers in Free Speech and RFC observance, we do not
throw mail away unless the member in question requests that we do so.
We do recommend sending things to a spamcan directory that gets purged,
or running local filters and using the tagging to presort.

> Most people don't make the arrangements that I have to see their spam
> and decide for themselves.  I wonder how many EFN users opted into
> this list and will never hear about the upcoming event because of
> EFN's [EMAIL PROTECTED] filters?

So if we announce that we are globally whitelisting all mail from
[EMAIL PROTECTED] (by announce I mean leave exposed to google as
this message will be), are we gonna get stuck with a zillion messages
starting off with:

From: [EMAIL PROTECTED]
Subject: Enhan(E yr m4n|-|00D

> I wonder how that may affect attendance at the event?  funds raised?
> election outcome? (Yeah, I know, but hey, it was *real* close last
> time!)

(teehee)

On a more serious note, we are turning things off and making changes 
that are regrettable, making EFN a less friendly place just by the fact 
that it won't be possible to have the same feel as back in the days 
when you could finger anyone's user name and learn about them.
But the neighbourhood has gotten rougher, a lot rougher.

> The overwhelming criterion for identifying this e-mail as [EMAIL PROTECTED] is the
> Bayesian [EMAIL PROTECTED] probability.  I'm familiar with the term Bayesian
> networks and understand that somebody has probably implemented a spam
> recognition heuristic based on such technology.

I did just tweak the scores a bit (hey, I got bit by this: I got a mash
note that was labeled spam), so we shouldn't be seeing as many false
positives.

> My questions are:
>
> 1. Is EFN currently using [EMAIL PROTECTED] as the server name oswald
> suggests?  If so, what version?  If not, what [EMAIL PROTECTED] filter is being
> used?

yes, oswald, booth and chapman are the spamassassins

> 2. The Bayesian probability suggests to me that there is a
> repository of reported [EMAIL PROTECTED] somewhere that incoming e-mail is compared
> against to see if it resembles any previously reported spiced-ham
> incidents.  This leads me to further questions:
> a. Is the repository local to EFN or out there on the 'net?  If the
> latter, where (what URL(s)) is the repository?

I am not the keeper of the tofu, or of the canonical spam queue, so I
don't know the details on this one; you might ask Mike or Patrick.

> b. Who decides what reference e-mails go into the repositor(ies)?
> What criteria do they use?  Can we check to see who may have reported
> *solicited* (not unsolicited) Dean campaign e-mail?

Currently we have SA set to 'auto-learn'; I am thinking this is a
mistake.

If someone, say YOU, were to get together with at least one known and
respected conservative (I would suggest Jim Darrough) and crank out a
list of the candidates' mail servers that was even-handed, we would
certainly use that in our prefiltering.

> I think there is a huge incentive for political zealots to abuse spam
> filters for political goals if they can be so abused.  I want to make
> sure that there are mechanisms in place to prevent such censoring,
> both at EFN and elsewhere.  I welcome knowledge and insights from
> anyone familiar with popular spam filtering tech.

I admit to a much more bleak and cynical view; I think I will live to
see the day when I tell young people about the dark times of SMTP, and
they will laugh at me in disbelief when I tell them that there was very
little one could do about someone sending you a message without a
persistent identity attached.  "You mean you just Trusted people not to
pretend they were someone else!!!" they will squeal, and giggle and
laugh at us benighted people from the 20th century.

> Thank you for listening.  This issue is likely to become my favorite
> rant for the coming year :-),

Mine will be The Painful transition from smtp to xmpp or asmtp.
(yes you will need to buy a 5 dollar certificate to send mail through 
our system, on the other hand