RE: [Declude.JunkMail] OT: How to define "spam" and "ham"

Markus Gufler Tue, 21 Dec 2004 01:59:27 -0800

First of all, spam is anything

coming from nonexistent or forged senders
having "hidden" content

But what you're asking for is the difference between our human brain and stupid computers (Pete, your comment please ;-)

Generally, I simply try to keep our customers' mailboxes as clean as possible of all this automatically generated stuff. Human brains are so intelligent, but computers are much faster and can send out billions of messages in a very short time. Our life is too short to spend it handling all this stuff manually.

For sure, there is also legit automatic stuff. In this case the challenge is not to identify spam but to identify computer-generated ham and let it pass.

One good qualification for "bad content" is the weighting system and combo tests. If many different tests fail on the same message, we all know it's a good indicator of spam. If someone is sending out legit messages that fail many different tests, then he is definitely doing something wrong and has to rethink what he does in order to do it successfully.
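To make the weighting idea concrete, here is a minimal sketch of how several failed tests add up to a hold decision. The test names, the individual weights and the hold weight of 10 are assumptions made up for illustration; they are not taken from any real Declude configuration.

```python
# Hypothetical test names and weights, chosen only for illustration.
TEST_WEIGHTS = {
    "ip_blacklist": 5,
    "forged_sender": 6,
    "hidden_content": 4,
    "text_filter": 3,
}

HOLD_WEIGHT = 10  # assumed threshold at which a message is held


def total_weight(failed_tests):
    """Sum the weights of all tests that a message failed."""
    return sum(TEST_WEIGHTS[name] for name in failed_tests)


def is_held(failed_tests, hold_weight=HOLD_WEIGHT):
    """Hold a message only when the combined weight reaches the threshold."""
    return total_weight(failed_tests) >= hold_weight
```

With these assumed numbers, a message failing only the IP blacklist stays below the hold weight, while one that also fails the forged-sender test is held. That is the difference from a filter that blocks on a single indicator.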

Consider the numerous "spam filters" out there that block messages based on a single indicator of spam (for example, failing one single IP blacklist), or those pure text-filter solutions catching only around 60% of spam.

As long as there are such services I and my customers can live with the knowledge that nobody is 100% perfect.

At the moment I'm working on a new system that will classify messages into the following 4 categories:

80 - 120% of our current hold weight => Subject: [spam low]
120% - 170% of our current hold weight => send out a notification to the recipient
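The two bands listed above could be checked roughly as follows. This is only a sketch: the function name and the returned labels are hypothetical, the handling of a message at exactly 120% is an assumption (here it falls into the notification band), and anything outside the two listed bands is deliberately left undecided.

```python
def classify(weight, hold_weight):
    """Map a message weight to one of the two bands listed above.

    Integer arithmetic is used so that the band edges are exact.
    """
    scaled = 100 * weight
    if 80 * hold_weight <= scaled < 120 * hold_weight:
        return "tag_subject_spam_low"   # Subject: [spam low]
    if 120 * hold_weight <= scaled <= 170 * hold_weight:
        return "notify_recipient"       # grey zone, recipient gets a notification
    return "not_described"              # outside the bands listed above
```

For example, with a hold weight of 10, a message with weight 9 would only get the subject tag, while one with weight 15 would trigger a notification.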

The notifications are a little bit problematic:

As there are many customers using our server as a gateway, we don't know if the recipient's address really exists. So at the moment I try to check whether this recipient has received legit messages (<50% of the hold weight) in the previous - let's say - two weeks. This should prevent us from sending out a big number of unnecessary messages (for example after dictionary attacks on gateway domains).
I want to send out as few notifications as possible. So I plan to generate them twice each day: the first time at around 9:00 am local time, the second at around 5:00 pm. With this strategy I hope to notify each recipient on the same day the false positive was held on our system, but not more than twice a day, even if I have enough data to send notifications each hour. (Otherwise recipients with a big spam volume would receive a notification each hour.)
The notification contains only a link (containing a long random string as access security) to a dynamic website. This website will show the recipient a list (datetime / sender / subject) of all messages between 120% and 170% of our current hold weight. I believe we can't send out notifications containing recipient addresses and subject lines in the body, as spam filters like the one included in MS Outlook would block them again.
With the dynamic website I can track the visits and so withhold any further notification until the customer has visited the website. This should reduce our notifications further.
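A rough sketch of the notification rules described above, under some assumptions: the delivery history is available as a list of (datetime, weight) pairs per recipient, the review site lives under a made-up base URL, and all function names are hypothetical. The random part of the link uses Python's standard `secrets` module.

```python
import secrets
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=14)                  # the "two weeks" mentioned above
BASE_URL = "https://example.invalid/review/"   # hypothetical review site


def has_recent_legit_mail(deliveries, hold_weight, now):
    """True if the recipient got a message below 50% of the hold weight
    within the lookback window. `deliveries` holds (datetime, weight) pairs."""
    cutoff = now - LOOKBACK
    return any(
        when >= cutoff and 2 * weight < hold_weight
        for when, weight in deliveries
    )


def should_notify(deliveries, hold_weight, now, has_unvisited_notification):
    """Skip recipients who have not yet visited their last notification link,
    and recipients we have no evidence of (e.g. dictionary-attack targets)."""
    if has_unvisited_notification:
        return False
    return has_recent_legit_mail(deliveries, hold_weight, now)


def make_review_link():
    """Build a link whose only secret is a long random, URL-safe token."""
    return BASE_URL + secrets.token_urlsafe(32)
```

The twice-daily schedule itself would live outside this code, for example in a scheduled job that calls `should_notify` for each recipient with held messages in the grey zone.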

All this work with the notifications has the following benefits:

not we but our customers can decide what's ham and what's spam (at least in the mentioned grey zone)
customers can see our service
we have a copy of each "false positive" and can concentrate our work on preventing this in the future
this comes in addition to the work of keeping the 120-170 zone as clean as possible of messages in order to reduce the review work of our customers (for example with my AVFILTER-COMBO test)

At the moment I'm working on this and so many ideas are still theory, but I'm happy for any feedback.

Markus

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Matt
Sent: Tuesday, December 21, 2004 3:48 AM
To: Declude.JunkMail@declude.com
Subject: Re: [Declude.JunkMail] OT: How to define "spam" and "ham"

Markus,

I have found that my users miss about 99% of the false positives using a system where I set up review accounts in Web-mail for each domain and only capture less than 2% of their blocked volume for them to review. Reprocessing and reporting the message is done with a single click using a link that I added to the interface for this purpose. I know that they miss this much because we also do review for the hold range across our entire user base, however we don't guarantee in any way that we will find every false positive or review this with specific regularity. Obviously as volume increases, so does the work required for us to do this, but it is quite easy for all but a couple of our domains to be reviewed because the number of held messages are generally below 20 a day, and only 7 days are kept.

I too am looking to move to a 'push' format, figuring that if you deliver a message daily to each domain's administrator showing this small sampling where false positives are almost exclusively held, I will dramatically increase the amount of user feedback and more importantly, lessen the dependence on us. I have only had one customer that was ever upset about false positives, and this customer dropped us. The issue there was that the domain owner's wife was very big on free-deals sites, and their daily E-mails were often being blocked and they never gave us enough time to clean up all of the issues. Personally, I don't feel that our service is appropriate for people that value such things so highly, especially since so much of it is associated with spam (shared or brokered lists).

So having this Web-mail review for each domain has in fact provided us with feedback from those few that feel that this is important to them. I have found that the people driven enough to do the review do in fact often report false positives for sources like eWeek and Orbitz, even things as pointless as surveys. I do very much appreciate the feedback, and I have killed entries from my own blacklist repeatedly as a result of these reports after finding that people did want to make their own choices with the tertiary stuff. Since they are also generally tech types, they favor tech content, probably due to familiarity as well as favorites. Those that don't regularly do review however are much less likely to report advertising or low-value newsletters/subscriptions as being false positives. These types are also strangely enough much more likely to report a phishing attempt as a false positive, and that has happened 3 times that I can recall (I'm improving my phishing filters to get this stuff deleted more often instead of just held). So the gist of this is that I get the feeling that just like us, the administrators of these individual domains have their own sub-conscious rules that they use.

This is all a bit secondary to my original inquiry however. What I was really interested in was what rules people like yourself use in determining whether something is ham or spam.

Thanks,

Matt

Markus Gufler wrote:
I'm close to finishing a reporting tool that will send out a daily notification to the local recipient if new messages were held on the mailserver with a final weight slightly above the hold weight (up to now we review these messages regularly and find an average of one false positive each day per around 15k delivered messages).

The notification contains only a link to a webpage where the user can see his held messages and click on them to requeue them.

I'm curious what my customers will consider "not spam" :-)

Markus
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Matt
Sent: Monday, December 20, 2004 2:02 PM
To: Declude.JunkMail@declude.com
Subject: [Declude.JunkMail] OT: How to define "spam" and "ham"

This was the subject of a recent off-list discussion between myself and Pete where there was a perception that my definition of spam was too conservative or rather my definition of ham was too liberal. While I readily admit that in practice, I do personally wish to block many fewer things that I consider to be legitimate first-party advertising than most do, I don't necessarily get the impression that the definitions that I use are all that much off the mark. I have also found that the folks at BondedSender think that I am some sort of anti-advertising zealot for reporting what is near universally what we would consider to be spam, so it does go both ways :) So I wanted to throw this topic out for some feedback and other presentations of one's own definitions and maybe learn something in the process.

First off, I naturally follow the basic definition of spam that is widely promoted where spam is both unsolicited and bulk. What causes such wide deviation from this common definition however is the sub-definition of what constitutes unsolicited, and the gray area that exists beyond this definition due to abuse.

The definition that I use to qualify advertising or newsletter related ham is as follows:

This definition starts with me treating things as ham if it comes from a first-party relationship with the sender, however there are some exceptions as follows:

Evidence of the first-party having harvested significant numbers of recipients in the list, i.e.: Reunion(dot)com.
Refusal to honor opt-outs.
Having no opt-out mechanism for repeated E-mails that are advertising related.
Third-party ads being sent by first-party source when they are not the primary reason for a membership, example: Sportsline's partner specials
Very widespread abuse of a particular direct-marketing provider where most customers of a service are spamming, example: Uptilt.
Selling subscriber lists from one otherwise legitimate site to spammers or brokering lists for spamming, example: many joke sites.

It's my belief that many would consider this definition to be agreeable (please speak up if you don't), however I am near certain that in practice there is a good amount of deviation from this even among those that would at least initially agree with the above.

The issue of applying this in practice to me means that I try not to apply my own emotions or judgments of value to a particular sender. This means that I treat advertising from J.Crew just the same on my system as E-mail from Orbitz, though I personally find Orbitz and most other travel sites to be annoying with their frequency and low in value to the recipient. The trick here is that I have found no evidence of harvesting from either source, and they both practice default-opt-in to their newsletters from their customer-base, and they both seemingly honor opt-outs, so the only difference that I perceive is the subject matter of the E-mails. I have found that many administrators will blacklist Orbitz and even report them to SpamCop, while this is less commonly the case with J.Crew. So the determining factor that is often used regardless of a stated or intended definition appears to be a value judgment placed on the content of the E-mails, either consciously or unconsciously. Would anyone agree or disagree with this perception?

One last note: personally I find the industry standard practice of default-opt-in for customer lists to be disturbing, but if one was to consider that alone as a qualifier of spam, over 99% of advertising messages that pass my definition above would fail the much tighter definition of double-opt-in for requesters only. Since this has become the standard practice in the entire industry, I allow for it just so long as they follow my rules since I definitely have customers (including myself) that do wish to receive some of what is sent to me without initially requesting it, and my customers have the power to opt-out and report any abuse to me for appropriate action.

Please add your comments or even your own definitions.

Thanks,

Matt
-- 
=====================================================
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/
=====================================================
