On Tuesday, December 21, 2004, 1:06:58 PM, Matt wrote:

M> Pete,

M> I'm still exploring this topic, or at least trying to...hoping for some
M> others to share their own definitions or practices (nudge, nudge, wink,
M> wink) so the sample would be slightly more scientific.

Me too. It might be hard to get scientific about it, though; my
suspicion (and experience) is that most folks are not really
scientific about it. I think it is common to take an "I know it when
I see it" approach to defining spam. It would be good to get some more
data on this, though.

M> I am certainly not at all looking to convince anyone to change their own
M> definitions.  Instead my goal is to try to further the awareness of the
M> differences that may or may not exist and hopefully apply this 
M> programmatically and maybe in policy to the way that either Sniffer
M> works, or I work with Sniffer...or both.  I might also find that I need
M> to change my own implementation of the definition that I use because as
M> Marcus stated, "life is short enough to not spend it on handling all
M> this stuff manually."  Fixing FP's on ads is a thankless job most of the
M> time.

In a way I've taken an open-ended approach to this: as a matter of
design I've stated that we do not know, nor can we know with any
certainty, what the policies and definitions of our customers are
going to be. The goal, then, is to continuously learn and approximate
this knowledge in a core rulebase and drive any needed specificity
into the users' rulebases.

It could be (I think it is) that in a world where there is no hard
definition, or at least no definition that satisfies all users, this
open-ended approach is better able to cope than one which attempts to
be more rigid at its core.
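
To make the layering concrete, here is a rough Python sketch. All of
the names and structures are invented for illustration; SNF's actual
internals do not look like this:

    # A minimal sketch, assuming hypothetical rule tables: the shared core
    # rulebase approximates the community's "center," and each user's own
    # rulebase gets the last word where their policy differs.

    CORE_RULES = {
        "rule-1001": "spam",   # learned from the user community at large
        "rule-1002": "spam",
    }

    USER_OVERRIDES = {
        "matt": {"rule-1002": "ham"},  # this user's policy differs, so his
                                       # rulebase carries the exception
    }

    def classify(matched_rule, user):
        """Core verdict first; the user's rulebase overrides it if present."""
        verdict = CORE_RULES.get(matched_rule, "ham")
        return USER_OVERRIDES.get(user, {}).get(matched_rule, verdict)

    print(classify("rule-1002", "matt"))   # ham  (user override wins)
    print(classify("rule-1002", "other"))  # spam (core default stands)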

Perhaps the extreme effort that I know you put into your system is
evidence of the stress between a fuzzy reality and a rigid concept -
you are "filling in the gaps" with personal effort.

M> I do understand the balance that works for Sniffer in handling such
M> matters, but I don't want to be the guy that reports FP's for the things
M> that another user reports as spam.  One of us would be wasting our time
M> and pissing off the other.  The other day for instance, someone manually
M> reported the HarryandDavid first-party ad, and then I manually reported
M> it as a false positive.  Who is right?  Because of this, and regardless

In SNF you both are. Sometimes when this conflict arises the core will
define the content as spam (filtered) and sometimes ham (not
filtered). The decision depends upon the available statistics. After
that, one or the other specific rulebase is changed to accommodate the
difference: blocking the rule, whitelisting the content, adding a
specific black rule, and so on.
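
As a sketch of that tie-break (numbers and names are invented here,
purely for illustration; the real decision weighs richer statistics):

    # When the same content draws both spam submissions and FP reports, the
    # core verdict follows the weight of the available statistics; the
    # dissenting user's own rulebase then absorbs the difference.

    def core_verdict(spam_reports, fp_reports):
        # A bare majority here; the real decision is not this simple.
        return "spam" if spam_reports > fp_reports else "ham"

    def reconcile(user_report, verdict):
        """What changes in the dissenting user's rulebase."""
        if user_report == "fp" and verdict == "spam":
            return "block the rule or whitelist the content for this user"
        if user_report == "spam" and verdict == "ham":
            return "add a specific black rule for this user"
        return "no change needed"

    v = core_verdict(spam_reports=7, fp_reports=1)
    print(v, "->", reconcile("fp", v))  # spam -> block/whitelist for this user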

Here at the boundaries there is always some additional effort (cost)
required. One of the key elements of the system is the diversity of
opinions that drives it. As a matter of practice, yours tends to be
off-center, so you get more of these conflicts than most of our users.

It would be a real shame if the costs (time, effort, etc...) caused
you to go silent. In the end the system is only going to be as good as
the effort we all put in.

M> of the present system for handling such things, I do think that Sniffer
M> should have a definition for this type of E-mail and a generalized set
M> of rules to follow (soft edges of course).  Today for instance you 
M> decided to bring backscatter into your definition of spam/unwanted 
M> E-mail, a fully conscious choice, and one that needed to be done with
M> purpose and qualification.  I believe that when it comes to first-party
M> advertising, this should be done similarly when it comes to qualifying
M> manual reports of both false positives and false negatives, and also in
M> qualifying some tertiary links that can land in spamtraps assigning
M> guilt to an innocent source (maybe the association is guilt enough).

Many of the elements involved are difficult to measure and predict -
so I'll respond by comparing the two proposed mechanisms and hope that
this keeps us on topic:

One is relatively easy (cost/benefit) while the other is relatively
hard.

When attempting to filter backscatter we can formulate a process that
has a high probability of success with the resources at hand - and in
addition the statistics show that we would have almost no conflict
with our customers if we did this. (Many have expressed an interest in
this and none have expressed a desire to protect these messages.)
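
For instance, here is a rough sketch of why backscatter is a tractable
target. The heuristics are invented for illustration and are not our
actual rules:

    import re

    # Bounces are structurally recognizable: a null return-path or a
    # mailer-daemon sender plus a delivery-failure subject. Pairing this
    # with a record of what a site actually sent would separate backscatter
    # from legitimate NDRs.
    BOUNCE_SENDER = re.compile(r"^(mailer-daemon|postmaster)@", re.IGNORECASE)
    BOUNCE_SUBJECT = re.compile(
        r"undeliver|returned mail|delivery (status|failure)", re.IGNORECASE)

    def looks_like_backscatter(return_path, from_addr, subject):
        is_bounce = return_path == "<>" or bool(BOUNCE_SENDER.match(from_addr))
        return is_bounce and bool(BOUNCE_SUBJECT.search(subject))

    print(looks_like_backscatter("<>", "mailer-daemon@example.com",
                                 "Returned mail: User unknown"))  # True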

When considering first-party advertising, things are very different. It
is difficult to formulate a process that can capture the required data
in real-time or even near real-time - and our data shows that we
already have a high degree of conflict in this area. Some customers
aggressively submit messages that they want filtered which others do
not. Where the system meets your edge the data is very clear. The vast
majority of rules that you have removed for your system have been
submitted by users (occasionally even spamtraps), have been in the
system for more than 100 days (sometimes years), and have had very
few, or more commonly zero, prior false positive reports. As a result
it's easy to conclude that if SNF establishes any hard guideline on
this kind of content we will spend a lot of resources implementing
the policy, fail to implement the policy accurately much of the time,
and consistently generate conflicts with large groups of our customer
base.
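
Here is roughly how that conflict shows up in the numbers. This is a
toy metric with invented figures, not our actual measurement:

    # 0.0 means every customer agrees about the content; 0.5 means the
    # customer base is split down the middle.
    def conflict_score(spam_submitters, fp_reporters):
        total = spam_submitters + fp_reporters
        return min(spam_submitters, fp_reporters) / total if total else 0.0

    print(conflict_score(40, 0))   # backscatter-like: 0.0, no conflict
    print(conflict_score(25, 20))  # first-party ads: ~0.44, heavy conflict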

So, for now, the decision is to allow the learning mechanisms we have
built into the system to find the best balance based on each
instantaneous case. If/when conditions change we will rethink the
decision; after all, we always want to improve.

Just to clarify, I don't want my comments to imply that you are at
odds with us, nor do I intend to make any judgements about your
policies or direction. In fact I admire them and highly value your
input. You're just a great example of the conflict in the data :-)

An additional note: The rules that are forced out of the core by
your FP reports are, by design, those that have a relatively low hit
rate and therefore a low probability of conflict. It is my belief that
many of these rules could be causing false positives that simply go
unnoticed or unreported. To the extent this is true, your inputs are
very beneficial. If not you, then who?
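
Sketched out, the retirement logic looks something like this. The
threshold is invented and the real logic weighs more than a single
number:

    # A false positive report only forces a rule out of the core when the
    # rule's global hit rate is low enough that removing it costs little.
    # Higher-traffic rules stay in the core and the reporter's own rulebase
    # gets the exception instead.
    def handle_fp_report(global_hits_per_day, low_traffic_threshold=5.0):
        if global_hits_per_day < low_traffic_threshold:
            return "retire the rule from the core (low hit rate, low conflict)"
        return "keep the rule in the core; block it in the reporter's rulebase"

    print(handle_fp_report(0.3))    # low-rate rule: retired from the core
    print(handle_fp_report(250.0))  # high-rate rule: per-user exception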

M> Although you allow for customizations among your individual clients to
M> handle such differences, this is not the best use of any of our time to
M> feel our way through this unless it is a part of a process of finding a
M> larger consensus.

It definitely is. The system is designed that way. Every participant
provides valuable data that moves "the center" and enriches the
system's "knowledge". This is, in fact, how we can do what we do in
the first place. A core element of the design can be summed up in this
question which every element is constantly asking and applying: How
can this piece of data be applied to the benefit of every other
element? We learn a new piece of spam once, and everyone is protected.
We find a false positive once, and by-and-large, everyone is protected
again. Soon (and to some extent now) if a few elements can collaborate
to develop new knowledge, then that too is applied to global benefit.
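
In miniature, as a toy Python sketch (the structures are invented;
the real system is far richer than a pair of shared sets):

    # One report updates shared state that every participant benefits from:
    # a spam submission adds a shared rule, and a false positive report
    # removes it and records a shared exception.
    shared_rules = set()        # patterns everyone filters
    shared_exceptions = set()   # patterns everyone stops filtering

    def learn_spam(pattern):
        shared_rules.add(pattern)         # learned once, applied for all

    def learn_false_positive(pattern):
        shared_rules.discard(pattern)     # corrected once, corrected for all
        shared_exceptions.add(pattern)

    learn_spam("cheap-meds-pattern")
    learn_false_positive("monthly-statement-pattern")
    print(shared_rules, shared_exceptions)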

This principle of leverage is at the core of how SNF develops its
value... So, every interaction is very important and is encouraged.

M> I am not of course so bold as to suggest that my preference would be the
M> best choice for anyone but myself, and hence the query to the list for
M> feedback.  I also think that the discussion could be fruitful in many
M> other regards...if people would be willing to share their opinions.

Absolutely.
_M

M> Pete McNeil wrote:

>>On Tuesday, December 21, 2004, 4:49:33 AM, Markus wrote:
>>
>>MG> First of all, spam is anything
>>MG> - coming from nonexistent or forged senders
>>MG> - having "hidden" content
>>
>>MG> But what you're asking for is the difference between our
>>MG> human brain and stupid computers (Pete, your comment please ;-)
>>
>>Well... I'm having fun lurking and I don't want to spoil that. I'm
>>anxious to learn what folks are thinking about all of this (without my
>>nudging).
>>
>>The current implementation of Sniffer is a kind of broad spectrum
>>hybrid learning system. We use statistical models to try to keep the
>>core rulebase targeting what our users _seem_ to want filtered then we
>>customize individual rulebases to match specific preferences. The
>>learning model isn't perfect, but it has shown that by and large there
>>is a strong agreement for most folks about what should be filtered -
>>even if that definition cannot be clearly and consistently stated.
>>
>>(Note I did not say "what is spam" because that is getting to be more
>>precise and more contentious these days.)
>>
>>What I find (and it really stands out when working with Matt) is that
>>the definition indicated by the standing rules in our core rulebase
>>is a mixed bag of features and that the definition is highly fluid
>>around the edges.
>>
>>For example, in large part Matt's rules would indicate traffic from
>>chtah is "not spam" but even he admits it's not acceptable to make
>>that definition hard (not ok to white-list chtah).
>>
>>A more liberal definition of ham holds that if the recipient has a
>>first party relationship with the sender then any content from that
>>sender should not be filtered... Clearly from the volume of direct
>>advertising that is submitted to us as spam (even as recurring spam
>>problems) this definition does not hold for most of our users.
>>
>>This "edge definition problem" was predicted and so far our model is
>>doing a reasonably good job of dealing with it - though improvements
>>are clearly needed and are on their way (albeit slowly).
>>
>>In the meantime, end-user-specific Bayesian classification can often
>>solve the edge problem -- thus reinforcing that the fluidity at the
>>edge is largely due to differences in the filtering preferences of the
>>end users and the variability thereof.
>>
>>Add to that the problem of data collection, and the challenge becomes
>>not only difficult to solve but difficult to measure. Imagine piloting
>>a supersonic fighter jet through a narrow winding canyon with your
>>eyes shut and you've just about got the picture.
>>
>>As for the stupidity of machines... I personally believe that strong
>>intelligence can be built artificially (and in fact I do that for fun
>>and profit)... The big challenge with using AI for spam is the same as
>>for many AI systems where people's expectations are concerned: The AI
>>cannot and does not have a human frame of reference and so even if it
>>did match or exceed the innate intelligence of a human counterpart, it
>>would not be in a position to predict or model human behaviors
>>precisely.
>>
>>Said another way (partly tongue in cheek) - since computers don't have
>>sex, they don't grok porn and (ahem) organ enhancement spam.
>>
>>Without a social frame of reference they are reduced to guessing at
>>otherwise meaningless patterns. You or I could do no better in that
>>world.
>>
>>So, what we do with the design of Sniffer is to build a highly
>>integrated hybrid with both human and machine components. Each gives
>>the other strong leverage where it's needed. The machines remember
>>better than we do, find and learn patterns well, and manage large
>>datasets without too much effort. The humans understand the social
>>contexts, predict and decode the strategies that are used by spammers,
>>and interpret the needs and desires of our customers.
>>
>>I think I might be rambling...
>>
>>Were these the kinds of comments you were looking for?
>>
>>_M