| What we are doing is to track the 2000 (user configurable) most recent
| spammer IP addresses. The list is maintained as an MRU style list
| (sorted with the most recent at the top). If incoming messages reach a
| user defined score, the IP address of the spammer is added to the list.
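
For readers who want something concrete, the list described in the quote could be sketched roughly as follows. This is only an illustration, not the poster's implementation: the class name, method names, and the default threshold of 10.0 are made up, and the 2000-entry cap is taken from the quote.

```python
from collections import OrderedDict


class RecentSpammerList:
    """Hypothetical MRU list of spammer IPs, capped at max_size entries."""

    def __init__(self, max_size=2000, score_threshold=10.0):
        # Both values are user configurable; the threshold default is an assumption.
        self.max_size = max_size
        self.score_threshold = score_threshold
        self._ips = OrderedDict()  # insertion order: last key is the most recent

    def observe(self, ip, score):
        """List the sender's IP only if the message score reaches the threshold."""
        if score < self.score_threshold:
            return False
        self._ips.pop(ip, None)   # re-listing moves the IP back to the top
        self._ips[ip] = True
        while len(self._ips) > self.max_size:
            self._ips.popitem(last=False)  # drop the least recently seen IP
        return True

    def __contains__(self, ip):
        return ip in self._ips

    def ranked(self):
        """Return the IPs with the most recent at the top (index 0)."""
        return list(reversed(self._ips))
```

An incoming message from an address for which `ip in spammers` is true could then be given extra weight before any DNS lookup is made.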

<snip>

| Here is what we found. After about 3 weeks of data collection, only
| about 1 in 400 incoming spams is identified by a DNS lookup, and NOT on
| the list of the 2000 most recent spammers. Also, of all the spams we
| receive on all accounts, about 43% are on the recent spammer list,
| meaning that almost half of the spams we receive are from senders that
| have spammed us before.

<snip>

This is one of the capabilities we're building into Message Sniffer v3.
Our testing has shown similar results; however, there are some
complexities with these tests, particularly where "gray" sources are
found. As a result, our implementation will resolve the IP address and
other "network centric" tests first as "features" of the message. These
features then become part of the input stream for the Bayesian hinting
engine.

(It should be noted that the "Bayesian hinting engine" is really more a
blend of fuzzy logic, neural networks, and naive Bayesian learning
techniques... it's just easier to use the current buzz-word to describe
it...)

So far our simulations indicate some profound accuracy improvements when
"new" spam arrives, and surprisingly also when non-spam from "gray"
senders arrives. The early analysis indicates that the learning engine
is picking up second and third order patterns associated with these
message features... This has the effect of "gating" the effect of some
heuristics which are ambiguous under other circumstances so that they
only count when they can be accurate.

It seems obvious that as a weighted test, the top "n" most used IPs are
a good bet. Similarly, a suggestion for research would be to apply a
logarithmic scale to the MRU list position and use that as a weight...
This scheme can be particularly useful if the list is dynamically scaled,
because the relative weights of different list positions can be
maintained as the number of entries on the list changes... This is a
similar mechanism to our "Rule Strength" analysis, which is used to gate
out rules that are currently inactive. (See
http://www.sortmonster.com/MessageSniffer/Performance/CurrentRuleStrength.jsp)
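
As a rough illustration of the logarithmic weighting idea, here is one possible reading of it. The function name and the exact formula are assumptions made for this sketch; they are not taken from Message Sniffer. Position 0 is the top of the MRU list, and the weight is normalized by the log of the current list size so that it keeps its meaning when the list grows or shrinks.

```python
import math


def mru_position_weight(position, list_size):
    """Hypothetical weight in (0, 1]: 1.0 at the top, falling off with log(position).

    position  -- 0-based rank in the MRU list (0 = most recent)
    list_size -- current number of entries on the list
    """
    if list_size < 1 or not 0 <= position < list_size:
        raise ValueError("position must satisfy 0 <= position < list_size")
    # log1p(x) = log(1 + x), so position 0 contributes log(1) = 0.
    return 1.0 - math.log1p(position) / math.log1p(list_size)
```

With this particular formula, an entry at rank 9 on a 99-entry list and an entry at rank 99 on a 9999-entry list both get a weight of about 0.5, which is the kind of scale independence suggested above.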

Another important factor we have found for these kinds of tests is that
there tends to be a periodicity to message rates from some networks...
the result of this is that in a linear MRU paradigm some networks will
appear and disappear from the list, resulting in "late blocking" on the
same period. That is, a batch of unwanted content will come through and
cause the IP to go to the top of the list, but then the flow falls off
and the IP is dropped. Next time unwanted content comes in from that IP
it is let through the filter for a time because the IP is not on the
list... shortly it will be blocked again but during that "build up time"
a significant amount of the content might be delivered.

A counter to this "pulsing" effect is to develop an increasing
"persistence" for the more highly listed IPs so that they tend to stay on
the list through the "down" period. Another important balance for
persistence, however, is to reduce its effects based on any ambiguous or
false positive hits... in fact it turns out that this "persistence
reduction" should have a persistence of its own so that periodic
false-positive indications can be suppressed when there is mixed content
from the source.
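
To make the persistence idea a little more concrete, here is a minimal sketch under stated assumptions. It is not how Message Sniffer does it: the class names, the gain and decay constants, and the "persistence divided by (1 + doubt)" rule are all invented for illustration. Spam hits build up persistence, ambiguous or false-positive hits build up a separate "doubt" value, and both decay on their own schedule, so the persistence reduction has a persistence of its own.

```python
from dataclasses import dataclass


@dataclass
class SourceState:
    persistence: float = 0.0  # built up by spam hits from this IP
    doubt: float = 0.0        # built up by ambiguous / false-positive hits


class PersistentSourceList:
    """Hypothetical IP list where heavy sources survive quiet periods."""

    def __init__(self, hit_gain=1.0, doubt_gain=1.0,
                 persistence_decay=0.9, doubt_decay=0.95, keep_threshold=0.5):
        # All constants are illustrative assumptions, not tuned values.
        self.hit_gain = hit_gain
        self.doubt_gain = doubt_gain
        self.persistence_decay = persistence_decay
        self.doubt_decay = doubt_decay
        self.keep_threshold = keep_threshold
        self.sources = {}

    def record_spam(self, ip):
        self.sources.setdefault(ip, SourceState()).persistence += self.hit_gain

    def record_ambiguous(self, ip):
        self.sources.setdefault(ip, SourceState()).doubt += self.doubt_gain

    def tick(self):
        """Advance one time period; both values fade, doubt more slowly."""
        for state in self.sources.values():
            state.persistence *= self.persistence_decay
            state.doubt *= self.doubt_decay

    def strength(self, ip):
        state = self.sources.get(ip)
        if state is None:
            return 0.0
        # Doubt reduces the effect of persistence without erasing it.
        return state.persistence / (1.0 + state.doubt)

    def is_listed(self, ip):
        return self.strength(ip) >= self.keep_threshold
```

With these example constants, a source that sent five spams is still listed ten quiet periods later, while a source that sent one has already dropped off, which is the behaviour needed to ride out the "down" part of a pulse.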

Note that periodicity, gating, and persistence mechanisms are useful on
many heuristics - not just IP based tests.

I hope these thoughts spark some new ones that prove helpful...

:-)

_M

---
This E-mail came from the Declude.JunkMail mailing list.  To
unsubscribe, just send an E-mail to [EMAIL PROTECTED], and
type "unsubscribe Declude.JunkMail".  The archives can be found
at http://www.mail-archive.com.