| What we are doing is to track the 2000 (user configurable) most recent
| spammer IP addresses. The list is maintained as an MRU style list
| (sorted with the most recent at the top). If incoming messages reach a
| user defined score, the IP address of the spammer is added to the list.
<snip>
| Here is what we found. After about 3 weeks of data collection, only
| about 1 in 400 incoming spams is identified by a DNS lookup and NOT on
| the list of the 2000 most recent spammers. Also, of all the spams we
| receive on all accounts, about 43% are on the recent spammer list,
| meaning that almost half of the spams we receive are from senders that
| have spammed us before.
<snip>

This is one of the capabilities we're building into Message Sniffer v3. Our testing has shown similar results; however, there are some complexities with these tests, particularly where "gray" sources are found. As a result, our implementation will resolve the IP address and other "network centric" tests first as "features" of the message. These features then become part of the input stream for the Bayesian hinting engine. (It should be noted that the "Bayesian hinting engine" is really more a blend of fuzzy logic, neural networks, and naive Bayesian learning techniques... it's just easier to use the current buzz-word to describe it...)

So far our simulations indicate some profound accuracy improvements when "new" spam arrives, and surprisingly also when non-spam from "gray" senders arrives. The early analysis indicates that the learning engine is picking up second and third order patterns associated with these message features... This has the effect of "gating" some heuristics which are ambiguous under other circumstances, so that they only count when they can be accurate.

It seems obvious that as a weighted test, the top "n" most used IPs are a good bet. Similarly, a suggestion for research would be to apply a logarithmic scale to the MRU list position and use that as a weight... This scheme can be particularly useful if the list is dynamically scaled, because the relative weights of different list positions can be maintained as the number of entries on the list changes...
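The MRU list from the quote, combined with the logarithmic position weighting suggested above, can be sketched roughly as follows. The class name, the `record_spam`/`weight` methods, and the exact weight formula are illustrative assumptions, not any product's actual API:

```python
from collections import OrderedDict
import math

class RecentSpammerList:
    """MRU list of spammer IPs, most recent at the front.

    A sketch of the scheme described above: a hit moves an IP to the
    top, the oldest entry falls off when the (user configurable) size
    limit is reached, and list position yields a log-scaled weight.
    """

    def __init__(self, max_size=2000):
        self.max_size = max_size
        self.ips = OrderedDict()  # front of the dict = most recent

    def record_spam(self, ip):
        """Move (or add) an IP to the top, evicting the oldest if full."""
        self.ips[ip] = True
        self.ips.move_to_end(ip, last=False)
        if len(self.ips) > self.max_size:
            self.ips.popitem(last=True)  # drop the least-recent entry

    def weight(self, ip):
        """Log-scaled weight by list position: 1.0 at the top, falling
        toward 0 at the bottom; 0.0 if the IP is not listed. Because
        the scale depends only on position relative to max_size, the
        relative weights stay consistent if the list is resized."""
        try:
            pos = list(self.ips).index(ip)
        except ValueError:
            return 0.0
        return 1.0 - math.log(pos + 1) / math.log(self.max_size + 1)
```

The top entry always scores 1.0 and the bottom entry scores just above 0, so the weight can feed directly into a weighted-test total alongside other heuristics.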
This is a similar mechanism to our "Rule Strength" analysis, which is used to gate out rules that are currently inactive. (See http://www.sortmonster.com/MessageSniffer/Performance/CurrentRuleStrength.jsp)

Another important factor we have found for these kinds of tests is that there tends to be a periodicity to message rates from some networks... The result of this is that in a linear MRU paradigm some networks will appear and disappear from the list, resulting in "late blocking" on the same period. That is, a batch of unwanted content will come through and cause the IP to go to the top of the list, but then the flow falls off and the IP is dropped. The next time unwanted content comes in from that IP, it is let through the filter for a time because the IP is not on the list... Shortly it will be blocked again, but during that "build up time" a significant amount of the content might be delivered.

A counter to this "pulsing" effect is to develop an increasing "persistence" for the more highly listed IPs so that they tend to stay on the list through the "down" period. Another important balance for persistence, however, is to reduce its effects based on any ambiguous or false positive hits... In fact, it turns out that this "persistence reduction" should have a persistence of its own, so that periodic false-positive indications can be suppressed when there is mixed content from the source.

Note that periodicity, gating, and persistence mechanisms are useful on many heuristics - not just IP based tests.

I hope these thoughts spark some new ones that prove helpful... :-)

_M
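The persistence scheme described above - spam hits build persistence so an entry survives quiet periods, while false-positive hits build a suppression that itself persists - could be sketched per list entry like this. The update rules, decay factor, and constants are illustrative assumptions, not the actual Message Sniffer implementation:

```python
class PersistentEntry:
    """Sketch of the 'persistence' counter for one listed IP.

    Repeated spam hits raise persistence so the entry resists the
    'pulsing' effect and stays listed through a down period. Ham
    (false-positive) hits raise a suppression value that both cuts
    persistence now and damps future spam hits - i.e., the
    'persistence reduction' has a persistence of its own, so mixed
    sources stay suppressed even between false-positive indications.
    """

    def __init__(self):
        self.persistence = 0.0  # resistance to falling off the list
        self.suppression = 0.0  # lasting memory of false-positive hits

    def spam_hit(self):
        # Each confirmed spam raises persistence, damped by suppression.
        self.persistence += 1.0 / (1.0 + self.suppression)

    def ham_hit(self):
        # A false-positive indication grows suppression and reduces
        # persistence immediately.
        self.suppression += 1.0
        self.persistence = max(0.0, self.persistence - self.suppression)

    def tick(self):
        # Periodic decay: entries fade during "down" periods, but a
        # high-persistence entry fades slowly and stays listed.
        self.persistence *= 0.95
        self.suppression *= 0.95

    def should_stay_listed(self, threshold=0.5):
        return self.persistence >= threshold
```

Calling `tick()` on every list-maintenance cycle gives the exponential fade; a clean spam source quickly accumulates enough persistence to bridge its quiet periods, while a mixed source is held back by its suppression history.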
