Markus,

Thanks for the stats.  I've actually been keeping copies of all of the false positives that we have been reprocessing since Monday.  Here's a breakdown by message size, counted per sender (some newsletters and ads are sent to multiple recipients, which would otherwise throw off the numbers):
1 - < 0.5 KB
1 - 0.5 KB to 1 KB
5 - 1 KB to 5 KB
2 - 5 KB to 10 KB
2 - 10 KB to 15 KB
6 - 15 KB to 20 KB
0 - 20 KB to 30 KB
1 - 30 KB to 40 KB
2 - 40 KB to 50 KB
0 - 50 KB to 75 KB
2 - 75 KB to 100 KB
1 - 100 KB to 200 KB
1 - 200 KB to 300 KB
1 - > 300 KB
I'm mostly concerned about false positives and performance currently.  While our FP rate is regularly below 0.02% now, it still takes almost as much time to find and fix problems as it did when our rate was many times higher than that.  I therefore need to balance the risk of causing FP's by adding points for weight against the incremental benefit of being able to block a small extra percentage of spam, and err heavily on the side of protecting against FP's.

Also note that I am very liberal in classifying good E-mail, allowing through anything where the recipient has a first-party relationship with the sender.  FootLocker.com for instance sent me two ads in a week for the first time since I bought something from them 20 months ago.  I figure that as long as they honor my opt-out (despite my never having opted in to their ads), this protects those that want the content from having it blocked.  Unfortunately many administrators consider this stuff to be spam, and it makes my job more difficult because of reports to SpamCop, Sniffer, and other places that nominate such things.  While this stuff may be spam, people should also take note of how limited the blocking mechanism is at differentiating between spam from a particular source and a legitimate E-mail from that source or containing similar links.  Administrators who can't differentiate should seek out a better method IMO.  Anyway...

I've done some review of our held spam that scores between 10 and 24 points on our system (a 150% boundary).  So far in the past 4 days, for instance, every held message over 100 KB was a FP from an individual (the worst kind).  There's definitely spam between 30 KB and 100 KB, but as a percentage, messages in that range are far more likely to be false positives: newsletters from dirty sources often enough come in over 30 KB, while opt-in spammers don't generally bother with that much content and zombie spammers certainly don't (for now at least).

My thoughts about the weight test are twofold.  For one, I'm really only interested in adding points to zombie spam, since static spammers can be caught once and then their whole IP space can be blacklisted.  Static spammers aren't very dynamic outside of their owned blocks, and I'm not very concerned about proactive protections using a message size filter.  Zombie spam though is almost always below 5 KB, and sometimes below 0.5 KB.  If I can narrow this down to 99.9% of it falling below a certain size, I can use the size test to skip my processor-intensive filters like GIBBERISH, IPLINKED and @LINKED among others.  Yesterday I managed to skip processing these filters on 5% of my mail volume when they were set to only run below 30 KB in size.  If that magic number is more like 5 KB, I can save much more in terms of processing power.  Another added benefit is that when you don't run a filter on messages above a certain size, you limit the potential of a false positive with that filter.  For instance, I see plenty of FP's on IPLINKED in newsletters, but this filter is built to target zombie spam, not spam from static sources which are easily tagged.  So in effect, even without subtracting points, just using larger sizes to skip certain tests protects from FP's and saves processing power.
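
As a rough illustration of the size gate described above, here is a minimal Python sketch.  It is not Declude configuration syntax: the function name, the filter tuples and the cutoff constant are all made up for the example, and the 30 KB default is simply the value mentioned above.

```python
# Hypothetical cutoff: zombie-only filters run only on messages below this size.
ZOMBIE_CUTOFF_BYTES = 30 * 1024


def score_message(size_bytes, filters, cutoff_bytes=ZOMBIE_CUTOFF_BYTES):
    """Run filters on one message and return (total_points, skipped_names).

    `filters` is a list of (name, zombie_only, check) tuples, where `check`
    is a callable returning the points that filter adds.  Filters flagged
    zombie_only are skipped when the message is at or above the cutoff,
    which saves their processing cost and avoids their false positives.
    """
    total = 0
    skipped = []
    for name, zombie_only, check in filters:
        if zombie_only and size_bytes >= cutoff_bytes:
            skipped.append(name)  # never evaluated, so it cannot misfire
            continue
        total += check()
    return total, skipped
```

Lowering the cutoff toward 5 KB would skip the expensive filters on a larger share of mail, at the cost of missing any zombie spam above that size.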

So far I'm differentiating between filters built for static sources or a mix, and filters built specifically for zombie spam, and skipping each type above a different message size.  I'm probably only going to add points to things below 0.5 KB, and this will only be 10% to 20% of my hold weight.  I did see some FP's from 0.5 KB to 1 KB, mostly very brief messages that just scraped under the limit.  I'm going to try looking for the minimum size of a message sent from a legit mail client and only add points below that size.  The sweet spot for zombie spam certainly appears to be below 5 KB, but I have to do some more research on that.  Unfortunately I can't parse the COPYFILE message bodies for headers, which would let me identify the zombie stuff more effectively.

For those that have asked or are interested in the weight filter, what I'm going to do is set it up so that 7 different ranges can be set by way of the arguments in a comma-delimited string.  This way everyone can tune it to their own needs.  The skipping of the filter will also be configurable with arguments as long as you are using 1.79i4+.
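
To make the idea of ranges in a comma-delimited string concrete, here is a small Python sketch.  The argument format shown ("upper limit in KB:points" pairs) is purely an assumption for illustration, as are the function names; the actual filter's argument syntax is not described here.

```python
def parse_ranges(arg_string, max_ranges=7):
    """Parse 'limit_kb:points,...' into a sorted list of (limit_kb, points).

    The 'limit:points' format is hypothetical; only the limit of 7 ranges
    and the comma-delimited string come from the description above.
    """
    ranges = []
    for item in arg_string.split(","):
        limit, points = item.split(":")
        ranges.append((float(limit), int(points)))
    if len(ranges) > max_ranges:
        raise ValueError("at most %d ranges are supported" % max_ranges)
    return sorted(ranges)


def points_for_size(size_kb, ranges):
    """Return the points of the first range whose upper limit exceeds size_kb.

    Messages larger than every configured limit get 0 points.
    """
    for limit, points in ranges:
        if size_kb < limit:
            return points
    return 0
```

With the string "0.5:3,30:0,100:-2", for example, a 0.3 KB message would get 3 points, a 10 KB message 0 points, and a 50 KB message -2 points.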

Matt







Markus Gufler wrote:
Cheer up :)
    

No problem. Just wondered about the 8 minutes.  :-)
I know that in Declude we have a great tool and I can't have it 100% as I
want.

Hope your external test will work fine and you can add additional tests. 
As we have been checking message sizes in SpamChk for over a year now, maybe
I can give you some input about my observations.

What about the idea to use this script as an "external weight" test and let
the script return the result as the weight? So you have one single test in
the declude.cfg file and you can return whatever weight you want directly to
the Declude weighting system.

For example I've seen that around 50% of all incoming spam is under 5
kBytes.
However there are spam messages up to 100 kBytes. (see attached diagram
based on around 20000 held spam messages on our server in the last 4 days)

Based on these values we've decided to give a very small negative weight to
messages of at least 32 kBytes, more negative points to messages of at least
48 kBytes, and still more negative points to messages of more than 64
kBytes.

Theoretically it should be a good idea to return the result directly
dependent on the file size. So for example the minimum file size for a
negative weight should be 30 kByte. This should return a negative weight of
5% of the hold value (-1 point for hold-on-20). The returned negative weight
should be increased by 5% of the hold weight for every additional 10 kBytes.


Size (kByte)   Weight
10              0
20              0
30             -1
40             -2
50             -3
60             -4
...
100            -8
...
220            -20
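
The schedule above can be sketched as a small function.  This is only a minimal Python illustration, not the SpamChk script itself: the function and parameter names are invented for the example, and it assumes sizes are rounded down to whole 10-kByte steps.

```python
def size_weight(size_kbytes, hold_weight=20, start_kbytes=30,
                step_kbytes=10, step_percent=5):
    """Return the negative weight for a message of the given size.

    Below start_kbytes the weight is 0.  From start_kbytes on, each full
    step of step_kbytes adds another step_percent of the hold weight.
    """
    if size_kbytes < start_kbytes:
        return 0
    # Number of steps reached, counting the first one at start_kbytes.
    steps = (int(size_kbytes) - start_kbytes) // step_kbytes + 1
    return -round(steps * step_percent * hold_weight / 100)
```

With the defaults (hold-on-20), a 30 kByte message gets -1, a 100 kByte message gets -8, and a 220 kByte message gets -20, matching the table.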


On my server I can see the following variation of message file sizes:

12%   >64 kByte
2%    48 to 64 kByte
6%    32 to 47 kByte
80%   <32 kByte


I consider negative points for large messages to be relatively safe because
spammers - even if using an army of zombies - can't easily send out a large
quantity of spam of this size.

Markus



  

-- 
=====================================================
MailPure custom filters for Declude JunkMail Pro.
http://www.mailpure.com/software/
=====================================================
