I've been testing this filter for almost a day with very good results. It is catching spam all the time; in fact, over 1/3 of my total mail volume is being tagged.

Here's how it works. Like the Gibberish subject test, this filter searches for strings of characters that are rarely found in normal communications. Since Base64 encoding is currently scanned by text filters, the filter automatically trips on any Base64 content because strings containing Q are so common in the encoding. To offset this effect, it also searches for "attachment;", which is required for any non-inline content, and gives the points back. Since that string isn't associated with inline Base64 content, the offset doesn't apply there, so on inline content the net effect is the same as Declude's BASE64 test. If you test this out, you are advised to reduce the score of BASE64 by the exact score of this test. Again, this test gets tripped by all attachments, but it doesn't change their score. I've found that inline Base64 accounts for less than 20% of the hits.
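To make the scoring described above concrete, here is a minimal Python sketch. It is a simplified model and not Declude's actual engine: the function name `gibberish_score` is made up for illustration, and it assumes that any number of letter-pair hits counts once for the test's weight of 5, that "attachment;" gives back 5, and that matching is case-insensitive.

```python
# Simplified model of the GIBBERISH filter's scoring (an assumption, not Declude code).
TEST_WEIGHT = 5          # weight from the "Usage" line of the filter file
ATTACHMENT_OFFSET = -5   # weight on the "attachment;" line

# The letter pairs listed in the filter file.
PAIRS = (
    ["q" + c for c in "bcdfghijkmnoprstvxyz"]
    + ["vq", "wq", "tq", "jq"]
    + ["xd", "xj", "xk", "xr", "xz"]
    + ["zb", "zc", "zf", "zj", "zk", "zm", "zx"]
)


def gibberish_score(body: str) -> int:
    """Return the net score this model assigns to a message body."""
    text = body.lower()  # assumption: CONTAINS ignores case
    score = 0
    if any(pair in text for pair in PAIRS):
        score += TEST_WEIGHT          # gibberish (or Base64) was found
    if "attachment;" in text:
        score += ATTACHMENT_OFFSET    # give the points back for attachments
    return score


plain = "Hello, please see the quarterly report."
inline = "Content-Transfer-Encoding: base64\n\naGVsbG8gQmFzZTY0"
attached = 'Content-Disposition: attachment; filename="a.zip"\n\naGVsbG8gQmFzZTY0'

for name, body in [("plain", plain), ("inline", inline), ("attached", attached)]:
    print(name, gibberish_score(body))
```

In this model the inline Base64 sample keeps the full score, while the same data sent as an attachment nets out to zero, which is the behavior described above.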

If you don't use the BASE64 test because of foreign languages or other similar issues, that test can be scored negatively to offset this filter's inline detection, so that only displayable text and HTML will produce a change in score. That includes non-displayable gibberish text in brackets.

False positives are bound to happen, but their occurrence is fairly low. Since HTML code is also searched, the filter will find matches in some URLs, especially ones with a tracking capability such as those used by Yahoo! Groups (in the ad sent with listserv postings) and Buy.com. Even less often, it will find a match in regular wording, primarily with the use of acronyms. I'm very interested in hearing about more FPs if you find them.
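If you want to see why a particular message tripped the filter, a small Python sketch like the following can list the letter pairs it contains. The helper name `matched_pairs` and the sample URL are hypothetical, and case-insensitive matching is again an assumption.

```python
# Lists which of the filter's letter pairs occur in a piece of text.
# Illustration only; this is not part of Declude.
PAIRS = (
    ["q" + c for c in "bcdfghijkmnoprstvxyz"]
    + ["vq", "wq", "tq", "jq"]
    + ["xd", "xj", "xk", "xr", "xz"]
    + ["zb", "zc", "zf", "zj", "zk", "zm", "zx"]
)


def matched_pairs(text: str) -> list[str]:
    """Return the pairs from the filter file found in text, in file order."""
    lowered = text.lower()  # assumption: CONTAINS ignores case
    return [pair for pair in PAIRS if pair in lowered]


# A made-up tracking URL and an acronym in ordinary wording.
url_hits = matched_pairs("http://example.com/track?id=aXqz93&u=7")
acronym_hits = matched_pairs("Please read the FAQs before posting.")
clean_hits = matched_pairs("The quick brown fox is quite quiet.")

print(url_hits, acronym_hits, clean_hits)
```

The first two samples each contain one listed pair, matching the two kinds of false positive mentioned above, while ordinary words with "qu" match nothing.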

The filter is designed to be used with v1.75 of Declude with decoding left on (the default). It can be modified to work with older versions of Declude by changing the "attachment;" offset to "base64", in which case it won't detect any Base64 unless it is not appropriately tagged (useful).

I think this is a killer test. Enjoy.

Matt


# GIBBERISH
# Last Update: 09/12/2003
#
# Description:
# Finds gibberish in the body of the message, including comment blocks.  Will be
# triggered on any Base64 encoding due to how common Q combinations are.  A negative
# weight for attachments defeats the test, however inline base64 encoded content will
# receive full scoring.  The BASE64 test should be reduced by the score of this test
# in order to compensate for this fact.
#
# Usage:
# GIBBERISH     filter     C:\IMail\Declude\Gibberish.txt     x     5     0
#
# False Positives:
# Will result primarily from URLs containing random looking strings.  Known offenders
# include Buy.com and Yahoo! Groups.



# The following defeats the test if it finds an attachment.

BODY            -5      CONTAINS        attachment;


# Small list of letter combinations not found in a basic dictionary.

BODY            0       CONTAINS        qb
BODY            0       CONTAINS        qc
BODY            0       CONTAINS        qd
BODY            0       CONTAINS        qf
BODY            0       CONTAINS        qg
BODY            0       CONTAINS        qh
BODY            0       CONTAINS        qi
BODY            0       CONTAINS        qj
BODY            0       CONTAINS        qk
BODY            0       CONTAINS        qm
BODY            0       CONTAINS        qn
BODY            0       CONTAINS        qo
BODY            0       CONTAINS        qp
BODY            0       CONTAINS        qr
BODY            0       CONTAINS        qs
BODY            0       CONTAINS        qt
BODY            0       CONTAINS        qv
BODY            0       CONTAINS        qx
BODY            0       CONTAINS        qy
BODY            0       CONTAINS        qz

BODY            0       CONTAINS        vq
BODY            0       CONTAINS        wq
BODY            0       CONTAINS        tq
BODY            0       CONTAINS        jq

BODY            0       CONTAINS        xd
BODY            0       CONTAINS        xj
BODY            0       CONTAINS        xk
BODY            0       CONTAINS        xr
BODY            0       CONTAINS        xz

BODY            0       CONTAINS        zb
BODY            0       CONTAINS        zc
BODY            0       CONTAINS        zf
BODY            0       CONTAINS        zj
BODY            0       CONTAINS        zk
BODY            0       CONTAINS        zm
BODY            0       CONTAINS        zx
