http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927
Summary: Suggesting a rule to test for double Subject or double
From
Product: Spamassassin
Version: 3.1.2
Platform: All
OS/Version: All
Status: NEW
Severity: minor
Priority: P5
Component: Rules
AssignedTo: [email protected]
ReportedBy: [EMAIL PROTECTED]
After noticing a spam message with two Subject header fields that
got through, I tested all our site's mail traffic for a couple of days,
watching for messages with multiple occurrences of header fields
which (according to RFC 2822) may occur at most once.
Here is a suggested new rule:
header __DOUBLE_SUBJ ALL =~ /^Subject:.*^Subject:/smi
header __DOUBLE_FROM ALL =~ /^From:.*^From:/smi
meta DOUBLE_SUBJ_OR_FROM __DOUBLE_SUBJ || __DOUBLE_FROM
describe DOUBLE_SUBJ_OR_FROM Contains more than one Subject or From header
score DOUBLE_SUBJ_OR_FROM 2.0
Here is the analysis.
First, looking at message counts with multiple header fields:
count  multiple header fields present
-----  ------------------------------
  160  Subject
  173  From
  122  From AND Subject
  333  From OR Subject
   37  Subject AND NOT From
   52  From AND NOT Subject
   47  Message-ID
    6  Reply-To
    5  Sender
    6  To
    0  Cc
It seems that multiple Cc, To, Sender and Reply-To fields are infrequent
and probably not worth the trouble.
Multiple Message-ID fields occur more frequently, but according to the
attached diagram they seem to occur in non-spam mail as well(?), so a
rule for them could trigger false positives (but it may be useful to
re-evaluate this).
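If multiple Message-ID fields are re-evaluated, a sub-rule in the same
style as the two above could be used to collect hit statistics. This is
only a sketch: the name __DOUBLE_MSGID is made up for illustration and is
not part of the proposed rule, and the leading double underscore keeps it
unscored.

```
# Hypothetical sub-rule for evaluation only; not part of the proposal.
# Same pattern as __DOUBLE_SUBJ / __DOUBLE_FROM, applied to Message-ID.
header __DOUBLE_MSGID ALL =~ /^Message-ID:.*^Message-ID:/smi
```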
The presence of multiple From or multiple Subject header fields seems to
be a very good indication of spam, with not a single FP in my three-day
sample. The two messages that did score below 5 were manually re-checked
and turned out to be spam or a crippled spam message.
A remaining question is how to combine the __DOUBLE_SUBJ and
__DOUBLE_FROM tests: whether to score each one individually, or to score
a metarule on some combination of the two (OR, AND, AND NOT).
Manually checking messages that match 'Subject AND NOT From'
as well as 'From AND NOT Subject' doesn't make me believe these
two would be more useful than each rule individually.
Although 'From AND Subject' hits quite frequently, it doesn't have
fewer false positives or a better hit rate. It seems that 'From OR
Subject' covers most cases with good quality, which makes me suggest a
single DOUBLE_SUBJ_OR_FROM metarule rather than scoring the individual
DOUBLE_SUBJ / DOUBLE_FROM rules.
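For comparison, the other combinations considered above could be
written as metarules roughly as follows. This is a sketch only: the
three names are invented here for illustration, and they are kept as
unscored sub-rules (leading double underscore) so that they can be
counted without affecting the message score.

```
# Hypothetical sub-rules, for comparing hit rates only; no scores proposed.
meta __DOUBLE_SUBJ_AND_FROM   __DOUBLE_SUBJ && __DOUBLE_FROM
meta __DOUBLE_SUBJ_NOT_FROM   __DOUBLE_SUBJ && !__DOUBLE_FROM
meta __DOUBLE_FROM_NOT_SUBJ   __DOUBLE_FROM && !__DOUBLE_SUBJ
```

All three build on the __DOUBLE_SUBJ and __DOUBLE_FROM header tests
already defined in the suggested rule, so nothing else would need to
change.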
It would be interesting to see how automatic score assignment evaluates
the rule.
As an illustration, two diagrams are attached; the second one is just a
magnified detail of the left-hand side of the first one.
The x-axis shows the distribution (centiles) of all mail which hits each
rule, and the y-axis is the score that SA assigned to a message (SA 3.1.2,
all usual network tests enabled, bayes, razor, dcc, common SARE rules).