http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927

           Summary: Suggesting a rule to test for double Subject or double
                    From
           Product: Spamassassin
           Version: 3.1.2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P5
         Component: Rules
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


After noticing a spam message with two Subject header fields that
got through, I checked all of our site's mail traffic for a couple of days,
watching for messages with multiple occurrences of header fields
which (according to RFC 2822) may occur at most once.
Here is a suggested new rule:

header __DOUBLE_SUBJ  ALL =~ /^Subject:.*^Subject:/smi
header __DOUBLE_FROM  ALL =~ /^From:.*^From:/smi
meta     DOUBLE_SUBJ_OR_FROM  __DOUBLE_SUBJ || __DOUBLE_FROM
describe DOUBLE_SUBJ_OR_FROM Contains more than one Subject or From header
score    DOUBLE_SUBJ_OR_FROM 2.0
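
To try the patterns outside SpamAssassin, here is a minimal sketch in
Python. It assumes that Python's re.S, re.M and re.I flags behave like
Perl's /smi for these simple expressions, and that the ALL pseudo-header
presents one header field per line. The function name and the sample
messages are made up for illustration.

```python
import re

# Assumed Python equivalents of the two sub-rules above:
#   re.S lets '.' match newlines, re.M lets '^' match at every line start,
#   re.I ignores case (so 'subject:' is matched as well).
DOUBLE_SUBJ = re.compile(r"^Subject:.*^Subject:", re.S | re.M | re.I)
DOUBLE_FROM = re.compile(r"^From:.*^From:", re.S | re.M | re.I)


def double_subj_or_from(headers: str) -> bool:
    """Mimic the DOUBLE_SUBJ_OR_FROM metarule on a block of header text."""
    return bool(DOUBLE_SUBJ.search(headers) or DOUBLE_FROM.search(headers))


# Hypothetical header blocks, one field per line.
sample_double = (
    "From: a@example.com\n"
    "Subject: one\n"
    "To: b@example.com\n"
    "subject: two\n"
)
sample_single = (
    "From: a@example.com\n"
    "Subject: Re: Subject: test\n"  # 'Subject:' inside the value, not at line start
    "X-From: c@example.com\n"       # different field name, must not count as From
)

print(double_subj_or_from(sample_double))
print(double_subj_or_from(sample_single))
```

Because '^' is anchored to line starts, a "Subject:" string inside a field
value or a field such as X-From does not trigger the pattern in this sketch.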

Here is the analysis.
First, looking at message counts with multiple header fields:

  count  multiple header fields present
  -----  ------------------------------
  160    Subject  
  173    From
  122    From AND Subject
  333    From OR  Subject
  37     Subject AND NOT From
  52     From AND NOT Subject
  47     Message-ID
  6      Reply-To
  5      Sender
  6      To
  0      Cc
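
Counts of this kind could be gathered with a short script. The following
is only a sketch using Python's standard email package, not the method
actually used for the table above; the function names and the list of
fields are assumptions made for illustration.

```python
from collections import Counter
from email import message_from_string

# Fields which RFC 2822 allows at most once and which appear in the table.
SINGLETON_FIELDS = ("Subject", "From", "Message-ID", "Reply-To", "Sender", "To", "Cc")


def duplicated_fields(raw_message: str) -> set:
    """Return the names of the listed fields that occur more than once."""
    msg = message_from_string(raw_message)
    # get_all() matches field names case-insensitively.
    return {name for name in SINGLETON_FIELDS if len(msg.get_all(name, [])) > 1}


def tally(raw_messages) -> Counter:
    """Count messages per duplicated field, plus the From/Subject combinations."""
    counts = Counter()
    for raw in raw_messages:
        dup = duplicated_fields(raw)
        counts.update(dup)
        has_from, has_subj = "From" in dup, "Subject" in dup
        if has_from and has_subj:
            counts["From AND Subject"] += 1
        if has_from or has_subj:
            counts["From OR Subject"] += 1
        if has_subj and not has_from:
            counts["Subject AND NOT From"] += 1
        if has_from and not has_subj:
            counts["From AND NOT Subject"] += 1
    return counts


# Hypothetical sample messages.
messages = [
    "From: a@example.com\nSubject: x\nSubject: y\n\nbody\n",
    "From: a@example.com\nFrom: b@example.com\nSubject: x\nSubject: y\n\nbody\n",
    "From: a@example.com\nSubject: x\n\nbody\n",
]
for name, count in sorted(tally(messages).items()):
    print(f"{count:5d}  {name}")
```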

It seems like multiple Cc, To, Sender and Reply-To fields are infrequent
and probably not worth the trouble.

Multiple Message-ID fields occur more frequently, but according to the
attached diagram they seem to occur in non-spam mail as well(?), so a rule
on them could trigger false positives (but it may be useful to re-evaluate
this).

The presence of multiple From or multiple Subject header fields seems to be
a very good indication of spam, with not a single FP in my three-day
sample. The two messages that did score below 5 were manually re-checked
and turned out to be spam or a crippled spam message.

A remaining question is how to combine the __DOUBLE_SUBJ and __DOUBLE_FROM
tests: score each one individually, or score a metarule on some
combination of the two (OR, AND, AND NOT).

Manually checking messages that match 'Subject AND NOT From'
as well as 'From AND NOT Subject' doesn't make me believe these
two would be more useful than each rule individually.

Although 'From AND Subject' hits quite frequently, it has neither fewer
false positives nor an improved hit rate. It seems that 'From OR Subject'
covers most cases with good quality, which makes me suggest a single
DOUBLE_SUBJ_OR_FROM metarule rather than scoring the individual
DOUBLE_SUBJ / DOUBLE_FROM rules.

It would be interesting to see how automatic score assignment evaluates
the rule.

As an illustration, there are two diagrams attached; the second one
is just a magnified left-hand side detail of the first one.
The x-axis shows the distribution (centiles) of all mail which hits each
rule, and the y-axis is the score that SA assigned to a message (SA 3.1.2,
all usual network tests enabled, bayes, razor, dcc, common SARE rules).


