http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5041

           Summary: large mail of CType 'message/partial' takes a long time
                    to scan
           Product: Spamassassin
           Version: 3.1.4
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Rules
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


Pasting from a users thread from Mark Martinec:

--------------------------------------------------------------------

I recently noticed a couple of cases where SA (3.1.4 or earlier)
would take over a minute (instead of a few seconds) to check a 500 kB
message. Investigation revealed that these cases have one thing in
common: they were all message/partial chunks of a longish transfer of
some document or other data. Moreover, most of these cases were hitting
random sets of SARE or baseline rules, yielding false positives.

In case someone would suggest that Content-Type: message/partial
should be banned outright - well, that is a policy decision, and
if allowed, such mail should not bring SA to its knees at 0.5 MB.

Here is one example where a command-line 'spamassassin -t -D' would
run for 68 seconds. Timestamping each debug line produces the
following top-10 lines, sorted by elapsed time. The first column
is the time in seconds this line took to appear after the previous one:

1.935 dbg: rules: ran body rule SARE_RMML_Stock1 ======> got hit: "0TC"
2.204 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
3.695 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0il"
3.976 dbg: rules: ran body rule __NONEMPTY_BODY ======> got hit: "i"
4.021 dbg: rules: running raw-body-text per-line regexp tests; score ... 
6.397 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " Sjx"
8.225 dbg: bayes: tok_get_all: token count: 37175
8.254 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
9.682 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"
11.999 dbg: rules: running body-text per-line regexp tests; score so far=2.501

and another example:

2.396 dbg: rules: ran body rule DISGUISE_PORN_MUNDANE ======> got hit: "b0y"
2.424 dbg: rules: ran body rule __SARE_SPEC_LRD_COST4 ======> got hit: "134"
2.627 dbg: bayes: tok_get_all: token count: 36631
3.421 dbg: rules: running body-text per-line regexp tests; score so far=0.203
3.826 dbg: rules: ran body rule SARE_RMML_Stock9 ======> got hit: "0Il"
4.181 dbg: rules: running raw-body-text per-line regexp tests; score ... 
4.265 dbg: rules: ran body rule FB_NOT_SEX ======> got hit: " S8X"
8.113 dbg: rules: ran body rule FUZZY_XPILL ======> got hit: "XoNOgX"
9.308 dbg: rules: ran body rule __SARE_SPEC_LRD_COST5 ======> got hit: "169"
9.945 dbg: rules: ran body rule __SARE_SPEC_LRD_COST6 ======> got hit: "218"

I know some of these are SARE rulesets, but some are baseline rules
or bayes token parsing.
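
For anyone wanting to produce a similar per-line timing list, here is
a minimal sketch of a timestamping filter. It is not the script used
for the numbers above; the names stamp_lines, slowest and main are
made up for this illustration.

```python
import sys
import time


def stamp_lines(lines, clock=time.monotonic):
    """Yield (elapsed_seconds, line) pairs, where elapsed_seconds is the
    time that passed since the previous line arrived."""
    previous = clock()
    for line in lines:
        now = clock()
        yield now - previous, line.rstrip("\n")
        previous = now


def slowest(stamped, count=10):
    """Return the `count` entries with the largest elapsed time,
    in ascending order of elapsed time."""
    return sorted(stamped, key=lambda pair: pair[0])[-count:]


def main():
    # Hypothetical entry point: read debug lines from stdin as they arrive.
    for elapsed, line in slowest(stamp_lines(sys.stdin)):
        print(f"{elapsed:.3f} {line}")
```

Saved as a script that calls main() (say stamp.py, a hypothetical
name), it could be fed with something like
'spamassassin -t -D < msg 2>&1 | python3 stamp.py', assuming the
debug output arrives on stderr.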

Here is a relevant section/sample of one of these messages:

MIME-Version: 1.0
Content-Type: message/partial;
        total=22;
        id="[EMAIL PROTECTED]";
        number=21
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2900.2869
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.2869

f6idzxqa608aID8+YhwNSQwBpIrboHA0/zPfOP26mB6eONz70Xl12DwGVnAPemaaKaJyQk5ZKUwg
VC0sGYHLd543cICNa1piu8YgRJR0EaEK7GNVXvFSriat5dZwj7PNzQuOTO030bra7tBjROxbrVYR
XFStjnugVkyH27zqrvUdUsHYnLaVLdUuAxWH51QDV9/kc6vtIURcdUbthPszq12lj7Lt7rMAtVX7


So the problem is that these base64-encoded lines in a message/partial
chunk are treated as obfuscated text, which is very slow and produces
almost random hits on various rules. It also places some burden on
the SQL server (bayes: tok_get_all: token count: 37175).


Somewhat similar cases, which also hit various obfuscation rules
because their UU-encoding is mistaken for plain text, are mails
with attachments produced by Microsoft Office Outlook where the user
has the following setting chosen:

  Tools -> Options -> Mail Format -> Internet format: plain text options:
    (YES) Encode attachments in UUENCODE format
          when sending a plain text message

It would be nice if such encodings were recognized and at least
prevented rules that expect plain text from running and/or producing
false hits.
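
As a rough illustration of what "recognizing such encodings" could
mean, here is a minimal heuristic sketch. It is not SpamAssassin code;
the function name looks_encoded, the 80% threshold and the line-length
bounds are all assumptions chosen for the example.

```python
import re

# A full base64 line: only base64 alphabet characters, no spaces,
# 60 to 76 of them, optionally followed by up to two '=' padding chars.
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]{60,76}={0,2}$")

# A full uuencoded line: length character 'M' (45 bytes) followed by
# 60 characters in the range 0x20..0x60.
UU_FULL_LINE = re.compile(r"^M[\x20-\x60]{60}$")


def looks_encoded(lines, threshold=0.8):
    """Return True when at least `threshold` of the non-empty lines
    look like base64 or uuencoded data rather than plain text."""
    candidates = [ln.rstrip("\r\n") for ln in lines]
    candidates = [ln for ln in candidates if ln]
    if not candidates:
        return False
    hits = sum(
        1 for ln in candidates
        if BASE64_LINE.match(ln) or UU_FULL_LINE.match(ln)
    )
    return hits / len(candidates) >= threshold
```

A body for which looks_encoded() returns True could then be kept away
from rules that expect plain text. The cost is one extra pass over the
body, so whether it pays off in the common case would need measuring.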

  Mark

--------------------------------------------------------------------


When I run a scan on my laptop here, using svn trunk and the default
ruleset, it takes 25 seconds; still pretty slow.

Issue #1: I guess this comes down to how a message/partial is treated in common
MUAs; as far as I can see, it's not displayed as text, therefore we shouldn't
scan it as text.

Issue #2: A side issue is that the ReplaceTags rules perform pretty badly on
500 KiB files with 78-char, no-space lines.

Issue #3: an escape for UUEncoded messages.  We used to have this, but removed
it since it slowed down the common case to deal with the extremely rare case --
I seem to recall we checked our corpora, and none of us had a single UUE'd
message in over 5 years or so.  Has anyone used UUE in years?  If not, I'm -1,
even if Outlook stupidly still supports it.  (If we were to design SpamAssassin
based on MS product decisions, we'd be in as much of a mess as they are.)
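
For Issue #1, the check itself is cheap, since it only needs the
headers. Below is a minimal sketch of the idea using Python's standard
email package rather than SpamAssassin's own parser; the function name
partial_info is made up for the illustration.

```python
from email.parser import HeaderParser


def partial_info(raw_message):
    """Return the message/partial parameters of a raw message as a dict,
    or None when the message is not a message/partial chunk.

    Only the headers are parsed; the body is never inspected."""
    msg = HeaderParser().parsestr(raw_message)
    if msg.get_content_type() != "message/partial":
        return None
    number = msg.get_param("number")
    total = msg.get_param("total")  # may be absent on non-final chunks
    return {
        "id": msg.get_param("id"),
        "number": int(number) if number is not None else None,
        "total": int(total) if total is not None else None,
    }
```

A scanner could call this first and, when it returns something other
than None, skip the body text rules for that message.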


Mark -- may I upload that sample to this bug?  Without it, everyone
else will be unable to reproduce the issue, test fixes, etc.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
