[Bug 7115] New: Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

bugzilla-daemon Fri, 19 Dec 2014 11:38:42 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115


            Bug ID: 7115
           Summary: Adding SHA digests of MIME parts as Bayes tokens
                    allows bayes to 'see' non-textual content
           Product: Spamassassin
           Version: 3.4 SVN branch
          Hardware: All
                OS: All
            Status: NEW
          Severity: enhancement
          Priority: P2
         Component: Libraries
          Assignee: [email protected]
          Reporter: [email protected]

Created attachment 5262
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5262&action=edit
suggested change

As promised, here is an enhancement to the Bayes token-collecting code
so that it also takes into account message digests (currently SHA1) of
*each* leaf MIME part, regardless of whether it is textual or
non-textual, and including all alternative parts.

The idea is based on a suggestion (made earlier this year, May 2014)
by Andreas Schulze, who experimented with collecting and analyzing
MIME part digests in Amavis, with interesting results.
It seems to me a natural next step is to feed this data to the
existing Bayes classifier in SpamAssassin and let it do its magic.

Besides allowing Bayes to notice non-textual mail content such as
attached icons, photos, PDFs, Office documents, PowerPoint files,
and encrypted or compressed parts, it also 'sees' textual parts
'as a whole', including parts that are ASCII-art only, mostly
empty, etc.

The code is fairly straightforward: it just takes advantage of the
existing Base64 and quoted-printable decoding, and of the existing
Digest::SHA or older Digest::SHA1 module, the same one already
used by the Bayes plugin.
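
To illustrate the idea only (this is not the attached patch, which is
Perl inside SpamAssassin), here is a minimal Python sketch using the
standard email and hashlib modules. The function name
leaf_part_digests is made up for this example. It walks all leaf MIME
parts, undoes Base64 or quoted-printable transfer encoding, and takes
a SHA1 digest of the decoded content.

```python
import hashlib
from email import message_from_bytes


def leaf_part_digests(raw_message: bytes) -> list:
    """Return a list of (content_type, sha1_hex) pairs, one per leaf MIME part.

    Hypothetical helper for illustration; not SpamAssassin's actual API.
    """
    msg = message_from_bytes(raw_message)
    results = []
    # walk() visits every part, including all branches of multipart/alternative
    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts are not leaves
        # decode=True undoes Base64 / quoted-printable transfer encoding
        body = part.get_payload(decode=True) or b""
        digest = hashlib.sha1(body).hexdigest()
        results.append((part.get_content_type(), digest))
    return results
```

Because the digest is taken over the decoded bytes, the same attachment
sent once as Base64 and once as quoted-printable should produce the
same digest.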

If a caller already has MIME part digests computed, it may
pass them to SpamAssassin and avoid duplicate processing.
This also makes it possible for SpamAssassin's Bayes to notice
digests of *all* MIME parts, even when a message is very large
and only partly passed (truncated) to SpamAssassin.

Early results are encouraging. Observing the top 5 bayes tokens
as reported by macros HAMMYTOKENS and SPAMMYTOKENS, after a day
or two (with Bayes auto-learning enabled) one can start noticing
interesting spammy tokens like empty or mostly-empty text/plain
parts, virus attachments, or hammy tokens like season's greeting
comics being passed around among friends these days, or business
documents.

Btw, initially I used digests directly as Bayes tokens. That is
mostly fine, except in the case of empty or mostly empty MIME parts,
where it seems more appropriate to distinguish, for example,
an empty text/plain from an empty text/html and an empty text/xml.
So I ended up with a Bayes token consisting of a MIME part digest
concatenated with the Content-Type of the MIME part, which now makes
more sense.
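
A small Python sketch of why the Content-Type helps. The helper name
part_token, the ':' separator, and the lower-casing of the type are
assumptions made for this example; the exact token format used by the
patch is not shown here.

```python
import hashlib


def part_token(content_type: str, body: bytes) -> str:
    """Build an illustrative Bayes token from a part's digest and Content-Type.

    The 'digest:type' layout is an assumption, not the patch's real format.
    """
    digest = hashlib.sha1(body).hexdigest()
    return f"{digest}:{content_type.strip().lower()}"


# Two empty parts share one SHA1 digest but yield distinct tokens.
plain_token = part_token("text/plain", b"")
html_token = part_token("text/html", b"")
print(plain_token)
print(html_token)
```

With the digest alone both empty parts would collapse into a single
token; with the type appended, Bayes can learn them separately.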

During testing a couple of inconsistencies were discovered, such as
non-compliant QP decoding in MS::Utils (now fixed), mangling
of Content-Type containing dots (now fixed), and breakage done
intentionally by the MIME parser in MS::Message (like splitting long
lines into multiple lines, or deleting sequences of more than 20 empty
lines), which I have not touched but which warrants reconsideration.
Also, it seems that complete first sections of delivery reports are
being discarded by the MIME parser; this needs to be investigated.

-- 
You are receiving this mail because:
You are the assignee for the bug.
