[Bug 7115] Adding SHA digests of MIME parts as Bayes tokens allows bayes to 'see' non-textual content

bugzilla-daemon Tue, 23 Dec 2014 16:26:08 -0800

https://issues.apache.org/SpamAssassin/show_bug.cgi?id=7115


--- Comment #6 from Mark Martinec <[email protected]> ---
Created attachment 5263
  --> https://issues.apache.org/SpamAssassin/attachment.cgi?id=5263&action=edit
added configurability

> > I'd REALLY like to see this extra tokenizing as a switchable option.
> Will do something along these lines.

Here it comes. The patch adds a config option and makes each source
of input to Bayes conditional on it. Most of the diff is due to
indentation changes, more consistent variable names, and some cosmetics.

This is the added documentation (man Mail::SpamAssassin::Conf):


bayes_token_sources  (default: header visible invisible uri)

  Controls which sources in a mail message can contribute tokens
  (e.g. words, phrases) to a Bayes classifier. The argument is
  a space-separated list of keywords (header, visible, invisible,
  uri, mimepart), each of which may be prefixed by "no" to indicate
  its exclusion. Additionally, two reserved keywords are allowed: all
  and none (or: noall). The list of keywords is processed
  sequentially: the keyword all adds all available keywords to the set
  being built, none or noall clears the set, other non-negated
  keywords are added to the set, and negated keywords are removed
  from the set. Keywords are case-insensitive.
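
The sequential keyword processing described above can be sketched in
Python as follows. This is only an illustration, not the code from the
attached patch (which is Perl). The function name apply_token_sources
is hypothetical, and two behaviours are assumptions: that processing
starts from the default set, and that an unknown keyword is rejected
with an error.

```python
# Illustrative sketch only; not taken from the SpamAssassin sources.
ALL_SOURCES = ("header", "visible", "invisible", "uri", "mimepart")
DEFAULT_SOURCES = frozenset({"header", "visible", "invisible", "uri"})


def apply_token_sources(argument, current=DEFAULT_SOURCES):
    """Apply one bayes_token_sources argument to a set of enabled sources.

    `current` is the set built so far (assumed to start at the default),
    so repeated directives can be chained by passing the previous result.
    """
    enabled = set(current)
    for word in argument.split():
        kw = word.lower()                  # keywords are case-insensitive
        if kw == "all":
            enabled.update(ALL_SOURCES)    # add every available keyword
        elif kw in ("none", "noall"):
            enabled.clear()                # clear the set
        elif kw in ALL_SOURCES:
            enabled.add(kw)                # non-negated keyword: add
        elif kw.startswith("no") and kw[2:] in ALL_SOURCES:
            enabled.discard(kw[2:])        # negated keyword: remove
        else:
            # Assumption: unknown keywords are treated as an error.
            raise ValueError(f"unknown keyword: {word!r}")
    return enabled
```

For example, apply_token_sources("none header uri") leaves only header
and uri enabled, regardless of what the set contained before.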

  The default set is: header visible invisible uri, which is
  equivalent, for example, to: All NoMIMEpart. The reason mimepart
  is not currently in the default set is that it is a newer source
  (introduced with SpamAssassin version 3.4.1) and not much
  experience has yet been gathered regarding its usefulness.

  See also the option "bayes_ignore_header" for fine-grained control
  over individual header fields, all of which fall under the more
  general keyword header here.

  Keywords imply the following data sources:

    header - tokens collected from a message header section
    visible - words from visible text (plain or HTML) in a message body
    invisible - hidden/invisible text in HTML parts of a message body
    uri - URIs collected from a message body
    mimepart - digests (hashes) of all MIME parts (textual or non-
      textual) of a message, computed after Base64 and quoted-printable
      decoding, suffixed by their Content-Type
    all - adds all the above keywords to the set being assembled
    none or noall - removes all keywords from the set
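
To make the mimepart source more concrete, here is a rough Python
sketch of how a digest-plus-Content-Type token could be derived for
each MIME part after transfer decoding. It is not the patch's code:
the function name mimepart_tokens, the choice of SHA-1, the
"digest:content-type" token layout, and the skipping of multipart
containers are all assumptions made for illustration.

```python
import hashlib
from email import message_from_bytes


def mimepart_tokens(raw_message):
    """Return one illustrative token per leaf MIME part of a raw message."""
    msg = message_from_bytes(raw_message)
    tokens = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # assumption: only leaf parts are hashed
        # decode=True undoes Base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True) or b""
        digest = hashlib.sha1(payload).hexdigest()  # assumption: SHA-1
        # assumption: the token is the digest suffixed by the Content-Type
        tokens.append(f"{digest}:{part.get_content_type()}")
    return tokens
```

Because the digest is taken over the decoded bytes, the same attachment
yields the same token whether it was sent as Base64 or quoted-printable.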

  The "bayes_token_sources" directive may appear multiple times; its
  keywords are interpreted sequentially, adding or removing items
  from the final set in the order in which they appear in the
  "bayes_token_sources" directive(s).

