I was just trawling through lib/Mail/SpamAssassin/Plugin/Bayes.pm
looking at the tokenizing code.  There are a lot of headers we ignore,
and I'm not sure ignoring them is wise.

Specifically:

Subject is specifically called out as useful in Paul Graham's essays.
(The comment next to it says "not worth a tiny gain vs. to db size
increase".  Is that still relevant given today's increased computing
resources?)

Date is excluded due to possible skew between the spam and ham
training sets.  That's a valid point, but we lose oddities like a time
zone of -0530 (India is +0530; no real zone sits at -0530) and
alternate formatting, not to mention the ability to discriminate
between time zones, which should be almost as good as discriminating
between countries (something Bayes can do safely while SA
rules/plugins can't!).
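
If we did tokenize Date, even just the numeric offset would capture
the oddities above.  Here is a rough Python sketch (not Perl, just to
show the idea); the "HDate-tz" token prefix is invented for
illustration, not anything SA emits today:

```python
import re

def date_tz_token(date_value):
    """Hypothetical helper: pull the numeric UTC offset out of a Date
    header value and turn it into a Bayes token, so impossible zones
    like -0530 become visible to the classifier."""
    m = re.search(r'([+-]\d{4})(?:\s|$)', date_value)
    return "HDate-tz:" + m.group(1) if m else None
```

A comment like "(IST)" after the offset doesn't confuse it, since the
regex anchors on the signed four-digit group.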

The whole List headers section is also skipped (the comment says
"ignore. a spamfiltering mailing list will become a nonspam sign", and
I don't understand that sentence).


I also see some Bayes tweaks that reference the SA-dev mailing list
archives but pre-date the Apache mailing lists (which launched around
New Year's 2004).  Are those archives still around somewhere?  Where?


The reason I did this digging was to investigate how we tokenize URLs.
It looks like we take the URL wholesale and then tokenize the "domain"
by peeling off one dot-delimited chunk at a time, left to right:

    if ($region == 1 || $region == 2) {
      if (CHEW_BODY_MAILADDRS && $token =~ /\S\@\S/i) {
        push (@rettokens, $self->_tokenize_mail_addrs ($token));
      }
      elsif (CHEW_BODY_URIS && $token =~ /\S\.[a-z]/i) {
        push (@rettokens, "UD:".$token); # the full token
        my $bit = $token; while ($bit =~ s/^[^\.]+\.(.+)$/$1/gs) {
          push (@rettokens, "UD:".$1); # UD = URL domain
        }
      }
    }

(CHEW_BODY_MAILADDRS is a tweak whose notes are archived at May 12 2003
on the old "SpamAssassin-devel archives")

So, given http://www.foo-bar.example.net/baz.pl?file=oops.txt&id=7, we
get the following tokens:

  * UD:http://www.foo-bar.example.net/baz.pl?file=oops.txt&id=7
  * UD:foo-bar.example.net/baz.pl?file=oops.txt&id=7
  * UD:example.net/baz.pl?file=oops.txt&id=7
  * UD:net/baz.pl?file=oops.txt&id=7
  * UD:pl?file=oops.txt&id=7
  * UD:txt&id=7
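
To make the peeling behavior concrete, here is a small Python sketch
of the same loop (assuming the token is the bare URL string, as
above); it reproduces exactly those six tokens:

```python
import re

def peel_domain_tokens(token):
    """Mimic the Perl peel-off loop: emit the full token, then
    repeatedly strip everything up to and including the first dot."""
    tokens = ["UD:" + token]
    bit = token
    while True:
        m = re.match(r'^[^.]+\.(.+)$', bit, re.S)
        if not m:
            break
        bit = m.group(1)
        tokens.append("UD:" + bit)
    return tokens
```

Note how the loop happily chews past the path and query string, which
is why we end up with junk tokens like UD:txt&id=7.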

I think it might make sense to instead use a more logical breakdown:
protocol, domain, TLD, port, path, search, and hash (inspired by the
properties of JavaScript's window.location
<https://developer.mozilla.org/en-US/docs/DOM/window.location#Properties>).
For that same URL, I am proposing these tokens:

  * UD:http://www.foo-bar.example.net/baz.pl?file=oops.txt&id=7
  * UDpro:http
  * UDdom:www.foo-bar.example.net
  * UDdom:www
  * UDdom:foo
  * UDdom:bar
  * UDdom:example.net (extract this part with the code we already have
    for URIBL lookups)
  * UDdom:example
  * UDdoms:2 (counting the levels under the registered domain; a.b.co.uk
    is 2, not 3)
  * UDtld:net
  * UDp:baz.pl
  * UDp:baz
  * UDp:pl
  * UDs:file=oops.txt&id=7
  * UDs:file=oops.txt
  * UDs:file
  * UDs:oops.txt
  * UDs:oops
  * UDs:txt
  * UDs:id=7
  * UDs:id (too short, skip)
  * UDs:7 (too short, skip)

UDport:8080 and UDh:blah would be extracted from
http://example.com:8080#blah; UDport:443 would be extracted from
https://example.com:443, since the port is explicitly included (even
though it's the default for https).
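
Here is a Python sketch of that breakdown (again, just to show the
shape; a real patch would be Perl).  The registered-domain tokens
(UDdom:example.net, UDdoms:n) are omitted because they need the
public-suffix logic that the URIBL code already has, and the query
sub-tokens are left as one UDs token:

```python
from urllib.parse import urlsplit

def structured_url_tokens(url):
    """Split a URL into protocol/domain/TLD/port/path/query/fragment
    tokens, loosely modeled on window.location properties."""
    parts = urlsplit(url)
    tokens = ["UD:" + url, "UDpro:" + parts.scheme]
    host = parts.hostname or ""
    tokens.append("UDdom:" + host)
    labels = host.split(".")
    for label in labels[:-1]:
        # hyphenated labels like foo-bar also yield their pieces
        tokens.extend("UDdom:" + piece for piece in label.split("-"))
    tokens.append("UDtld:" + labels[-1])
    if parts.port is not None:          # only when explicitly written
        tokens.append("UDport:%d" % parts.port)
    leaf = parts.path.rsplit("/", 1)[-1]
    if leaf:
        tokens.append("UDp:" + leaf)
        tokens.extend("UDp:" + piece for piece in leaf.split("."))
    if parts.query:
        tokens.append("UDs:" + parts.query)
    if parts.fragment:
        tokens.append("UDh:" + parts.fragment)
    return tokens
```

For the example URL this yields UDpro:http, UDdom:www / foo / bar /
example, UDtld:net, UDp:baz.pl / baz / pl, and the UDs query token.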

Another idea:  further break UDs into attribute and value.  So instead
of the UDs lines above, we'd have:

  * UDs:file=oops.txt&id=7
  * UDa:file
  * UDv:oops.txt
  * UDv:oops
  * UDv:txt
  * UDa:id
  * UDv:7 (too short, skip)
  * UDas:2 (number of attributes)
  * UDvs:2 (number of non-empty values, including skipped ones;
    http://example.com?a=&b&c= has UDas:3 and UDvs:0)
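
A Python sketch of those rules as I read them (the minimum length of
2 and the "count but don't emit" behavior for short values are my
guesses at the examples above):

```python
def query_tokens(query, min_len=2):
    """Split a query string into UDa (attribute) and UDv (value)
    tokens.  Assumed rules: every attribute is counted and emitted,
    every non-empty value is counted, and values shorter than
    min_len are counted but not emitted."""
    tokens = []
    n_attrs = n_vals = 0
    for pair in query.split("&"):
        attr, _, val = pair.partition("=")
        if attr:
            n_attrs += 1
            tokens.append("UDa:" + attr)
        if val:
            n_vals += 1
            if len(val) >= min_len:
                tokens.append("UDv:" + val)
                if "." in val:
                    # dotted values also yield their pieces, like UDp
                    tokens.extend("UDv:" + p for p in val.split("."))
    tokens.append("UDas:%d" % n_attrs)
    tokens.append("UDvs:%d" % n_vals)
    return tokens
```

For file=oops.txt&id=7 this produces exactly the token list above
(minus the skipped UDv:7), and a=&b&c= gives UDas:3 / UDvs:0.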

That will create a LOT more tokens than we have now, so we'll have to
test capacity and make sure the db can handle it.  That may also suggest
increasing the minimum token counts from 200 hams and 200 spams.
