As has already been mentioned, Theo is patching SpamAssassin to
locally whitelist some common whitehat URI domains for use in
URIBL (which typically uses SBL and SURBL data).  This will
prevent DNS queries on the whitehats and probably save
significant traffic on the SURBL, Spamhaus, etc. name
servers.

In order to get some better whitehat data, we increased the
sampling of DNS queries on a name server from 32k (2k every 3
hours for 2 days) to 1.2 million (10k every 2 hours for 10
days).  We're only about a third of the way through the initial
10 days, so the stats are still building up, but the current
results are at:

  http://www.surbl.org/dns-queries.whitelist.counts.txt
  http://www.surbl.org/dns-queries.blocklist.counts.txt

(These files have been mentioned before, but they're starting
to get a lot more data behind them now.)

Something else that was probably suggested before, but which we
*hadn't looked at before*, is the set of DNS queries that *don't
match* either our blocklists or whitelists.  Those, sorted in
order of decreasing frequency, are at:

  http://www.surbl.org/dns-queries.unmatched.30thpercentile.txt

That's the top 30th percentile of them (about 3.6k records).
The full list of unique domains and IPs with frequencies
(about 110k records) is at:

  http://www.surbl.org/dns-queries.unmatched.count.txt
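
For anyone who wants to build a similar tally from their own
query logs, here is a minimal sketch in Python.  It is not the
script used to produce the files above.  The function name, the
sample lists, and the assumption that the queried names have
already had the list zone suffix stripped are all illustrative.

```python
from collections import Counter

def count_unmatched(queries, blocklist, whitelist):
    """Return (name, count) pairs for queried names that appear on
    neither list, sorted by decreasing frequency."""
    known = set(blocklist) | set(whitelist)
    # Normalize case and drop any trailing root dot before counting.
    counts = Counter(q.lower().rstrip(".") for q in queries)
    unmatched = [(name, n) for name, n in counts.items() if name not in known]
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(unmatched, key=lambda item: (-item[1], item[0]))

# Hypothetical sample data: two made-up list entries plus two names
# taken from the unmatched list in this post.
queries = [
    "beliefnet.com", "zdnet.com", "beliefnet.com", "blocked.example",
    "ZDNet.com.", "white.example", "beliefnet.com",
]
result = count_unmatched(queries, {"blocked.example"}, {"white.example"})
for name, n in result:
    print(f"{n}\t{name}")
```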

Taking a look at the top few unmatched entries:

333     56.227.117.38
211     internet.e
196     wwwlowmortnow.info
123     beliefnet.com
119     grisoft.com)
107     specialmax.net
99      democrats.org
99      115.14.249.209
96      and
90      c
82      centrport.net
78      charter.net
73      zdnet.com
65      cf.st
63      nuri1.net
62      red-hot1.com
62      imomentum.net
61      justsaywow.com
60      173.213.115.211
59      www
57      www.cool-loanco.kr
57      superduperfun.com
53      healthinsrus.com
51      e-directnet.net
51      agoramail.net
50      tmcs.net
50      latimes.com
50      dw.com.com
50      168.228.186.64
49      iscsimg.com
48      livedaily.com
48      eversave.com
47      1shoppingcart.com
46      srvimg.com
46      realone.com
46      goodnewsdelivery.com
45      rockbridgemedia.com
45      purdue.edu

It's clear that a few are errors, probably due to problems in the
applications using SURBLs.  Yet it's probably useful to not
suppress the errors so that the programs can be updated to handle
them correctly.  (Unfortunately the source URIs generating the
errors are not directly available, but they may be identifiable
in other ways if anyone would like to look for them.)
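
One rough way to pick out the obvious errors is a syntax check
on each entry.  The sketch below is only an assumption about
what counts as well-formed (a dotted-quad IPv4 address, or a
name with at least two labels and an alphabetic TLD of two or
more letters); the function name is made up for illustration.
It will not catch entries that are syntactically valid but still
wrong, such as wwwlowmortnow.info with its missing dot.

```python
import ipaddress
import re

# One DNS label: letters, digits, hyphens; no leading/trailing hyphen.
LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")
# Assumed TLD shape: alphabetic, at least two letters.
TLD = re.compile(r"[a-z]{2,}")

def looks_valid(entry):
    """Return True if entry is an IPv4 address or a plausible domain name."""
    try:
        ipaddress.IPv4Address(entry)
        return True
    except ValueError:
        pass  # not an IP address; fall through to the hostname check
    labels = entry.lower().split(".")
    if len(labels) < 2:
        return False
    return (all(LABEL.fullmatch(label) for label in labels)
            and TLD.fullmatch(labels[-1]) is not None)

# Entries copied from the top of the unmatched list.
entries = [
    "56.227.117.38", "internet.e", "wwwlowmortnow.info", "beliefnet.com",
    "grisoft.com)", "and", "c", "cf.st", "www", "dw.com.com",
]
suspect = [e for e in entries if not looks_valid(e)]
print(suspect)
```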

Minus the errors, I fed this list into Ryan's GetURI to see what
it could find.  The results are at:

  http://ry.ca/cgi-bin/geturi.cgi?id=ham-5lCTzHkxan3xE38RKHa0vx

Quite a few appear ok to whitelist, like democrats.org,
purdue.edu, latimes.com, charter.net, zdnet.com, etc., and I'll
probably go ahead and whitelist obvious ones like these, so some
of them will probably have moved off this "unmatched" list and
onto the whitelist hits by the time you read this.

Nonetheless I recommend we all take a look at this unmatched
list periodically, especially the top few dozen, to look for
potential domains to whitelist or blacklist.  These most
frequently appearing domains are probably good candidates for
one or the other.

Since this is a list of the unknown "wild" domains coming
from live, real-world message URIs, it may be another useful,
independent source of data.

Cheers,

Jeff C.
--
"If it appears in hams, then don't list it."
