Following up on the earlier check of DMOZ domains against SURBL
data, I applied some of Quinlan's suggestions and grabbed
different revisions of the DMOZ data, then joined (intersected)
them in order to eliminate changes such as editor-removed
spammer/abuser domains.  This also means that new additions are
ignored, but the corpus is so large that the benefit of capturing
editor removals is probably more important than capturing
additions.

The snapshots used are the three most recent available (file
dates 9/9/04, 9/25/04 and 10/7/04), intersected to leave fewer
records (only those that are constant across all three
snapshots):

  http://rdf.dmoz.org/rdf/archive/
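
For anyone curious about the mechanics, here is a rough Python
sketch of that intersection.  The per-snapshot file names are
placeholders (not the actual archive names), and each file is
assumed to hold one extracted domain per line:

  # Keep only domains present in all three DMOZ snapshots,
  # writing the result as a sorted list like dmoz.srt.
  def read_domains(path):
      with open(path) as f:
          return {line.strip().lower() for line in f if line.strip()}

  snapshots = ["dmoz-20040909.domains", "dmoz-20040925.domains",
               "dmoz-20041007.domains"]
  stable = set.intersection(*(read_domains(p) for p in snapshots))

  with open("dmoz.srt", "w") as out:
      for domain in sorted(stable):
          out.write(domain + "\n")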

Here are the line, word and character counts:

 2300851 2300851 38065969 dmoz.srt

    1169   11690  123909 dmoz-blocklist.summed.txt
    1141   11410  120860 dmoz-blocklist.ws
    1169    1169   17977 dmoz-blocklist.txt
    7394    7394   97011 dmoz-whitelist.txt

The above are revised versions of the joins: against the
blocklists with list info (summed), the ws hits, just the
blocklisted domains, and the hits against the existing
whitelist.
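
The joins themselves are just domain-by-domain lookups of the
intersected dmoz list against blocklist and whitelist data.  A
minimal sketch, showing only the plain domain joins and assuming
simple one-domain-per-line input files (the blocklist/whitelist
file names here are illustrative, not the actual SURBL data):

  # Which stable DMOZ domains appear on the blocklist / whitelist?
  def read_domains(path):
      with open(path) as f:
          return {line.strip().lower() for line in f if line.strip()}

  dmoz      = read_domains("dmoz.srt")
  blocklist = read_domains("blocklist-domains.txt")
  whitelist = read_domains("whitelist-domains.txt")

  with open("dmoz-blocklist.txt", "w") as f:
      f.writelines(d + "\n" for d in sorted(dmoz & blocklist))
  with open("dmoz-whitelist.txt", "w") as f:
      f.writelines(d + "\n" for d in sorted(dmoz & whitelist))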

These are in the whitelists directory, though we are still
*not* applying the dmoz domains as whitelists:

  http://spamcheck.freeapp.net/whitelists/

The previous dmoz data (the 9/25 version only, IIRC) and its
hits are archived as:

 2326173 2326173 38494184 dmoz.srt1

    1338   13380  141946 dmoz-blocklist1.summed.txt
    1173   11730  124298 dmoz-blocklist1.ws
    1338    1338   20533 dmoz-blocklist1.txt
    7375    7375   96720 dmoz-whitelist1.txt
__

I was also able to grab four snapshots of wikipedia, all
sections (all languages).  Only the two most recent snapshots
had comparable numbers of sections, so I used only those two.
Using only two sets may be OK, since these are much smaller
corpora and there's probably more hand-editing of them.

  http://download.wikimedia.org/
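
Getting the domains out of the dumps amounts to pulling the
hostnames from any http URLs in the text.  This is just a sketch
of the general approach, with an assumed dump file name, and not
necessarily how the lists below were produced:

  # Collect external-link hostnames from a (plain-text) wikipedia dump.
  import re
  from urllib.parse import urlparse

  url_re = re.compile(r'https?://[^\s\]"<>]+')
  domains = set()

  with open("wikipedia-dump.txt", encoding="utf-8", errors="replace") as f:
      for line in f:
          for url in url_re.findall(line):
              host = urlparse(url).hostname
              if host:
                  domains.add(host.lower())

  with open("wikipedia.srt", "w") as out:
      for d in sorted(domains):
          out.write(d + "\n")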

Where dmoz has about 2.3 million domains, wikipedia has about
174k domains:

  173828  173828 2633441 wikipedia.srt

     188    1880   19631 wikipedia-blocklist.summed.txt
     188     188    2713 wikipedia-blocklist.txt
    2437    2437   29581 wikipedia-whitelist.txt
__

I also took the intersection of the three dmoz snapshots and
the two wikipedia snapshots to get a smaller list containing only
the ~102k domains found in both wikipedia and dmoz:

  101619  101619 1498653 wikipedia-dmoz.srt

     116    1160   11928 wikipedia-dmoz-blocklist.summed.txt
     116     116    1591 wikipedia-dmoz-blocklist.txt
    2223    2223   26854 wikipedia-dmoz-whitelist.txt
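
Since both .srt files are already sorted with one domain per
line, the two-way intersection can be done as a comm(1)-style
streaming merge; here's a sketch of that (again illustrative,
not necessarily the exact steps used):

  # comm -12 style intersection of two sorted, unique domain files.
  def intersect_sorted(path_a, path_b, out_path):
      with open(path_a) as a, open(path_b) as b, open(out_path, "w") as out:
          da, db = a.readline(), b.readline()
          while da and db:
              if da == db:
                  out.write(da)
                  da, db = a.readline(), b.readline()
              elif da < db:
                  da = a.readline()
              else:
                  db = b.readline()

  intersect_sorted("dmoz.srt", "wikipedia.srt", "wikipedia-dmoz.srt")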

This intersection of dmoz and wikipedia domains probably
represents the best hope for large whitelist additions so far.
The data is probably imperfect, but at least it has had some
checking by human editors, and some techniques were applied to
reduce spammer domains: comparing the snapshots over time and
intersecting the two relatively unrelated sources.

Can anyone think of any other hand-edited databases,
directories, encyclopedias, etc. of URIs of hopefully legitimate
(non-spammer) domains that are publicly available?  Please think
about it a little, and speak up!

While 102k domains isn't nearly as large as the 2.3M in dmoz,
it's certainly more than the 12k or so whitelist records we
currently have.  How does the intersected list look as a
potential whitelist?

  http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz.srt

Please also take a look at these blocklist hits (potential FPs)
and share what you think:

  http://spamcheck.freeapp.net/whitelists/wikipedia-dmoz-blocklist.summed.txt

Would there be many FNs (missed spams) if we whitelisted all
of these?  In other words, are these all truly False Positives?
If not, which ones do you feel are true spammers, and why?

Jeff C.
--
"If it appears in hams, then don't list it."
