It appears that some of the bb* corpora are extremely old and no longer representative of modern mail. Would anyone object if I went ahead and cleaned them up a bit? Proposed changes are below. Yes, this would shrink the ham sample size, but my active masscheck recruiting should grow it again, and I think we're better off with quality data from more recent ham than with a large quantity of old ham.

net-bb-doc
            Spam messages  Score range  Ham messages   Score range
in 2008     0                           1 (0%)         [2,2]
TOTAL:      0                           1 (0%)         [2,2]

Remove.

net-bb-fredt
            Spam messages  Score range  Ham messages   Score range
in 2005     0                           108 (0%)       [0,5]
in 2006     0                           459 (0%)       [0,10]
TOTAL:      0                           567 (0%)       [0,10]

Remove.

net-bb-jhardin
            Spam messages  Score range  Ham messages   Score range
in 1997     0                           7 (0%)         [1,3]
in 1998     0                           18 (0%)        [0,6]
in 1999     0                           166 (0%)       [0,6]
in 2000     0                           158 (0%)       [0,7]
in 2001     0                           262 (0%)       [-2,6]
in 2002     0                           153 (0%)       [-2,5]
in 2003     0                           56 (0%)        [-2,3]
in 2004     0                           39 (0%)        [0,3]
in 2005     0                           564 (0%)       [-2,6]
in 2006     0                           785 (0%)       [-2,10]
in 2007     0                           841 (0%)       [-1,9]
in 2008     0                           958 (0%)       [-4,7]
in 2009     0                           1392 (0%)      [-6,8]
in 2010-01  0                           117 (0%)       [-6,4]
in 2010-02  0                           95 (0%)        [-3,3]
in 2010-03  0                           124 (0%)       [-6,6]
in 2010-04  0                           141 (0%)       [-6,3]
in 2010-05  0                           131 (0%)       [-5,3]
in 2010-06  0                           194 (0%)       [-5,3]
in 2010-07  59 (0%)        [-3,27]      190 (0%)       [-5,3]
in 2010-08  117 (0%)       [-2,25]      146 (0%)       [-4,4]
in 2010-09  135 (0%)       [-1,22]      130 (0%)       [-3,4]
in 2010-10  139 (0%)       [-4,26]      94 (0%)        [-3,2]
in 2010-11  173 (0%)       [-4,23]      62 (0%)        [-3,2]
in 2010-12  185 (0%)       [-4,27]      56 (0%)        [0,2]
in 2011-01  27 (0%)        [-4,17]      14 (0%)        [0,1]
TOTAL:      835 (0%)       [-4,27]      6893 (2%)      [-6,10]

jhardin, perhaps remove 1997-2006?

net-bb-trec_enron
            Spam messages  Score range  Ham messages   Score range
in 2001     0                           27880 (11%)    [-4,17]
in 2002     0                           10681 (4%)     [-3,11]
in 2008     0                           285 (0%)       [0,4]
TOTAL:      0                           38846 (16%)    [-4,17]

Remove all. This is throwing off some of the non-reuse DNSBL tests.

net-bb-jm
            Spam messages  Score range  Ham messages   Score range
in 2006     0                           30130 (12%)    [-14,6]
in 2007     0                           39088 (16%)    [-15,7]
in 2008     0                           24619 (10%)    [-15,8]
in 2009     0                           5322 (2%)      [-14,7]

JM, remove 2006?

net-bb-zmi
            Spam messages  Score range  Ham messages   Score range
in 2008     0                           1 (0%)         [2,2]
TOTAL:      0                           1 (0%)         [2,2]

Remove.

Warren
