It appears that some of the bb* corpora are extremely old and no
longer representative of modern mail.  Would anyone object if I went
ahead and cleaned them up a bit?  Proposed changes below.  Yes, this
would shrink the ham sample size, but my active masscheck recruiting
should grow that, and I think we're better off with quality data from
more recent ham than quantity of old ham.

net-bb-doc       Spam messages    Score range    Ham messages     Score range
  in 2008              0                               1   (0%)   [2,2]
  TOTAL:               0                               1   (0%)   [2,2]

Remove.

net-bb-fredt     Spam messages    Score range    Ham messages     Score range
  in 2005              0                             108   (0%)   [0,5]
  in 2006              0                             459   (0%)   [0,10]
  TOTAL:               0                             567   (0%)   [0,10]

Remove.

net-bb-jhardin   Spam messages    Score range    Ham messages     Score range
  in 1997              0                               7   (0%)   [1,3]
  in 1998              0                              18   (0%)   [0,6]
  in 1999              0                             166   (0%)   [0,6]
  in 2000              0                             158   (0%)   [0,7]
  in 2001              0                             262   (0%)   [-2,6]
  in 2002              0                             153   (0%)   [-2,5]
  in 2003              0                              56   (0%)   [-2,3]
  in 2004              0                              39   (0%)   [0,3]
  in 2005              0                             564   (0%)   [-2,6]
  in 2006              0                             785   (0%)   [-2,10]
  in 2007              0                             841   (0%)   [-1,9]
  in 2008              0                             958   (0%)   [-4,7]
  in 2009              0                            1392   (0%)   [-6,8]
  in 2010-01           0                             117   (0%)   [-6,4]
  in 2010-02           0                              95   (0%)   [-3,3]
  in 2010-03           0                             124   (0%)   [-6,6]
  in 2010-04           0                             141   (0%)   [-6,3]
  in 2010-05           0                             131   (0%)   [-5,3]
  in 2010-06           0                             194   (0%)   [-5,3]
  in 2010-07          59   (0%)   [-3,27]            190   (0%)   [-5,3]
  in 2010-08         117   (0%)   [-2,25]            146   (0%)   [-4,4]
  in 2010-09         135   (0%)   [-1,22]            130   (0%)   [-3,4]
  in 2010-10         139   (0%)   [-4,26]             94   (0%)   [-3,2]
  in 2010-11         173   (0%)   [-4,23]             62   (0%)   [-3,2]
  in 2010-12         185   (0%)   [-4,27]             56   (0%)   [0,2]
  in 2011-01          27   (0%)   [-4,17]             14   (0%)   [0,1]
  TOTAL:             835   (0%)   [-4,27]           6893   (2%)   [-6,10]

jhardin, perhaps remove 1997-2006?

net-bb-trec_enron Spam messages    Score range    Ham messages     Score range
  in 2001              0                           27880  (11%)   [-4,17]
  in 2002              0                           10681   (4%)   [-3,11]
  in 2008              0                             285   (0%)   [0,4]
  TOTAL:               0                           38846  (16%)   [-4,17]

Remove all.  This is throwing off some of the non-reuse DNSBL tests.

net-bb-jm        Spam messages    Score range    Ham messages     Score range
  in 2006              0                           30130  (12%)   [-14,6]
  in 2007              0                           39088  (16%)   [-15,7]
  in 2008              0                           24619  (10%)   [-15,8]
  in 2009              0                            5322   (2%)   [-14,7]

JM, remove 2006?

net-bb-zmi       Spam messages    Score range    Ham messages     Score range
  in 2008              0                               1   (0%)   [2,2]
  TOTAL:               0                               1   (0%)   [2,2]

Remove.
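
For a rough sense of how much ham this would drop, here is a minimal
sketch in Python that sums the ham counts from the tables above.  The
names (proposed_removals, removal_totals) are only illustrative, and it
assumes the jhardin 1997-2006 and jm 2006 removals go ahead, which are
still open questions.

```python
# Ham counts copied from the tables above.  The grouping reflects the
# proposal as written; the jhardin and jm entries are still questions,
# not decisions.
proposed_removals = {
    "net-bb-doc (all)": [1],
    "net-bb-fredt (all)": [108, 459],
    "net-bb-jhardin (1997-2006)": [7, 18, 166, 158, 262, 153, 56, 39, 564, 785],
    "net-bb-trec_enron (all)": [27880, 10681, 285],
    "net-bb-jm (2006)": [30130],
    "net-bb-zmi (all)": [1],
}


def removal_totals(removals):
    """Return (per-corpus ham totals, overall ham total) for the removals."""
    per_corpus = {name: sum(counts) for name, counts in removals.items()}
    return per_corpus, sum(per_corpus.values())


per_corpus, total = removal_totals(proposed_removals)
for name, count in per_corpus.items():
    print(f"{name:30s} {count:6d}")
print(f"{'TOTAL ham removed':30s} {total:6d}")
```

Under those assumptions the total comes to 71753 ham messages, most of
it from trec_enron (38846) and jm 2006 (30130).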

Warren
