>-----Original Message-----
>From: Jeff Chan [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, May 11, 2005 3:59 AM
>To: Spamassassin Devel List
>Subject: Re: ws.surbl.org scores before and after Chris Santerre data gone?
>
>
>On Tuesday, May 10, 2005, 1:34:27 PM, Theo Dinter wrote:
>>> > On Mon, May 02, 2005 at 03:16:11PM -0700, Jeff Chan wrote:
>>> >> If so can you provide a before and after ham/spam summary as of
>>> >> say a week ago and now-ish?
>>>
>>> > For SpamAssassin, our last weekly run (does net checks) failed due to a code
>>> > issue. Oops. In theory, the next weekly run (on Saturdays) should occur and
>>> > we can compare it to the results from 2 weeks ago.
>
>> Ok, the two run sizes are a bit different, but you can get general stats from
>> this I think:
>
>> Previous run (4/23):
>
>> OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
>>   182168   109007    73161  0.598   0.00   0.00  (all messages)
>>   12.750  21.3041   0.0055  1.000   0.99   0.00  URIBL_SC_SURBL
>>   36.463  60.9135   0.0328  0.999   0.98   0.00  URIBL_JP_SURBL
>>    9.809  16.3843   0.0109  0.999   0.97   0.00  URIBL_AB_SURBL
>>   36.982  61.7355   0.1011  0.998   0.89   0.00  URIBL_WS_SURBL
>>   38.506  64.2683   0.1203  0.998   0.87   0.00  URIBL_OB_SURBL
>>    0.211   0.3532   0.0000  1.000   0.66   0.00  URIBL_PH_SURBL
>
>> Latest run (5/8):
>
>> OVERALL%    SPAM%     HAM%    S/O   RANK  SCORE  NAME
>>   339239   240537    98702  0.709   0.00   0.00  (all messages)
>>  100.000  70.9049  29.0951  0.709   0.00   0.00  (all messages as %)
>>   13.111  18.4895   0.0020  1.000   0.98   0.00  URIBL_SC_SURBL
>>   37.333  52.6451   0.0172  1.000   0.98   0.00  URIBL_JP_SURBL
>>    8.836  12.4600   0.0041  1.000   0.97   0.00  URIBL_AB_SURBL
>>   38.140  53.7672   0.0567  0.999   0.91   0.00  URIBL_OB_SURBL
>>   40.770  57.4652   0.0841  0.999   0.87   0.00  URIBL_WS_SURBL
>>    0.215   0.3035   0.0000  1.000   0.61   0.00  URIBL_PH_SURBL
>
>Thanks. The differing corpora sizes makes it difficult to
>compare however. For example the 5/8 spam count is more than
>double, but the ham count is like 35% more. Therefore the
>percentages are not directly comparable.
>
>Assuming the percentages in the SPAM and HAM columns represent
>percentages of hits within those columns, then here are the
>HAM percentages multiplied by the ham count at the top of the
>column for the number of ham hits (counts) per list:
>
>4/23
>
>NAME            ham hits?
>URIBL_SC_SURBL  4
>URIBL_JP_SURBL  24
>URIBL_AB_SURBL  8
>URIBL_WS_SURBL  74
>URIBL_OB_SURBL  88
>URIBL_PH_SURBL  0
>
>
>5/8
>
>NAME            ham hits?
>URIBL_SC_SURBL  2
>URIBL_JP_SURBL  17
>URIBL_AB_SURBL  4
>URIBL_WS_SURBL  83
>URIBL_OB_SURBL  56
>URIBL_PH_SURBL  0
>
>(If my assumption is wrong, please let me know how to correct
>it.)
>
>On a 35% larger ham corpus, WS hit 83 hams versus 74 before.
>In a sense that's a step in the wrong direction, but the
>differing ham corpora make conclusions difficult.
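
For anyone who wants to recheck the ham-hit arithmetic quoted above, here is a minimal sketch. It rests on the same assumption Jeff states, namely that each HAM% value is a percentage of the ham corpus for that run. The names RUNS and ham_hits are made up for this illustration and are not part of SpamAssassin or its mass-check tools. The figures are copied from the two quoted runs.

```python
# Ham corpus sizes and HAM% values copied from the two quoted runs.
# Assumption: HAM% is the percentage of ham messages hit by each rule.
RUNS = {
    "4/23": {
        "ham_total": 73161,
        "ham_pct": {
            "URIBL_SC_SURBL": 0.0055,
            "URIBL_JP_SURBL": 0.0328,
            "URIBL_AB_SURBL": 0.0109,
            "URIBL_WS_SURBL": 0.1011,
            "URIBL_OB_SURBL": 0.1203,
            "URIBL_PH_SURBL": 0.0000,
        },
    },
    "5/8": {
        "ham_total": 98702,
        "ham_pct": {
            "URIBL_SC_SURBL": 0.0020,
            "URIBL_JP_SURBL": 0.0172,
            "URIBL_AB_SURBL": 0.0041,
            "URIBL_WS_SURBL": 0.0841,
            "URIBL_OB_SURBL": 0.0567,
            "URIBL_PH_SURBL": 0.0000,
        },
    },
}


def ham_hits(ham_total, ham_pct):
    """Estimate ham hit counts per rule from a ham total and HAM% values."""
    return {name: round(ham_total * pct / 100) for name, pct in ham_pct.items()}


if __name__ == "__main__":
    for label, run in RUNS.items():
        print(label)
        for name, hits in ham_hits(run["ham_total"], run["ham_pct"]).items():
            print(f"  {name:<16} {hits}")
```

Rounding to the nearest whole message reproduces the per-list counts in the two tables above. The counts are only estimates, since the published percentages are themselves rounded.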
Remove my submissions and FPs go up? :)

Are the hams confirmed? We ran a check on black.uribl.com and found a poor FP rate. Come to find out, they were all spams that didn't score high enough. That turned the numbers right where they should have been.

Jeff, maybe you should temp disable other people's submissions, rerun the test, and see where these FPs are coming from? It's the only way I can think to find them.

--Chris
