On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
> Hi all,
>
> (Maybe a monospaced font makes this more readable.)
> There seem to be some questions about which tokenizer to use (for me
> too), so I thought it would be nice to have some statistics.
>
> First, about the setup I've chosen:
> It's a clean install with no training done at all. I've made 6
> directories containing spam and 6 containing ham. I think I read
> somewhere to train 2 ham against 1 spam, so the number of files in
> those directories is:
>
> ham-01:     500
> ham-02:    1000
> ham-03:    1000
> ham-04:    1000
> ham-05:     500
> ham-final:  200
> spam-01:    250
> spam-02:    500
> spam-03:    500
> spam-04:    500
> spam-05:    250
> spam-final: 100
>
> Totaling: 6300 messages, 2100 spam and 4200 ham.
>
> Some other info: algorithm "graham burton", and a MySQL database as
> backend. There were only 55 'recent' spam messages; they came from
> the spam box of my Gmail account. All the other mails were training
> mails found somewhere on the internet, dating from 2003, 2004 and
> 2005. In the final batch there were 10 of the recent spams; the
> other 45 were spread over the other batches.
> This was all done on a KVM virtual machine with 1 CPU and 1 GB of
> memory. No other VMs were running.
> After that I trained using the word, chain, osb and sbph tokenizers.
> I hope this gives me the insight I want.
>
> So, now for the real deal:
>
> Tokenizer / batch:  01     02     03     04     05  final  total  tokens@db
> word:  FP:           0      0      2      2      0      1      5     205234
>        FN:         100     94     31     28     26      3    282
>        Time (s):    37     58     63     70     34     14    276
>
> chain: FP:           0      0      3      2      0      1      6     825549
>        FN:          77     59     10     10     14      3    173
>        Time (s):    46     79     90    111     46     27    399
>
> osb:   FP:           1      1      3      3      0      0      8    2741757
>        FN:          74     73     18     11     13      4    193
>        Time (s):    80    126    218    469    397    142   1432
>
> sbph:  FP:           1      1      2      6      0      0     10   13904366
>        FN:          65     60     10      6     10      3    154
>        Time (s):   544   3272   6843   8936   3532   1348  24475 (6h47m55s)
>
> Using osb my database grew to 299 MB; using sbph it grew to 741 MB.
> The last column shows the number of tokens produced.

Some questions:
1) Which learning method did you use? TEFT? TOE? TUM?
2) Are those batches cumulative, or did you wipe the data after each
   training batch?
3) What preferences do you have in place?
4) Did you do anything in between the training batches (things like
   running dspam_clean etc.)?
5) Did you use DNSBL within DSPAM?
6) Are the spam and ham messages in the same language?
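To make questions 1 and 3 concrete, these are the knobs I mean. A
minimal sketch, assuming a DSPAM 3.9-style dspam.conf (directive names
from memory, so check the comments in your own dspam.conf):

  # Tokenizer and combination algorithm (what you already set):
  Algorithm graham burton
  Tokenizer osb                       # word | chain | osb | sbph

  # Default per-user preferences; trainingMode is what question 1
  # is about:
  Preference "trainingMode=teft"      # teft | toe | tum | notrain
  Preference "spamAction=quarantine"  # quarantine | tag | deliver

With TEFT every single message is trained into the database, which
inflates both the token count and the per-message time; with TOE only
misclassifications are trained, which would change your timing numbers
considerably.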
> What is this all telling me?
>
> That I'm a little disappointed. osb gave me more FPs and FNs than
> the chain tokenizer did.

Usually OSB/SBPH need less corrective training (fewer FPs/FNs) in the
long run, while WORD will constantly require training and CHAIN sits
somewhere between WORD and OSB/SBPH.

> Luckily, in the final batch there were no FPs. That's the one thing
> people can't live with. This also means you will probably need
> around 7500 ham messages and 3500+ (recent) spam messages to get
> proper training. The best training will be (duh) using real mail.

I would not subscribe to that statement. You cannot conclude from your
test that, in general, one needs 7.5K messages to get a decent result.
It all depends on what you train and how you train.

> What more have I learned: if you are using the preferred tokenizer
> (osb, if I followed the mailing list correctly), the time needed to
> process a message increases quite a lot. Looking at the final batch,
> there is an increase from chain to osb of 115 seconds, from 27 to
> 142.

This strongly depends on the training method used. I think you used
TEFT for both tokenizers. Right?

> The biggest wait was using sbph; luckily I went out to jam with the
> band ;). Stupid as I am, I didn't automatically start the final
> batch, so at this moment I am waiting again.
>
> My personal experience training with this mail batch (I've got
> another 3K+ training messages) is not too good :(. Using sbph my db
> filled up my disk; I did not expect it to grow that big. So I
> started using osb, but then I got too many FPs, and as stated
> before: a lot of unhappy faces.
>
> Well, in the final batch there are only two with no FPs: osb and
> sbph. sbph is 'the best', but not really suited for busy systems. Or
> you just buy more power... Maybe I will just train a lot more and
> return to chain. One thing I noticed: the average time per message
> in the final batch (300 messages, so e.g. 1348 s / 300 = 4.49333 s):

SBPH is usually best done using the Hash storage driver. Using an
RDBMS for SBPH is suboptimal.

> word:  0.04667 seconds
> chain: 0.09 seconds
> osb:   0.47333 seconds
> sbph:  4.49333 seconds
>
> Hope it helps someone.
>
> Greetings,
>
> Ed van der Salm
>
> The Netherlands
> Amstelveen

--
Kind Regards from Switzerland,
Stevan Bajić
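P.S. For anyone wondering why the token counts differ so drastically
between the tokenizers (205234 for word vs. 13904366 for sbph): below
is a rough Python sketch of the feature-generation idea behind OSB and
SBPH over a 5-token sliding window. It only illustrates the principle;
DSPAM's actual implementation (hashing, header tokens, exact token
formatting) differs, and the '#' skip marker is just my notation. WORD
emits one token per word and CHAIN adds adjacent two-word chains,
hence the lower counts for those two.

  from itertools import combinations

  WINDOW = 5  # current token plus up to 4 preceding tokens

  def osb_features(tokens):
      # Orthogonal Sparse Bigrams: pair each token with each of the
      # up-to-4 preceding tokens; '#' marks the skipped positions so
      # the distance stays part of the feature.
      feats = []
      for i, tok in enumerate(tokens):
          for d in range(1, WINDOW):
              if i - d >= 0:
                  feats.append(tokens[i - d] + " " + "# " * (d - 1) + tok)
      return feats

  def sbph_features(tokens):
      # Sparse Binary Polynomial Hashing: for every position, emit
      # every subset of the up-to-4 preceding positions combined with
      # the current token -- up to 2**4 = 16 features per position
      # instead of OSB's 4.
      feats = []
      for i, tok in enumerate(tokens):
          prev = list(range(max(0, i - WINDOW + 1), i))
          for r in range(len(prev) + 1):
              for combo in combinations(prev, r):
                  parts = [tokens[j] if j in combo else "#" for j in prev]
                  feats.append(" ".join(parts + [tok]))
      return feats

  words = "buy cheap meds online now".split()
  print(len(words), len(osb_features(words)), len(sbph_features(words)))
  # -> 5 10 31; on longer texts this approaches 4 (OSB) and 16 (SBPH)
  #    features per word, which is why the sbph database explodes.

The database stores unique tokens, so it does not grow by exactly
those factors, but the gap between your osb and sbph token counts
comes from this combinatorial difference.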