On Thu, 05 May 2011 14:14:30 +0200, Ed van der Salm wrote:

> Since I'm not behind the machine (it's at home), for now only the info
> I know.
>
Okay.

> I installed a virtual machine which I restored after each run.
>
'each run' = after the 5 batches + the final run, OR 'each run' = after
switching tokenizer?

> I left as many settings as possible at default, so the learning method
> was TEFT (that's the default, right?)
>
Yes, TEFT is the default. TEFT is well suited for the dull tokenizers
like WORD and CHAIN (the default). When using one of the more
intelligent tokenizers (i.e. OSB/SBPH), TOE is the better choice (TUM
would work too).

> and all other settings were untouched (apart from using a MySQL db).
> Since I wanted to check the DSPAM tokenizers, I left out all the other
> stuff like DNSBL and AV.
>
Okay.

> I have not really checked (myself, I mean) all the spams/hams; they
> came from a file which could be used for MailScanner training. Looking
> at the web interface, it looks like they are all English. It also
> looked like they were mail from mailing lists, etcetera. If someone
> has a nice batch of 'real' mail to throw at it, just send me the
> zip... :)
>
Then addressing this message to the DSPAM mailing list would be
helpful. I think you wrote the message just to me by mistake.

> (It seems like my webmail client doesn't allow 'nice' inserts, so I've
> put my comments between ------------------------)
>
Okay.
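(A side note on the settings discussed above: both the tokenizer and the
training mode are plain dspam.conf directives. A minimal sketch, with
example values only, taken from the 3.9.x sample configuration; adjust
to your own setup:)

    # dspam.conf (excerpt) -- example values only
    TrainingMode teft             # teft | toe | tum | notrain; TOE pairs well with osb/sbph
    Tokenizer    chain            # word | chain | osb | sbph
    Algorithm    graham burton    # the combination used in this test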
> On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:
>
>> On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
>>
>>> Hi all,
>>>
>>> (Maybe a monospaced font makes it more readable.)
>>> There seem to be some questions about which tokenizer to use (for me
>>> too), so I thought it would be nice to have some statistics.
>>>
>>> First, about the setup I've chosen:
>>> It's a clean install with no training done at all. I've made 6
>>> directories containing spam and 6 containing ham. I thought I'd read
>>> somewhere to train 2 ham against 1 spam, so in those directories the
>>> numbers of files are:
>>> ham-01:     500
>>> ham-02:    1000
>>> ham-03:    1000
>>> ham-04:    1000
>>> ham-05:     500
>>> ham-final:  200
>>> spam-01:    250
>>> spam-02:    500
>>> spam-03:    500
>>> spam-04:    500
>>> spam-05:    250
>>> spam-final: 100
>>> Totalling 6300 messages: 2100 spam and 4200 ham.
>>>
>>> Some other info: the algorithm is graham burton, with a MySQL
>>> database as backend. There were only 55 'recent' spam messages; they
>>> came from my Gmail account's spam box. All the other mails were
>>> training mails found somewhere on the internet, dating from 2003,
>>> 2004 and 2005. In the final batch there were 10 of the recent spams;
>>> the other 45 were spread over the other batches.
>>> This was all done on a KVM virtual machine with 1 CPU and 1 GB of
>>> memory. There were no other VMs running.
>>> After that I trained using word, chain, osb and sbph. I hope this
>>> gives me the insight I want.
>>>
>>> So, now for the real deal:
>>>
>>> Tokenizer / batch:   01    02    03    04    05  final    total  tokens@db
>>> word:  FP:            0     0     2     2     0      1        5     205234
>>>        FN:          100    94    31    28    26      3      282
>>>        Time (sec):   37    58    63    70    34     14      276 sec
>>>
>>> chain: FP:            0     0     3     2     0      1        6     825549
>>>        FN:           77    59    10    10    14      3      173
>>>        Time:         46    79    90   111    46     27      399 sec
>>>
>>> osb:   FP:            1     1     3     3     0      0        8    2741757
>>>        FN:           74    73    18    11    13      4      193
>>>        Time:         80   126   218   469   397    142     1432 sec
>>>
>>> sbph:  FP:            1     1     2     6     0      0       10   13904366
>>>        FN:           65    60    10     6    10      3      154
>>>        Time:        544  3272  6843  8936  3532   1348  6h47m55s
>>>
>>> Using osb my database grew to 299 MB. Using sbph my database grew to
>>> 741 MB. The last column shows the number of tokens produced.
>>>
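(An aside for anyone who wants to reproduce this kind of run: DSPAM
ships a corpus trainer, dspam_train, which takes a user name plus a
spam directory and a ham directory, in that order. A rough sketch of
how such a per-batch run could be scripted is below; the user name and
the use of `time` are purely illustrative, not what Ed actually ran.)

    #!/bin/sh
    # Hypothetical batch driver: feed each corpus batch to dspam_train
    # and time it. Directory names follow the ham-NN/spam-NN layout above.
    DSPAM_USER=testuser            # example user, not from the post
    for batch in 01 02 03 04 05 final; do
        echo "=== batch $batch ==="
        time dspam_train "$DSPAM_USER" "spam-$batch" "ham-$batch"
    done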
>> Some questions:
>> 1) What learning method have you used? TEFT? TOE? TUM?
>> 2) Are those batches cumulative or have you wiped the data after each
>> training batch?
>> 3) What preferences have you in place?
>> 4) Have you done anything in between the training batches? (stuff
>> like using dspam_clean etc.)
>> 5) Have you used DNSBL within DSPAM?
>> 6) Are the SPAM and HAM messages in the same language?
>>
>> ------------------------
>> I think I answered these questions in my intro above; if you miss
>> something, tell me, then I will look into that. But that will be
>> sometime late tonight.
>> (So, for 6: I haven't really looked...)
>> ------------------------
>>
Yes, you have done that in the response above.

>>> What is this all telling me...
>>>
>>> That I'm a little disappointed. osb gave me more FP and FN than the
>>> chain tokenizer did.
>>>
>> Usually OSB/SBPH result in less training (FP/FN) in the long run,
>> while WORD will constantly require training and CHAIN is somewhere in
>> between WORD and OSB/SBPH.
>>
>> ------------------------
>> I would agree with that.
>> The reason for splitting all the messages into batches was to see how
>> the ongoing training would change the results. And I think the
>> figures say the same as you do. Just like I expected, by the way (but
>> not as big a difference as I expected).
>> ------------------------
>>
>>> Luckily, in the final batch there were no FPs. That's the one thing
>>> people can't live with. This also means you will probably need
>>> around 7500 ham messages and 3500+ (recent) spam messages to get a
>>> proper training. The best training will be (duh) using real mail.
>>>
>> I would not subscribe to that statement. You cannot conclude from
>> your test that in general one needs 7.5K messages to get a decent
>> result. It all depends on what you train and how you train.
>>
>> ------------------------
>> OK, true... As far as I could see, 4.2K ham and 2.1K spam just wasn't
>> enough. And since I saw better results after more training, I thought
>> let's just shout some numbers... If somebody wants all the messages I
>> used, I can zip them and post them somewhere.
>> ------------------------
>>
>>> What more have I learned: if you are using the preferred tokenizer
>>> (osb, if I followed the mailing list right), the time needed to
>>> process a message increases quite a lot. Looking at the final batch
>>> there is an increase from chain to osb of 115 seconds, 27 to 142.
>>>
>> This strongly depends on the training method used. I think you used
>> TEFT on both algorithms. Right?
>>
>> ------------------------
>> Yep, I only used TEFT (that's the default, right?). Oh, one thing:
>> dspam 3.9.1rc1 :)
>> ------------------------
>>
Yes, TEFT is the default.

>>> The biggest wait was using sbph; luckily I went out to jam with the
>>> band ;). Stupid as I am, I didn't automatically start the final
>>> batch, so at this moment I am waiting again.
>>>
>>> My personal experience training with this mail batch (I've got
>>> another 3K+ training messages) is not too good :(. Using sbph my db
>>> filled up my disk. I did not expect it to grow that big. So I
>>> started using osb, but then I got too many FPs, and as stated
>>> before: a lot of unhappy faces.
>>>
>>> Well, in the final batch there are only two with no FPs: osb and
>>> sbph. sbph is 'the best' but not really suited for busy systems. Or
>>> you just buy more power... Maybe I will just train a lot more and
>>> return to chain. One thing I noticed: the average time per message
>>> in the final batch.
>>>
>> SBPH is usually best done using the Hash storage driver. Using an
>> RDBMS for SBPH is suboptimal.
>>
>> ------------------------
>> I kind of figured that out. But is the hash storage driver always
>> faster? I used a DB because I wanted to see how many tokens would be
>> created. So that was mainly for more statistics. If the hash storage
>> driver is always faster, then a DB is not really useful for
>> standalone servers, I suppose.
>> ------------------------
>>
Hmm... technically the Hash driver is more or less nothing other than a
memory-mapped file. An RDBMS is way more than just that. So you can
draw your own conclusion about which one should technically be faster.
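(If someone wants to try that: the switch to the hash backend is made
in dspam.conf. A minimal sketch; the library path is distro specific
and the Hash* values are only example figures from the sample
configuration, to be tuned to your corpus:)

    # dspam.conf (excerpt) -- mmap'ed hash backend instead of MySQL
    StorageDriver  /usr/lib/dspam/libhash_drv.so   # path varies per distro
    Tokenizer      sbph
    HashRecMax     98317     # initial record slots per hash file (example)
    HashAutoExtend on        # grow the file instead of failing when full
    HashExtentSize 49157     # size of each added extent (example)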
>>> word:  0.04667 seconds
>>> chain: 0.09    seconds
>>> osb:   0.47333 seconds
>>> sbph:  4.49333 seconds
>>> (i.e. the final-batch time divided by the 300 messages in that
>>> batch)
>>>
>>> Hope it helps someone.
>>>
>>> Greetings,
>>>
>>> Ed van der Salm
>>>
>>> The Netherlands
>>> Amstelveen
>>
>> --
>> Kind Regards from Switzerland,
>>
>> Stevan Bajić
>>
>> ------------------------
>> Greetings!
>> ------------------------

--
Kind Regards from Switzerland,

Stevan Bajić