Hi,

The files I used: http://www.vander-salm.nl/downloads/retrain-all.tar.bz2
I added my last dspam.conf too. The logfile is filled with the results of the last run.
I have run the osb tokenizer using TOE, and the results of the first run were
identical to the earlier TEFT run, even the number of tokens in the database.

First run:
Token / part:  01  02  03  04  05  final  total  tokens@db
osb:      FP:   1   1   3   3   0      0      8    2741757
          FN:  74  73  18  11  13      4    193

After making the above file I retrained using the same script, and this time
the results were different.

Second run (osb, not clearing the database):
Token / part:  01  02  03  04  05  final  total  tokens@db
osb:      FP:   1   0   0   1   0      0      2    2741981
          FN:   2   4   2   1   1      1     11

This is better, but there is still one FN in the final 100 spams.

Third run (osb, not clearing the database):
Token / part:  01  02  03  04  05  final  total  tokens@db
osb:      FP:   1   0   0   1   0      0      2    2742036
          FN:   1   2   0   0   1      0      4

OK, 6 misses out of 6300 mails. Makes me wonder which mails are so
problematic. And yes, the final batch had no misses. I will upload the last
logfile: http://www.vander-salm.nl/downloads/logfile.txt. The time needed for
the last run was about 30% less than for the second run.

More tomorrow! I think...

Greetings,

Ed.

On Thursday, 05-05-2011 at 15:32, Stevan Bajić wrote:
> On Thu, 05 May 2011 15:14:56 +0200, Ed van der Salm wrote:
>
> > Since it really looked like a mess, a repost, with the extra info added:
> >
> >> I installed a virtual machine which I restored after each run.
> >> 'each run' = after 5 batches + the final run
> >> OR
> >> 'each run' = after switching tokenizer?
> >
> > I ran all 6300 messages through the training, then I restored the
> > machine to use another tokenizer and did it again. So the start was
> > always a clean machine.
>
> Thanks for the clarification.
>
> >> TEFT is well suited for the dull tokenizers like WORD and CHAIN
> >> (default). When using one of the more intelligent tokenizers
> >> (aka: OSB/SBPH) then using TOE is better (TUM would work too).
> >
> > Should I do the training using TOE?
>
> For OSB and SBPH you could try TOE. That should deliver better results
> than TEFT in the long run.
>
> > Ah well, if I am home in time, I will change to TOE and rerun them all.
> > More numbers are always good!
>
> If you want, you could send me the training data and I will do the tests
> with my own training method and then post the results.
>
> One thing you could do as well: after you have done the whole training,
> run a classification pass over all 5 sets, including the final set, and
> record how many FP/FN you get. This would show how well the training
> performs when classifying the same message set AFTER the whole training.
>
> Another test you could do: do the same training as now and, after you are
> finished, switch the two classes. So AFTER the whole training you declare
> every SPAM message to be HAM and every HAM message to be SPAM. Then you do
> the training again and look at how quickly the tokenizer is able to switch
> the tokens the other way around. You run that learning until ALL messages
> are correctly classified (aka: 0 FP and 0 FN). A good learning algorithm
> will not need much training to switch, while a bad algorithm will need a
> lot. This kind of test is where TOE shines (compared to TEFT). TOE will
> need much less training, while some messages trained with TEFT will need
> an insane amount of re-training before they switch their class.
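A quick note on that switch test: I will give it a try next. Roughly what I
have in mind is the loop below. This is only a sketch, not something I have
run yet: the dspam options are from memory (check dspam(1)), the user "ed"
and the directory names are just my local setup, and the grep on the
--deliver=summary output is an assumption about its format.

  #!/bin/sh
  # Class-switch test: old spam is now declared innocent, old ham is
  # declared spam. Keep retraining whatever still comes back with the old
  # class until nothing is left (0 FP / 0 FN on the switched classes).
  USER=ed
  pass=1
  while :; do
      wrong=0
      for f in spam-0[1-5]/* spam-final/*; do
          # these should now classify as Innocent
          if dspam --user "$USER" --classify --deliver=summary < "$f" | grep -q '"Spam"'; then
              dspam --user "$USER" --class=innocent --source=error --stdout < "$f" > /dev/null
              wrong=$((wrong + 1))
          fi
      done
      for f in ham-0[1-5]/* ham-final/*; do
          # and these should now classify as Spam
          if dspam --user "$USER" --classify --deliver=summary < "$f" | grep -q '"Innocent"'; then
              dspam --user "$USER" --class=spam --source=error --stdout < "$f" > /dev/null
              wrong=$((wrong + 1))
          fi
      done
      echo "pass $pass: $wrong messages retrained"
      [ "$wrong" -eq 0 ] && break
      pass=$((pass + 1))
  done

The number of passes (and the number of retrains per pass) should then say
something about how quickly each tokenizer lets go of its old tokens.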
> >
> > Greetings
> >
> > @
>
> --
> Kind Regards from Switzerland,
>
> Stevan Bajić
>
> > --- Original message follows ---
> > SUBJECT: Re: [Dspam-user] Some tokenizer statistics
> > FROM: Ed van der Salm
> > TO: "Stevan Bajić"
> > DATE: 05-05-2011 14:14
> >
> > Since I'm not behind the machine (it's at home), for now only the info
> > I know.
> >
> > I installed a virtual machine which I restored after each run. I left
> > as many settings as possible at default, so the learning method was
> > TEFT (that's the default, right?) and all other settings were untouched
> > (apart from using a MySQL db). Since I wanted to check the dspam
> > tokenizers, I left out all other stuff like DNSBL and AV.
> > I have not really checked all the spams/hams myself; they came from a
> > file which could be used for MailScanner training. Looking at the web
> > interface, it looks like they are all English. It also looked like they
> > were mail from mailing lists etcetera. If someone has a nice batch of
> > 'real' mail to throw at it, just send me the zip... :)
> >
> > (It seems like my webmail client doesn't allow 'nice' inserts, so I've
> > put my comments between ------------------------)
> >
> > On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:
> >
> >> On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
> >>
> >>> Hi all,
> >>>
> >>> (Maybe a monospaced font makes it more readable)
> >>> There seem to be some questions about which tokenizer to use (for me
> >>> too), so I thought it would be nice to have some statistics.
> >>>
> >>> First about the setup I've chosen:
> >>> It's a clean install with no training done at all. I've made 6
> >>> directories containing spam and 6 containing ham. I thought I'd read
> >>> somewhere to train 2 ham against 1 spam, so in those directories the
> >>> number of files is:
> >>> ham-01: 500
> >>> ham-02: 1000
> >>> ham-03: 1000
> >>> ham-04: 1000
> >>> ham-05: 500
> >>> ham-final: 200
> >>> spam-01: 250
> >>> spam-02: 500
> >>> spam-03: 500
> >>> spam-04: 500
> >>> spam-05: 250
> >>> spam-final: 100
> >>> Totaling: 6300 messages, 2100 spam and 4200 ham.
> >>>
> >>> Some other info: algorithm graham burton, and a MySQL database as
> >>> backend. There were only 55 'recent' spam messages; they came from my
> >>> gmail-account spambox. All other mails were training mails found
> >>> somewhere on the internet, dating from 2003, 2004 and 2005. In the
> >>> final batch there were 10 of the recent spams; the other 45 were
> >>> spread over the other batches.
> >>> This was all done on a KVM virtual machine, 1 CPU and 1 GB mem. There
> >>> were no other VMs running.
> >>> After that I trained using word, chain, osb and sbph. I hope this
> >>> gives me the insight I want.
> >>>
> >>> So, now for the real deal:
> >>>
> >>> Token / batch:   01    02    03    04    05  final  total  tokens@db
> >>> word:     FP:     0     0     2     2     0      1      5     205234
> >>>           FN:   100    94    31    28    26      3    282
> >>>     Time sec:    37    58    63    70    34     14    276 sec
> >>>
> >>> chain:    FP:     0     0     3     2     0      1      6     825549
> >>>           FN:    77    59    10    10    14      3    173
> >>>         Time:    46    79    90   111    46     27    399 sec
> >>>
> >>> osb:      FP:     1     1     3     3     0      0      8    2741757
> >>>           FN:    74    73    18    11    13      4    193
> >>>         Time:    80   126   218   469   397    142   1432 sec
> >>>
> >>> sbph:     FP:     1     1     2     6     0      0     10   13904366
> >>>           FN:    65    60    10     6    10      3    154
> >>>         Time:   544  3272  6843  8936  3532   1348   6h47m55s
> >>>
> >>> Using osb my database grew to 299 MB. Using sbph my database grew to
> >>> 741 MB.
> >>> The last column shows the number of tokens produced.
> >>>
> >> Some questions:
> >> 1) What learning method have you used? TEFT? TOE? TUM?
> >> 2) Are those batches cumulative or have you wiped the data after each
> >> training batch?
> >> 3) What preferences do you have in place?
> >> 4) Have you done anything in between the training batches? (stuff like
> >> using dspam_clean etc.)
> >> 5) Have you used DNSBL within DSPAM?
> >> 6) Are the SPAM and HAM messages in the same language?
> >>
> >> ------------------------
> >> I think I answered these questions in my intro above; if you miss
> >> something, tell me, then I will look into that. But that will be
> >> sometime late tonight.
> >> (So, for 6: I haven't really looked...)
> >> ------------------------
> >>
> >>> What is this all telling me...
> >>>
> >>> That I'm a little disappointed. osb gave me more FP and FN than the
> >>> chain tokenizer did.
> >>>
> >> Usually OSB/SBPH result in less training (FP/FN) in the long run,
> >> while WORD will constantly require training and CHAIN is somewhere in
> >> between WORD and OSB/SBPH.
> >>
> >> ------------------------
> >> I would agree with that.
> >> The reason for splitting all messages into batches was to see how the
> >> ongoing training would change the results. And I think the figures say
> >> the same as you do. Just like I expected, btw (but not as big a
> >> difference as I expected).
> >> ------------------------
> >>
> >>> Luckily in the final batch there were no FPs. That's the one thing
> >>> people can't live with. This also means you will probably need around
> >>> 7500 ham messages and 3500+ (recent) spam messages to get a proper
> >>> training. The best training will be (duh) using real mail.
> >>>
> >> I would not subscribe to that statement. You cannot conclude from your
> >> test that in general one needs 7.5K messages to get a decent result.
> >> It all depends on what you train and how you train.
> >>
> >> ------------------------
> >> OK, true... As far as I could see, 4.2K ham and 2.1K spam just wasn't
> >> enough. And since I saw better results after more training, I thought
> >> let's just shout some numbers... If somebody wants all the messages I
> >> used, I can zip them and post them somewhere.
> >> ------------------------
> >>
> >>> What more have I learned: if you are using the preferred tokenizer
> >>> (osb, if I followed the mailing list right), the time needed to
> >>> process a message increases quite a lot. Looking at the final batch
> >>> there is an increase from chain to osb of 115 seconds, 27 to 142.
> >>>
> >> This strongly depends on the training method used. I think you used
> >> TEFT in both cases. Right?
> >> ------------------------
> >> Yep, I only used TEFT (that's the default, right?). Oh, one thing:
> >> dspam 3.9.1rc1 :)
> >> ------------------------
> >>
> >>> The biggest wait was using sbph; luckily I went out to jam with the
> >>> band ;). Stupid as I am, I didn't automatically start the final
> >>> batch, so at this moment I am waiting again.
> >>>
> >>> My personal experience training with this mail batch (I've got
> >>> another 3K+ training messages) is not too good :(. Using sbph my db
> >>> filled up my disk. I did not expect it to grow that big. So, I
> >>> started using osb, but then I got too many FPs, and as stated before:
> >>> a lot of unhappy faces.
> >>>
> >>> Well, in the final batch there are only two with no FPs, the osb and
> >>> sbph. sbph is 'the best', but not really suited for busy systems. Or
> >>> you just buy more power... Maybe I will just train a lot more and
> >>> return to chain. One thing I noticed: the average time per message in
> >>> the final batch.
> >>>
> >> SBPH is usually best done using the Hash storage driver. Using an
> >> RDBMS for SBPH is sub-optimal.
> >>
> >> ------------------------
> >> I kind of figured that out. But is the hash storage driver always
> >> faster? I used a DB because I wanted to see how many tokens would be
> >> created, so that was mainly for more statistics. If the hash storage
> >> driver is always faster, then a DB is not really useful for standalone
> >> servers, I suppose.
> >> ------------------------
> >>
> >>> word:  0.04667 seconds
> >>> chain: 0.09 seconds
> >>> osb:   0.47333 seconds
> >>> sbph:  4.49333 seconds
> >>>
> >>> Hope it helps someone.
> >>>
> >>> Greetings,
> >>>
> >>> Ed van der Salm
> >>>
> >>> The Netherlands
> >>> Amstelveen
> >>
> >> --
> >> Kind Regards from Switzerland,
> >>
> >> Stevan Bajić
> >>
> >> ------------------------
> >> Greetings!
> >> @
> >> ------------------------
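P.S. In case it helps anyone reading along: a stripped-down TOE-style pass
over one batch pair, counting FP/FN the way the tables above report them,
would look roughly like the sketch below. It is only a sketch, not the exact
script from the tarball; the user name, the directory names and the grep on
the --deliver=summary output are assumptions, and the dspam options are from
memory (check dspam(1)).

  #!/bin/sh
  # One TOE-style batch pass: classify first (to count FP/FN), then train
  # only the misclassified messages.
  USER=ed
  fp=0; fn=0

  for f in ham-01/*; do
      if dspam --user "$USER" --classify --deliver=summary < "$f" | grep -q '"Spam"'; then
          fp=$((fp + 1))
          dspam --user "$USER" --class=innocent --source=error --stdout < "$f" > /dev/null
      fi
  done
  for f in spam-01/*; do
      if dspam --user "$USER" --classify --deliver=summary < "$f" | grep -q '"Innocent"'; then
          fn=$((fn + 1))
          dspam --user "$USER" --class=spam --source=error --stdout < "$f" > /dev/null
      fi
  done

  echo "batch 01: FP=$fp FN=$fn"

Repeat that for batches 02..05 and the final batch, in order and without
clearing the database in between, and you end up with a table like the ones
above.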