<p>On Thu, 05 May 2011 14:14:30 +0200, Ed van der Salm wrote:</p>
<blockquote>Since I'm not behind the machine (it's at home), for now only the info I know.</blockquote>
<p>Okay.</p>
<blockquote>I installed a virtual machine which I restored after each run.</blockquote>
<p>'each run' = after 5 batches + the final run</p>
<p>OR</p>
<p>'each run' = after switching tokenizer?</p>
<blockquote>I left as many settings as possible at default, so the learning method was TEFT (that's default, right?)</blockquote>
<p>Yes. TEFT is the default. TEFT is well suited for the dull tokenizers like WORD and CHAIN (the default tokenizer). When using one of the more intelligent tokenizers (OSB/SBPH), TOE is the better choice (TUM would work too).</p>
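<p>For reference, both knobs live in dspam.conf (the training mode can also be overridden per user via the preferences). A minimal excerpt along these lines, as a sketch against 3.9.x; check the comments in your own dspam.conf:</p>
<pre>
# dspam.conf (excerpt)

# TrainingMode: teft, toe, tum or notrain
TrainingMode toe

# Tokenizer: word, chain, osb or sbph
Tokenizer osb
</pre>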
<blockquote>and all other settings were untouched (apart from using a MySQL db). Since I wanted to check the dspam tokenizers, I left out all other stuff like DNSBL and AV.</blockquote>
<p>Okay.</p>
<blockquote>I have not really checked (myself, I mean) all the spams/hams; they came from a file which could be used for MailScanner training. Looking at the web interface, it looks like they are all English. It also looked like they were mail from mailing lists etcetera. If someone has a nice batch of 'real' mail to throw at it, just send me the zip... :)</blockquote>
<p>Then addressing this message to the DSPAM mailing list would be helpful. I think you just wrote the message to me by mistake.</p>
<blockquote>(It seems like my webmail client doesn't allow 'nice' inserts, so I've put my comments between ------------------------)</blockquote>
<p>Okay.</p>
<blockquote>On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:</blockquote>
<blockquote>On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:<br /><br />
> Hi all,<br />
><br />
> (Maybe a monospaced font makes it more readable)<br />
> There seem to be some questions about which tokenizer to use (for me too), so I thought it would be nice to have some statistics.<br />
><br />
> First, about the setup I've chosen:<br />
> It's a clean install with no training done at all. I've made 6 directories containing spam and 6 containing ham. I thought I'd read somewhere to train 2 ham against 1 spam, so the number of files in those directories is:<br />
> ham-01: 500<br />
> ham-02: 1000<br />
> ham-03: 1000<br />
> ham-04: 1000<br />
> ham-05: 500<br />
> ham-final: 200<br />
> spam-01: 250<br />
> spam-02: 500<br />
> spam-03: 500<br />
> spam-04: 500<br />
> spam-05: 250<br />
> spam-final: 100<br />
> Totaling: 6300 messages, 2100 spam and 4200 ham.<br />
><br />
> Some other info: Algorithm graham burton, and a MySQL database as backend. There were only 55 'recent' spam messages; they came from my gmail account's spam box. All other mails were training mails found somewhere on the internet, dating from 2003, 2004 and 2005. In the final batch there were 10 of the recent spams; the other 45 were spread over the other batches.<br />
> This was all done on a KVM virtual machine, 1 CPU and 1 GB mem. There were no other VMs running.<br />
> After that I trained using word, chain, osb and sbph. I hope this gives me the insight I want.<br />
><br />
> So, now for the real deal:<br />
<pre>
> Tokenizer    batch:   01    02    03    04    05  final    total  tokens@db
> word:   FP:            0     0     2     2     0      1        5     205234
>         FN:          100    94    31    28    26      3      282
>         Time (sec):   37    58    63    70    34     14      276
> chain:  FP:            0     0     3     2     0      1        6     825549
>         FN:           77    59    10    10    14      3      173
>         Time (sec):   46    79    90   111    46     27      399
> osb:    FP:            1     1     3     3     0      0        8    2741757
>         FN:           74    73    18    11    13      4      193
>         Time (sec):   80   126   218   469   397    142     1432
> sbph:   FP:            1     1     2     6     0      0       10   13904366
>         FN:           65    60    10     6    10      3      154
>         Time (sec):  544  3272  6843  8936  3532   1348   6h47m55s
</pre>
> Using osb my database grew to 299 MB. Using sbph my database grew to 741 MB. The last column shows the number of tokens produced.<br />
<br />
Some questions:<br />
1) What learning method have you used? TEFT? TOE? TUM?<br />
2) Are those batches cumulative, or have you wiped the data after each training batch?<br />
3) What preferences have you in place?<br />
4) Have you done anything in between the training batches? (stuff like using dspam_clean etc.)<br />
5) Have you used DNSBL within DSPAM?<br />
6) Are the SPAM and HAM messages in the same language?<br />
<br />
------------------------<br />
I think I answered these questions in my intro above; if you miss something, tell me, and I will look into that. But that will be sometime late tonight.<br />
(So, for 6: I haven't really looked...)<br />
------------------------</blockquote>
<p>Yes. You have done that in the above response.</p>
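<p>A note on the tokens@db column above: the growth from roughly 205K (word) to 13.9M (sbph) tokens follows directly from how many features each tokenizer emits per word. word emits one feature per word, chain adds adjacent word pairs, osb emits sparse word pairs within a small sliding window, and sbph emits every on/off pattern over that window. A rough Python sketch of the general schemes (illustrative only; DSPAM's real implementation differs in details such as hashing, header tokens and window handling):</p>
<pre>
WINDOW = 5  # sliding-window size used by OSB/SBPH-style tokenizers

def word_tokens(words):
    # word: one feature per word
    return list(words)

def chain_tokens(words):
    # chain: single words plus adjacent word pairs
    return list(words) + [f"{a} {b}" for a, b in zip(words, words[1:])]

def osb_tokens(words):
    # osb: pair each word with each of the previous WINDOW-1 words,
    # recording the gap ("orthogonal sparse bigrams")
    out = []
    for i, w in enumerate(words):
        for d in range(1, WINDOW):
            if i - d >= 0:
                out.append(f"{words[i - d]} (skip {d - 1}) {w}")
    return out

def sbph_tokens(words):
    # sbph: every on/off pattern over the WINDOW-1 preceding slots
    # (skipped slots become a placeholder), always keeping the
    # current word, so up to 2**(WINDOW-1) features per word
    out = []
    for i, w in enumerate(words):
        prev = words[max(0, i - WINDOW + 1):i]
        for mask in range(2 ** len(prev)):
            parts = [prev[j] if mask >> j & 1 else "(skip)"
                     for j in range(len(prev))]
            out.append(" ".join(parts + [w]))
    return out

msg = "cheap watches buy now limited offer".split()
for fn in (word_tokens, chain_tokens, osb_tokens, sbph_tokens):
    print(fn.__name__, len(fn(msg)))
# word_tokens 6, chain_tokens 11, osb_tokens 14, sbph_tokens 47
</pre>
<p>Per word that is asymptotically 1, 2, 4 and 16 features. The on-disk ratios in the table are even steeper because the higher-order features rarely repeat across messages, so almost every one of them becomes a new database row.</p>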
<blockquote>> What is this all telling me...<br />
><br />
> That I'm a little disappointed. osb gave me more FPs and FNs than the chain tokenizer did.<br />
><br />
Usually OSB/SBPH result in less training (FP/FN) in the long run, while WORD will constantly require training and CHAIN is somewhere in between WORD and OSB/SBPH.<br />
<br />
------------------------<br />
I would agree with that.<br />
The reason for splitting all messages into batches was to see how the ongoing training would change the results. And I think the figures say the same as you. Just as I expected, btw (but not as big a difference as I expected).<br />
------------------------<br />
<br />
> Luckily in the final batch there were no FPs. That's the one thing people can't live with. This also means you will probably need around 7500 ham messages and 3500+ (recent) spam messages to get a proper training. The best training will be (duh) using real mail.<br />
><br />
I would not underwrite that statement. You cannot conclude from your test that in general one needs 7.5K messages to get a decent result. It all depends on what you train and how you train.<br />
<br />
------------------------<br />
OK, true... As far as I could see, 4.2K ham and 2.1K spam just wasn't enough. And since I saw better results after more training, I thought let's just shout some numbers... If somebody wants all the messages I used, I can zip them and post them somewhere.<br />
------------------------<br />
<br />
> What more have I learned: If you are using the preferred tokenizer (osb, if I followed the mailing list right), the time needed to process a message increases quite a bit. Looking at the final batch there is an increase from chain to osb of 115 seconds, from 27 to 142.<br />
><br />
This strongly depends on the used training method. I think you used TEFT on both algorithms. Right?<br />
------------------------<br />
Yep, I only used TEFT (that's the default, right?). Oh, one thing: dspam 3.9.1rc1 :)<br />
------------------------</blockquote>
<p>Yes. TEFT is the default.</p>
<blockquote>> The biggest wait was using sbph; luckily I went out to jam with the band ;). Stupid as I am, I didn't automatically start the final batch, so at this moment I am waiting again.<br />
><br />
> My personal experience training with this mail batch (I've got another 3K+ training messages) is not too good :(. Using sbph my db filled up my disk. I did not expect it to grow that big. So I started using osb, but then I got too many FPs, and as stated before: a lot of unhappy faces.<br />
><br />
> Well, in the final batch there are only two with no FPs: osb and sbph. sbph is 'the best' but not really suited for busy systems. Or you just buy more power... Maybe I will just train a lot more and return to chain. One thing I noticed: the average time per message in the final batch.<br />
><br />
SBPH is usually best done using the Hash storage driver. Using an RDBMS for SBPH is sub-optimal.<br />
<br />
------------------------<br />
I kind of figured that out. But is the hash storage driver always faster? I used a DB because I wanted to see how many tokens would be created. So that was mainly for more statistics. If the hash storage driver is always faster, then a DB is not really useful for standalone servers, I suppose.<br />
------------------------</blockquote>
<p>Hmm... technically the Hash driver is more or less nothing other than a memory-mapped file. An RDBMS is way more than just that. So you can make your own conclusion as to which one should technically be faster.</p>
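<p>To make that concrete: a hash-driver-style lookup boils down to "compute a slot, read a few bytes from a memory-mapped file". Below is a small, self-contained Python sketch of that idea. The record layout is invented for illustration (a token hash plus two hit counters); it is not DSPAM's actual css file format, but the principle is the same:</p>
<pre>
import mmap
import struct
import tempfile

# Invented fixed-width record: 64-bit token hash, 32-bit spam
# hits, 32-bit innocent hits (native byte order).
REC = struct.Struct("=QII")
SLOTS = 1024  # table capacity for this toy example

def store(mm, token_hash, spam, innocent):
    """Linear-probe an open-addressed table and write a record."""
    slot = token_hash % SLOTS
    for probe in range(SLOTS):
        off = ((slot + probe) % SLOTS) * REC.size
        h, _, _ = REC.unpack_from(mm, off)
        if h in (0, token_hash):  # empty slot, or updating same token
            REC.pack_into(mm, off, token_hash, spam, innocent)
            return
    raise RuntimeError("table full")

def lookup(mm, token_hash):
    """A hit costs a couple of memory reads served from the page
    cache: no SQL parsing, no locking protocol, no socket round
    trip to a database server."""
    slot = token_hash % SLOTS
    for probe in range(SLOTS):
        off = ((slot + probe) % SLOTS) * REC.size
        h, spam, innocent = REC.unpack_from(mm, off)
        if h == token_hash:
            return spam, innocent
        if h == 0:
            return None  # hit an empty slot: token is unknown
    return None

with tempfile.TemporaryFile() as f:
    f.truncate(SLOTS * REC.size)  # zero-filled table on disk
    mm = mmap.mmap(f.fileno(), SLOTS * REC.size)
    store(mm, 0xDEADBEEF, spam=42, innocent=3)
    print(lookup(mm, 0xDEADBEEF))  # prints (42, 3)
</pre>
<p>That said, "way more than just that" cuts both ways: the RDBMS also gives you concurrent access, crash safety and ad-hoc queries (you used the DB for exactly that: counting tokens), so "faster" is not the only axis to decide on.</p>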
<blockquote>> word:  0.04667 seconds<br />
> chain: 0.09 seconds<br />
> osb:   0.47333 seconds<br />
> sbph:  4.49333 seconds<br />
><br />
> Hope it helps someone.<br />
><br />
> Greetings,<br />
><br />
> Ed van der Salm<br />
><br />
> The Netherlands<br />
> Amstelveen<br />
<br />
-- <br />
Kind Regards from Switzerland,<br />
<br />
Stevan Bajić<br />
<br />
------------------------<br />
Greetings!<br />
@<br />
------------------------</blockquote>
<p>(Those averages are the final-batch times divided by its 300 messages: e.g. 1348 / 300 ≈ 4.49 seconds for sbph.)</p>
<pre>-- 
Kind Regards from Switzerland,

Stevan Bajić</pre>