On Thu, 05 May 2011 14:14:30 +0200, Ed van der Salm wrote:

> Since I'm not behind the machine (it's at home), for now only the info I know.

Okay.

> I installed a virtual machine which I restored after each run.

Does 'each run' mean after the 5 batches plus the final run, or after switching tokenizer?

> I left as many settings as possible at default, so the learning method was TEFT (that's the default, right?)

Yes, TEFT is the default. TEFT is well suited for the dull tokenizers like WORD and CHAIN (the default). When using one of the more intelligent tokenizers (OSB/SBPH), TOE is the better choice (TUM would work too).
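For anyone on the list who hasn't compared the modes: TEFT trains on every message, TOE only on misclassifications, and TUM trains everything until the tokens are mature and then falls back to training on errors. A minimal sketch of that decision logic (the should_train() helper is hypothetical, not DSPAM code):

    # Hypothetical sketch of the three training modes; not DSPAM code.
    def should_train(mode, predicted, actual, tokens_mature):
        if mode == "TEFT":                      # Train EveryThing
            return True
        if mode == "TOE":                       # Train On Error
            return predicted != actual
        if mode == "TUM":                       # Train Until Mature
            return (not tokens_mature) or (predicted != actual)
        return False

    print(should_train("TOE", predicted="innocent", actual="spam", tokens_mature=False))  # True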
> and all other settings were untouched (apart from using a MySQL db). Since I wanted to check the dspam tokenizers, I left out all the other stuff like DNSBL and AV.

Okay.

> I have not really checked (myself, I mean) all the spams/hams; they came from a file which could be used for MailScanner training. Looking at the web interface, it looks like they are all English. It also looked like they were mail from mailing lists et cetera. If someone has a nice batch of 'real' mail to throw at it, just send me the zip... :)

Then addressing this message to the DSPAM mailing list would be helpful. I think you sent the message to me alone by mistake.

> (It seems like my webmail client doesn't allow 'nice' inserts, so I've put my comments between ------------------------)

Okay.

> On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:
>> On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
>>
>>> Hi all,
>>>
>>> (Maybe a monospaced font makes it more readable.)
>>> There seem to be some questions about which tokenizer to use (for me
>>> too), so I thought it would be nice to have some statistics.
>>>
>>> First, about the setup I've chosen:
>>> It's a clean install with no training done at all. I've made 6
>>> directories containing spam and 6 containing ham. I thought I'd read
>>> somewhere to train 2 ham against 1 spam, so the number of files in
>>> those directories is:
>>> ham-01:     500
>>> ham-02:    1000
>>> ham-03:    1000
>>> ham-04:    1000
>>> ham-05:     500
>>> ham-final:  200
>>> spam-01:    250
>>> spam-02:    500
>>> spam-03:    500
>>> spam-04:    500
>>> spam-05:    250
>>> spam-final: 100
>>> Totaling: 6300 messages, 2100 spam and 4200 ham.
>>>
>>> Some other info: algorithm graham burton, and a MySQL database as
>>> backend. There were only 55 'recent' spam messages; they came from my
>>> gmail-account spam box. All other mails were training mails found
>>> somewhere on the internet, dating from 2003, 2004 and 2005. In the
>>> final batch there were 10 of the recent spams; the other 45 were
>>> spread over the other batches.
>>> This was all done on a KVM virtual machine, 1 CPU and 1 GB of memory.
>>> There were no other VMs running. After that I trained using word,
>>> chain, osb and sbph. I hope this gives me the insight I want.
>>>
>>> So, now for the real deal:
>>>
>>> Tokenizer / batch:   01    02    03    04    05  final  total  tokens@db
>>> word   FP:            0     0     2     2     0      1      5     205234
>>>        FN:          100    94    31    28    26      3    282
>>>        Time (s):     37    58    63    70    34     14    276
>>> chain  FP:            0     0     3     2     0      1      6     825549
>>>        FN:           77    59    10    10    14      3    173
>>>        Time (s):     46    79    90   111    46     27    399
>>> osb    FP:            1     1     3     3     0      0      8    2741757
>>>        FN:           74    73    18    11    13      4    193
>>>        Time (s):     80   126   218   469   397    142   1432
>>> sbph   FP:            1     1     2     6     0      0     10   13904366
>>>        FN:           65    60    10     6    10      3    154
>>>        Time (s):    544  3272  6843  8936  3532   1348   6h47m55s
>>>
>>> Using osb my database grew to 299 MB. Using sbph it grew to 741 MB.
>>> The last column shows the number of tokens produced.
>>>
>> Some questions:
>> 1) What learning method have you used? TEFT? TOE? TUM?
>> 2) Are those batches cumulative, or have you wiped the data after each
>>    training batch?
>> 3) What preferences do you have in place?
>> 4) Have you done anything in between the training batches (stuff like
>>    running dspam_clean etc.)?
>> 5) Have you used DNSBL within DSPAM?
>> 6) Are the SPAM and HAM messages in the same language?
>>
>> ------------------------
>> I think I answered these questions in my intro above; if you miss
>> something, tell me and I will look into that. But that will be sometime
>> late tonight.
>> (So, for 6: I haven't really looked...)
>> ------------------------
 <p><span style="font-family: arial;">Yes. You have done that in the 
 above response.</span></p>
 <p><span style="font-family: arial;"><br /></span></p>
>>> What is this all telling me...
>>>
>>> That I'm a little disappointed. osb gave me more FP and FN than the
>>> chain tokenizer did.
>>>
>> Usually OSB/SBPH result in less training (FP/FN) in the long run, while
>> WORD will constantly require training and CHAIN is somewhere in between
>> WORD and OSB/SBPH.
>>
>> ------------------------
>> I would agree with that. The reason for splitting all messages into
>> batches was to see how the ongoing training would change the results,
>> and I think the figures say the same as you. Just like I expected, btw
>> (but not as big a difference as I expected).
>> ------------------------
>>
>>> Luckily in the final batch there were no FPs. That's the one thing
>>> people can't live with. This also means you will probably need around
>>> 7500 ham messages and 3500+ (recent) spam messages to get proper
>>> training. The best training will be (duh) using real mail.
>>>
>> I would not subscribe to that statement. You cannot conclude from your
>> test that in general one needs 7.5K messages to get a decent result. It
>> all depends on what you train and how you train.
>>
>> ------------------------
>> OK, true... As far as I could see, 4.2K ham and 2.1K spam just wasn't
>> enough. And since I saw better results after more training, I thought
>> I'd just throw out some numbers... If somebody wants all the messages I
>> used, I can zip them and post them somewhere.
>> ------------------------
>>
>>> What more have I learned: if you are using the preferred tokenizer
>>> (osb, if I followed the mailing list right), the time needed to process
>>> a message increases quite a lot. Looking at the final batch there is an
>>> increase from chain to osb of 115 seconds, 27 to 142.
>>>
>> This strongly depends on the training method used. I think you used
>> TEFT on both algorithms. Right?
>>
>> ------------------------
>> Yep, I only used TEFT (that's the default, right?). Oh, one thing:
>> dspam 3.9.1rc1 :)
>> ------------------------
Yes. TEFT is the default.
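And regarding the time increase: under TEFT every message triggers a training pass, so all 6300 messages write their tokens to the database; under TOE only misclassified messages would. Using this run's error totals as a rough proxy (a TOE run would of course produce different error counts):

    # Rough comparison of how many messages trigger training writes,
    # using the FP + FN totals from the table above as a proxy.
    messages = 6300
    errors = {"word": 5 + 282, "chain": 6 + 173, "osb": 8 + 193, "sbph": 10 + 154}
    for tokenizer, err in errors.items():
        print(f"{tokenizer}: TEFT trains on {messages} messages, TOE on roughly {err}")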
>>> The biggest wait was using sbph; luckily I went out to jam with the
>>> band ;). Stupid as I am, I didn't automatically start the final batch,
>>> so at this moment I am waiting again.
>>>
>>> My personal experience training with this mail batch (I've got another
>>> 3K+ training messages) is not too good :(. Using sbph my db filled up
>>> my disk. I did not expect it to grow that big. So I started using osb,
>>> but then I got too many FPs, and as stated before: a lot of unhappy
>>> faces.
>>>
>>> Well, in the final batch there are only two with no FPs, the osb and
>>> sbph. sbph is 'the best' but not really suited for busy systems. Or you
>>> just buy more power... Maybe I will just train a lot more and return to
>>> the chain tokenizer. One thing I noticed: the average time per message
>>> in the final batch.
>>>
>> SBPH is usually best done using the Hash storage driver. Using an RDBMS
>> for SBPH is sub-optimal.
>>
>> ------------------------
>> I kind of figured that out. But is the hash storage driver always
>> faster? I used a DB because I wanted to see how many tokens would be
>> created, so that was mainly for more statistics. If the hash storage
>> driver is always faster, then a DB is not really useful for standalone
>> servers, I suppose.
>> ------------------------
 <p><span style="font-family: arial;">Hmm... technically the Hash driver 
 is +/- nothing other than a memory mapped file. A RDBMS is way more than 
 just that. So you can make your own conclusion which one should be 
 technically faster.</span></p>
 <p><span style="font-family: arial;"><br /></span></p>
>>> word:  0.04667 seconds
>>> chain: 0.09 seconds
>>> osb:   0.47333 seconds
>>> sbph:  4.49333 seconds
>>>
>>> Hope it helps someone.
>>>
>>> Greetings,
>>>
>>> Ed van der Salm
>>>
>>> The Netherlands
>>> Amstelveen
>>>
>> --
>> Kind Regards from Switzerland,
>>
>> Stevan Bajić
>>
>> ------------------------
>> Greetings!
>> @
>> ------------------------
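For reference, those averages are simply the final-batch wall time divided by the 300 messages (200 ham + 100 spam) in that batch:

    # Reproduce the per-message averages from the final batch (300 messages).
    final_batch_seconds = {"word": 14, "chain": 27, "osb": 142, "sbph": 1348}
    for tokenizer, secs in final_batch_seconds.items():
        print(f"{tokenizer}: {secs / 300:.5f} seconds")   # 0.04667, 0.09000, 0.47333, 4.49333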
 
 <p><span style="font-family: arial;"><span style="font-family: arial;">
 <pre>-- <br />Kind Regards from Switzerland,<br /><br />Stevan 
 Bajić</pre>
 </span></span></p>
