HTML MESSAGES ARE EVIL! Use plain text, please!
(and no, I'm not using mutt; that HTML blob below was displayed in Roundcube)
On Thu, 05 May 2011 14:43:30 +0200, Stevan Bajić wrote:
> <p>On Thu, 05 May 2011 14:14:30 +0200, Ed van der Salm wrote:</p> > <blockquote><!-- html ignored --><!-- head ignored --><!-- meta > ignored > --><span style="font-family: arial;">Since I'm not behind the > machine > (it's at home) for now only the info I know.<br > /></span></blockquote> > <p>Okay.</p> > <p> </p> > <blockquote><span style="font-family: arial;">I installed a virtual > machine which I restored after each run.</span></blockquote> > <p>'each run' = after 5 batches + the final run</p> > <p>OR</p> > <p>'each run' = after switching tokenizer?</p> > <p> </p> > <blockquote><span style="font-family: arial;"> I left as many > settings > as possible at default, so the learning method was TEFT (that's > default, > right?)</span></blockquote> > <p>Yes. TEFT is default. TEFT is well suited for the dull tokenizers > like WORD and CHAIN (default). When using one of the more > intelligent > tokenizers (aka: OSB/SBPH) then using TOE is better (TUM would work > too).</p> > <p> </p> > <blockquote><span style="font-family: arial;"> and all other > settings were untouched (apart from using MySQL db). Since I wanted > to > check dspam tokenizers, I left out all other stuff like DNSBL and > AV.</span></blockquote> > <p>Okay.</p> > <p> </p> > <blockquote><span style="font-family: arial;">I have not really > checked > (myself I mean) all spams/hams; they came from a file which could be > used > for Mailscanner training. Looking at the web interface, it looks > like they are all in English. It also looked like they were mail from > mailing lists, et cetera. If someone has a nice batch of 'real' mail > to > throw at it, just send me the zip... :)<br /></span></blockquote> > <p>Then addressing this message to the DSPAM mailing list would be > helpful. 
I think you wrote the message only to me by mistake.</p> > <p> </p> > <blockquote>(It seems like my webmail-client doesn't allow 'nice' > inserts, so I've put my comments between > ------------------------)<br /></blockquote> > <p>Okay.</p> > <p> </p> > <blockquote>On Thursday, 05-05-2011 at 11:43, Stevan > Bajić wrote:<br > /></blockquote> > <blockquote><span style="font-family: arial;"> > <blockquote style="border-bottom: 0px; border-left: #22437f 2px > solid; padding-bottom: 0px; margin: 0px 0px 0px 5px; > padding-left: > 5px; padding-right: 0px; border-top: 0px; border-right: 0px; > padding-top: 0px;">On Thu, 05 May 2011 10:22:12 +0200, Ed van der > Salm > wrote:<br /><br />> Hi all,<br />><br />> (Maybe a > monospaced > font makes it more readable)<br />> There seem to be some > questions > about the tokenizer to use (for me <br />> too),<br />> so I > thought it would be nice to have some statistics.<br />><br > />> > First about the setup I've chosen:<br />> It's a clean install > with > no training done at all. I've made 6 <br />> directories<br />> > containing spam and 6 containing ham. I thought I've read somewhere > <br > />> to<br />> train 2 ham against 1 spam so in those > directories > the number of files<br />> are:<br />> ham-01: 500<br />> > ham-02: 1000<br />> ham-03: 1000<br />> ham-04: 1000<br />> > ham-05: 500<br />> ham-final: 200<br />> spam-01: 250<br > />> > spam-02: 500<br />> spam-03: 500<br />> spam-04: 500<br />> > spam-05: 250<br />> spam-final: 100<br />> Totaling: 6300 > messages, 2100 spam and 4200 ham.<br />><br />> Some other > info: > Algorithm graham burton, and a MySQL database as<br />> backend. > There were only 55 'recent' spam messages. They came from my<br > />> > gmail-account spambox. All other mails were training mails found<br > />> somewhere on the internet dating from 2003, 2004 and 2005. 
In the > final <br />> batch<br />> there were 10 of the recent spams, > the > other 45 were spread in the <br />> other<br />> batches.<br > />> This all was done on a KVM virtual machine, 1 CPU and 1 GB > mem. > There <br />> were<br />> no other VMs running.<br />> > After > that I trained using the word, chain, osb and sbph tokenizers. I hope this<br > />> gives me the insight I want.<br />><br />> So, now for > the > real deal:<br />><br />> Token / batch: 01 02 03 04 05 final > total > tokens@db<br />> word: FP: 0 0 2 2 0 1 5 205234<br />> FN: 100 > 94 > 31 28 26 3 282<br />> Time sec: 37 58 63 70 34 14 276 sec<br > />><br />> chain: FP: 0 0 3 2 0 1 6 825549<br />> FN: 77 59 > 10 > 10 14 3 173<br />> Time: 46 79 90 111 46 27 399 sec<br />><br > />> osb: FP: 1 1 3 3 0 0 8 2741757<br />> FN: 74 73 18 11 13 4 > 193<br />> Time: 80 126 218 469 397 142 1432 sec<br />><br > />> > sbph: FP: 1 1 2 6 0 0 10 13904366<br />> FN: 65 60 10 6 10 3 > 154<br > />> Time: 544 3272 6843 8936 3532 1348 6h47m55s<br />><br > />> > Using osb my database grew up to 299 MB. Using sbph my database grew > <br > />> up to<br />> 741 MB. The last column shows the number of > tokens produced.<br />><br />Some questions:<br />1) What > learning > method have you used? TEFT? TOE? TUM?<br />2) Are those batches > cumulative or have you wiped the data after each <br />training > batch?<br />3) What preferences have you in place?<br />4) Have you > done > anything in between the training batches? (stuff like <br />using > dspam_clean etc)<br />5) Have you used DNSBL within DSPAM?<br />6) > Are > the SPAM and HAM messages in the same language?<br /><br > />------------------------<br />I think I answered these questions > in my > intro above; if you miss something, tell me, then I will look into > that. But that will be sometime late tonight.<br />(So, for 6: I > haven't > really looked...)<br />------------------------</blockquote> > </span></blockquote> > <p><span style="font-family: arial;">Yes. 
You have done that in the > above response.</span></p> > <p><span style="font-family: arial;"><br /></span></p> > <blockquote><span style="font-family: arial;"> > <blockquote style="border-bottom: 0px; border-left: #22437f 2px > solid; padding-bottom: 0px; margin: 0px 0px 0px 5px; > padding-left: > 5px; padding-right: 0px; border-top: 0px; border-right: 0px; > padding-top: 0px;">> What is this all telling me...<br />><br > />> That I'm a little disappointed. osb gave me more FP and FN > than > the <br />> chain<br />> tokenizer did.<br />><br />Usually > OSB/SBPH result in less training (FP/FN) in the long run while <br > />WORD will constantly require training and CHAIN is somewhere in > between <br />WORD and OSB/SBPH.<br /><br > />------------------------<br > />I would agree with that.<br />The reason for splitting all > messages into > batches was to see how the ongoing training would change the > results. And I think the figures say the same as you. Just like I > expected btw (but not as big of a difference as I expected)<br > />------------------------<br /><br />> Luckily in the final > batch > there were no FP's. That's the<br />> one thing people can't live > with. This also means you will probably <br />> need<br />> > around 7500 ham messages and 3500+ (recent) spam messages to get > a<br > />> proper training. The best training will be (duh) using > real > mail.<br />><br />I would not endorse that statement. You > cannot > conclude from your <br />test that in general one needs 7.5K > messages to > get a decent result. It <br />all depends on what you train and how you > train.<br /><br />------------------------<br />OK, true... As far > as I > could see 4.2K ham and 2.1K spam just wasn't enough. And since > I > saw better results after more training I thought let's just shout > some > numbers... 
If somebody wants all messages I used, I can zip them > and > post them somewhere.<br />------------------------<br /><br />> > What > more have I learned: If you are using the preferred tokenizer <br > />> > (osb,<br />> if I followed the mailing list right) the time > needed to > process a <br />> message<br />> increases quite a lot. Looking > at > the final batch there is an <br />> increase<br />> from chain > to > osb by 115 seconds, 27 to 142.<br />><br />This strongly depends > on > the training method used. I think you used <br />TEFT on both > algorithms. Right?<br />------------------------<br />Yep, I only > used > TEFT (That's the default, right?) Oh, one thing: dspam 3.9.1rc1 > :)<br > />------------------------<br /></blockquote> > </span></blockquote> > <p><span style="font-family: arial;"> > <p>Yes. TEFT is the default.</p> > <p> </p> > </span></p> > <blockquote><span style="font-family: arial;"> > <blockquote><span style="font-family: arial;"> <br />> The > biggest > wait was using sbph, luckily I went out to jam with the <br />> > band<br />> ;). Stupid as I am, I didn't automatically start the > final > batch, so <br />> at<br />> this moment I am waiting again.<br > />><br />> My personal experience training with this mail > batch > (I've got <br />> another<br />> 3K+ training messages) is not > too > good :(. Using sbph my db filled up my<br />> disk. I did not > expect > it to grow that big. So, I started using osb, <br />> but<br > />> > then I got too many FPs, and as stated before: a lot of unhappy <br > />> faces.<br />><br />> Well, in the final batch there are > only two with no FPs, the osb and<br />> sbph. sbph is 'the > best' > but not really suited for busy systems. Or <br />> you<br />> > just > buy more power... Maybe I will just train a lot more and return <br > />> to<br />> the chain. One thing I noticed: average time per > message in the final<br />> batch.<br />><br />SBPH is usually > best done using the Hash storage driver. 
Using an RDBMS <br />for > SBPH is > suboptimal.<br /><br />------------------------<br />I kind of > figured > that out. But is the hash storage driver always faster? I used a DB > because I wanted to see how many tokens would be created. So that > was > mainly for more statistics. If the hash storage driver is always > faster, > then a DB is not really useful for standalone servers I suppose.<br > />------------------------<br /></span></blockquote> > </span></blockquote> > <p><span style="font-family: arial;">Hmm... technically the Hash > driver > is more or less nothing other than a memory-mapped file. An RDBMS is way more > than > just that. So you can draw your own conclusion about which one should be > technically faster.</span></p> > <p><span style="font-family: arial;"><br /></span></p> > <blockquote><span style="font-family: arial;"> > <blockquote><span style="font-family: arial;">> word: 0.04667 > seconds<br />> chain: 0.09 seconds<br />> osb: 0.47333 > seconds<br > />> sbph: 4.49333 seconds<br />><br />> Hope it helps > someone.<br />><br />> Greetings,<br />><br />> Ed van > der > Salm<br />><br />> The Netherlands<br />> Amstelveen<br > /><br > />-- <br />Kind Regards from Switzerland,<br /><br />Stevan Bajić<br > /><br />------------------------<br />Greetings!<br />@<br > />------------------------<br /><br /><br > > > />------------------------------------------------------------------------------<br > > />WhatsUp Gold - Download Free Network Management Software<br />The > most intuitive, comprehensive, and cost-effective network <br > />management toolset available today. 
Delivers lowest > initial > <br />acquisition cost and overall TCO of any competing solution.<br > /><a class="normal-link" > > > href="http://p.sf.net/sfu/whatsupgold-sd">http://p.sf.net/sfu/whatsupgold-sd</a><br > > />_______________________________________________<br />Dspam-user > mailing list<br /><a class="normal-link" > > > href="mailto:Dspam-user@lists.sourceforge.net">Dspam-user@lists.sourceforge.net</a><br > > /><a class="normal-link" > > > href="https://lists.sourceforge.net/lists/listinfo/dspam-user">https://lists.sourceforge.net/lists/listinfo/dspam-user</a></span></blockquote> > <p> </p> > </span></blockquote> > <p><span style="font-family: arial;"><span style="font-family: > arial;"> > <pre>-- <br />Kind Regards from Switzerland,<br /><br />Stevan > Bajić</pre> > </span></span></p>
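
For anyone who wants to check the numbers quoted above: the per-message averages (word: 0.04667, chain: 0.09, osb: 0.47333, sbph: 4.49333 seconds) appear to be the final-batch time divided by the 300 messages in that batch (200 ham + 100 spam), and the sbph total of 6h47m55s is the sum of its six batch times. Below is a minimal Python sketch of that arithmetic. It only uses figures from the quoted table; the names final_batch_seconds, seconds_per_message and format_hms are made up for illustration and are not part of DSPAM.

```python
# Final-batch wall-clock times in seconds, copied from the quoted results table.
final_batch_seconds = {"word": 14, "chain": 27, "osb": 142, "sbph": 1348}

# The final batch holds ham-final (200) plus spam-final (100) messages.
FINAL_BATCH_MESSAGES = 200 + 100


def seconds_per_message(total_seconds, messages=FINAL_BATCH_MESSAGES):
    """Average processing time per message for one batch."""
    return total_seconds / messages


def format_hms(total_seconds):
    """Render a number of seconds as e.g. '6h47m55s'."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h{minutes:02d}m{seconds:02d}s"


# Per-tokenizer averages, rounded to five decimals as in the quoted message.
averages = {
    name: round(seconds_per_message(seconds), 5)
    for name, seconds in final_batch_seconds.items()
}

# sbph batch times (01..05, final) from the same table.
sbph_batch_seconds = [544, 3272, 6843, 8936, 3532, 1348]
sbph_total = format_hms(sum(sbph_batch_seconds))

for name, average in averages.items():
    print(f"{name}: {average} seconds")
print(f"sbph total: {sbph_total}")
```

The same division for an earlier batch would need that batch's own message count (for example 500 ham + 250 spam = 750 for batch 01), so the final-batch averages should not be read as averages over the whole run.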