HTML MESSAGES ARE EVIL!

 Use plain text, please!

 (And no, I'm not using mutt; that HTML blob below was displayed in 
 Roundcube.)

 On Thu, 05 May 2011 14:43:30 +0200, Stevan Bajić wrote:
> <p>On Thu, 05 May 2011 14:14:30 +0200, Ed van der Salm wrote:</p>
>  <blockquote><!-- html ignored --><!-- head ignored --><!-- meta 
> ignored
>  --><span style="font-family: arial;">Since I'm not behind the 
> machine
>  (it's at home) for now only the info I know.<br 
> /></span></blockquote>
>  <p>Okay.</p>
>  <p>&nbsp;</p>
>  <blockquote><span style="font-family: arial;">I installed a virtual
>  machine which I restored after each run.</span></blockquote>
>  <p>'each run' = after 5 batches + the final run</p>
>  <p>OR</p>
>  <p>'each run' = after switching tokenizer?</p>
>  <p>&nbsp;</p>
>  <blockquote><span style="font-family: arial;"> I left as many 
> settings
>  as possible at default, so the learning method was TEFT (that's 
> default,
>  right?)</span></blockquote>
>  <p>Yes. TEFT is the default. TEFT is well suited to the simpler tokenizers
>  like WORD and CHAIN (default). When using one of the more 
> intelligent
>  tokenizers (i.e. OSB/SBPH), TOE is the better choice (TUM would work
>  too).</p>
>  <p>&nbsp;</p>
>  <blockquote><span style="font-family: arial;">&nbsp;and all other
>  settings were untouched (apart from using a MySQL db). Since I wanted 
> to
>  check dspam tokenizers, I left out all other stuff like DNSBL and
>  AV.</span></blockquote>
>  <p>Okay.</p>
>  <p>&nbsp;</p>
>  <blockquote><span style="font-family: arial;">I have not really 
> checked
>  (myself, I mean) all the spams/hams; they came from a file which could be 
> used
>  for Mailscanner&nbsp;training. Looking at the web interface, it looks
>  like they are all English. It also looked like they were mail from
>  mailing lists etcetera. If someone has a nice batch of 'real' mail 
> to
>  throw at it, just send me the zip... :)<br /></span></blockquote>
>  <p>Then addressing this message to the DSPAM mailing list would be
>  helpful. I think you sent it only to me by mistake.</p>
>  <p>&nbsp;</p>
>  <blockquote>(It seems like my webmail-client doesn't allow 'nice'
>  inserts, so I've&nbsp;put my comments between
>  ------------------------)<br /></blockquote>
>  <p>Okay.</p>
>  <p>&nbsp;</p>
>  <blockquote>Op Donderdag, 05-05-2011 om 11:43 schreef Stevan 
> Bajić:<br
>  /></blockquote>
>  <blockquote><span style="font-family: arial;">
>  <blockquote style="border-bottom: 0px; border-left: #22437f  2px
>  solid; padding-bottom: 0px; margin: 0px  0px  0px  5px; 
> padding-left:
>  5px; padding-right: 0px; border-top: 0px; border-right: 0px;
>  padding-top: 0px;">On Thu, 05 May 2011 10:22:12 +0200, Ed van der 
> Salm
>  wrote:<br /><br />&gt; Hi all,<br />&gt;<br />&gt; (Maybe a 
> monospaced
>  font makes it more readable)<br />&gt; There seem to be some 
> questions
>  about the tokenizer to use (for me <br />&gt; too),<br />&gt; so I
>  thought it would be nice to have some statistics.<br />&gt;<br 
> />&gt;
>  First about the setup I've chosen:<br />&gt; It's a clean install 
> with
>  no training done at all. I've made 6 <br />&gt; directories<br />&gt;
>  containing spam and 6 containing ham. I thought I'd read somewhere 
> <br
>  />&gt; to<br />&gt; train 2 ham against 1 spam, so in those 
> directories
>  the number of files<br />&gt; are:<br />&gt; ham-01: 500<br />&gt;
>  ham-02: 1000<br />&gt; ham-03: 1000<br />&gt; ham-04: 1000<br />&gt;
>  ham-05: 500<br />&gt; ham-final: 200<br />&gt; spam-01: 250<br 
> />&gt;
>  spam-02: 500<br />&gt; spam-03: 500<br />&gt; spam-04: 500<br />&gt;
>  spam-05: 250<br />&gt; spam-final: 100<br />&gt; Totaling: 6300
>  messages, 2100 spam and 4200 ham.<br />&gt;<br />&gt; Some other 
> info:
>  Algorithm graham burton, and a MySQL database as<br />&gt; backend.
>  There were only 55 'recent' spam messages. They came from my<br 
> />&gt;
>  gmail-account spambox. All other mails were training mails found<br
>  />&gt; somewhere on the internet dating 2003, 2004 and 2005. In the
>  final <br />&gt; batch<br />&gt; there were 10 of the recent spams, 
> the
>  other 45 were spread in the <br />&gt; other<br />&gt; batches.<br
>  />&gt; This all was done on a KVM virtual machine, 1 cpu and 1Gb 
> mem.
>  There <br />&gt; were<br />&gt; no other VM's running.<br />&gt; 
> After
>  that I trained using the word, chain, osb and sbph. I hope this<br
>  />&gt; gives me the insight I want.<br />&gt;<br />&gt; So, now for 
> the
>  real deal:<br />&gt;<br />&gt; Token / batch: 01 02 03 04 05 final 
> total
>  tokens@db<br />&gt; word: FP: 0 0 2 2 0 1 5 205234<br />&gt; FN: 100 
> 94
>  31 28 26 3 282<br />&gt; Time sec: 37 58 63 70 34 14 276 sec<br
>  />&gt;<br />&gt; chain: FP: 0 0 3 2 0 1 6 825549<br />&gt; FN: 77 59 
> 10
>  10 14 3 173<br />&gt; Time: 46 79 90 111 46 27 399 sec<br />&gt;<br
>  />&gt; osb: FP: 1 1 3 3 0 0 8 2741757<br />&gt; FN: 74 73 18 11 13 4
>  193<br />&gt; Time: 80 126 218 469 397 142 1432 sec<br />&gt;<br 
> />&gt;
>  sbph: FP: 1 1 2 6 0 0 10 13904366<br />&gt; FN: 65 60 10 6 10 3 
> 154<br
>  />&gt; Time: 544 3272 6843 8936 3532 1348 6h47m55s<br />&gt;<br 
> />&gt;
>  Using osb my database grew up to 299Mb. Using sbph my database grew 
> <br
>  />&gt; up to<br />&gt; 741Mb. The last column shows the number of
>  tokens produced.<br />&gt;<br />Some questions:<br />1) What 
> learning
>  method have you used? TEFT? TOE? TUM?<br />2) Are those batches
>  cumulative or have you wiped the data after each <br />training
>  batch?<br />3) What preferences have you in place?<br />4) Have you 
> done
>  anything in between the training batches? (stuff like <br />using
>  dspam_clean etc)<br />5) Have you used DNSBL within DSPAM?<br />6) 
> Are
>  the SPAM and HAM messages in the same language?<br /><br
>  />------------------------<br />I think I answered these questions 
> in my
>  intro above; if anything is missing, tell me and I will look into
>  it. But that will be sometime late tonight.<br />(So, for 6: I 
> haven't
>  really looked...)<br />------------------------</blockquote>
>  </span></blockquote>
>  <p><span style="font-family: arial;">Yes. You have done that in the
>  above response.</span></p>
>  <p><span style="font-family: arial;"><br /></span></p>
>  <blockquote><span style="font-family: arial;">
>  <blockquote style="border-bottom: 0px; border-left: #22437f  2px
>  solid; padding-bottom: 0px; margin: 0px  0px  0px  5px; 
> padding-left:
>  5px; padding-right: 0px; border-top: 0px; border-right: 0px;
>  padding-top: 0px;">&gt; What is this all telling me...<br />&gt;<br
>  />&gt; That I'm a little disappointed. osb gave me more FP and FN 
> than
>  the <br />&gt; chain<br />&gt; tokenizer did.<br />&gt;<br />Usually
>  OSB/SBPH result in less training (FP/FN) in the long run while <br
>  />WORD will constantly require training and CHAIN is somewhere in
>  between <br />WORD and OSB/SBPH.<br /><br 
> />------------------------<br
>  />I would agree with that.<br />The reason for splitting all 
> messages in
>  batches was to see how the ongoing training would change the
>  results. And I think the figures say the same as you. Just like I
>  expected btw (but not as big of a difference as I expected)<br
>  />------------------------<br /><br />&gt; Luckily in the final 
> batch
>  there were no FP's. That's the<br />&gt; one thing people can't live
>  with. This also means you will probably <br />&gt; need<br />&gt;
>  around 7500 ham messages and 3500+ (recent) spam messages to get 
> a<br
>  />&gt; proper training. The best training will be (d&ucirc;h) using 
> real
>  mail.<br />&gt;<br />I would not underwrite that statement. You can 
> not
>  conclude from your <br />test that in general one needs 7.5K 
> messages to
>  get a decent result. It <br />all depends what you train and how you
>  train.<br /><br />------------------------<br />OK, true... As far 
> as I
>  could see 4.2K ham&nbsp;and 2.1K spam just wasn't enough. And since 
> I
>  saw better results after more training, I thought let's just shout 
> some
>  numbers... If somebody wants all the messages I used, I can zip them 
> and
>  post it somewhere.<br />------------------------<br /><br />&gt; 
> What
>  more have I learned: If you are using the preferred tokenizer <br 
> />&gt;
>  (osb,<br />&gt; if I followed the mailing list right) the time 
> needed to
>  process a <br />&gt; message<br />&gt; increases pretty much. Looking 
> at
>  the final batch there is an <br />&gt; increase<br />&gt; from chain 
> to
>  osb by 115 seconds, 27 to 142.<br />&gt;<br />This strongly depends 
> on
>  the used training method. I think you used <br />TEFT on both
>  algorithms. Right?<br />------------------------<br />Yep, I only 
> used
>  TEFT (That's the default, right?) Ow, one thing: dspam 3.9.1rc1 
> :)<br
>  />------------------------<br /></blockquote>
>  </span></blockquote>
>  <p><span style="font-family: arial;">
>  <p>Yes. TEFT is the default.</p>
>  <p>&nbsp;</p>
>  </span></p>
>  <blockquote><span style="font-family: arial;">
>  <blockquote><span style="font-family: arial;"> <br />&gt; The 
> biggest
>  wait was using sbph, luckily i went out to jam with the <br />&gt;
>  band<br />&gt; ;). Stupid as I am, I didn't automatically start the 
> final
>  batch, so <br />&gt; at<br />&gt; this moment I am waiting again.<br
>  />&gt;<br />&gt; My personal experience training with this mail 
> batch
>  (I've got <br />&gt; another<br />&gt; 3K+ training messages) is not 
> too
>  good :(. Using sbph my db filled up my<br />&gt; disc. I did not 
> expect
>  it to grow that big. So, I started using osb, <br />&gt; but<br 
> />&gt;
>  then I got too many FPs, and as stated before: a lot of unhappy <br
>  />&gt; faces.<br />&gt;<br />&gt; Well, in the final batch there are
>  only two with no FP's, the osb and<br />&gt; sbph. sbph is 'the 
> best'
>  but not really suited for busy systems. Or <br />&gt; you<br />&gt; 
> just
>  buy more power... Maybe I will just train a lot more and return <br
>  />&gt; to<br />&gt; the chain. One thing I noticed: average time per
>  message in the final<br />&gt; batch.<br />&gt;<br />SBPH is usually
>  best done using the Hash storage driver. Using a RDBMS <br />for 
> SBPH is
>  suboptimal.<br /><br />------------------------<br />I kind of 
> figured
>  that out. But is the hash storage driver always faster? I used a DB
>  because I wanted to see how many tokens would be created. So that 
> was
>  mainly for more statistics. If the hash storage driver is always 
> faster,
>  then a DB is not really useful for standalone servers, I suppose.<br
>  />------------------------<br /></span></blockquote>
>  </span></blockquote>
>  <p><span style="font-family: arial;">Hmm... technically the Hash 
> driver
>  is more or less nothing other than a memory-mapped file. An RDBMS is way more 
> than
>  just that. So you can draw your own conclusion as to which one should be
>  technically faster.</span></p>
>  <p><span style="font-family: arial;"><br /></span></p>
>  <blockquote><span style="font-family: arial;">
>  <blockquote><span style="font-family: arial;">&gt; word: 0.04667
>  seconds<br />&gt; chain: 0.09 seconds<br />&gt; osb: 0.47333 
> seconds<br
>  />&gt; sbph: 4.49333 seconds<br />&gt;<br />&gt; Hope it helps
>  someone.<br />&gt;<br />&gt; Greetings,<br />&gt;<br />&gt; Ed van 
> der
>  Salm<br />&gt;<br />&gt; The Netherlands<br />&gt; Amstelveen<br 
> /><br
>  />-- <br />Kind Regards from Switzerland,<br /><br />Stevan Bajić<br
>  /><br />------------------------<br />Greetings!<br />@<br
>  />------------------------<br /><br /><br
>
> 
> />------------------------------------------------------------------------------<br
>
>  />WhatsUp Gold - Download Free Network Management Software<br />The
>  most intuitive, comprehensive, and cost-effective network <br
>  />management toolset available today.&nbsp;&nbsp;Delivers lowest 
> initial
>  <br />acquisition cost and overall TCO of any competing solution.<br
>  /><a class="normal-link"
>
> 
> href="http://p.sf.net/sfu/whatsupgold-sd";>http://p.sf.net/sfu/whatsupgold-sd</a><br
>
>  />_______________________________________________<br />Dspam-user
>  mailing list<br /><a class="normal-link"
>
> 
> href="mailto:Dspam-user@lists.sourceforge.net";>Dspam-user@lists.sourceforge.net</a><br
>
>  /><a class="normal-link"
>
> 
> href="https://lists.sourceforge.net/lists/listinfo/dspam-user";>https://lists.sourceforge.net/lists/listinfo/dspam-user</a></span></blockquote>
>  <p>&nbsp;</p>
>  </span></blockquote>
>  <p><span style="font-family: arial;"><span style="font-family: 
> arial;">
>  <pre>-- <br />Kind Regards from Switzerland,<br /><br />Stevan
>  Bajić</pre>
>  </span></span></p>
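
 For anyone who wants to re-check the numbers quoted above, here is a
 rough sketch in Python. The figures are copied from Ed's table; the
 names (RESULTS, seconds_per_message, error_rates) are made up for this
 illustration and are not part of dspam. It assumes FP is counted
 against the ham in a batch and FN against the spam, which the thread
 does not spell out.

```python
# Figures copied from the quoted results table; batch order is 01..05, final.
HAM_PER_BATCH = [500, 1000, 1000, 1000, 500, 200]
SPAM_PER_BATCH = [250, 500, 500, 500, 250, 100]

# tokenizer -> (false positives, false negatives, seconds) per batch
RESULTS = {
    "word": ([0, 0, 2, 2, 0, 1], [100, 94, 31, 28, 26, 3],
             [37, 58, 63, 70, 34, 14]),
    "chain": ([0, 0, 3, 2, 0, 1], [77, 59, 10, 10, 14, 3],
              [46, 79, 90, 111, 46, 27]),
    "osb": ([1, 1, 3, 3, 0, 0], [74, 73, 18, 11, 13, 4],
            [80, 126, 218, 469, 397, 142]),
    "sbph": ([1, 1, 2, 6, 0, 0], [65, 60, 10, 6, 10, 3],
             [544, 3272, 6843, 8936, 3532, 1348]),
}


def seconds_per_message(tokenizer, batch=-1):
    """Average processing time per message in one batch (default: final)."""
    seconds = RESULTS[tokenizer][2][batch]
    return seconds / (HAM_PER_BATCH[batch] + SPAM_PER_BATCH[batch])


def error_rates(tokenizer, batch=-1):
    """Return (FP rate over ham, FN rate over spam) for one batch.

    Assumption: FP = ham flagged as spam, FN = spam that got through.
    """
    fp, fn, _ = RESULTS[tokenizer]
    return fp[batch] / HAM_PER_BATCH[batch], fn[batch] / SPAM_PER_BATCH[batch]


if __name__ == "__main__":
    for name in RESULTS:
        fp_rate, fn_rate = error_rates(name)
        print(f"{name:5s} final batch: {seconds_per_message(name):.5f} s/msg, "
              f"FP {fp_rate:.1%} of ham, FN {fn_rate:.1%} of spam")
```

 The per-message times for the final batch (300 messages) come out at
 the averages Ed quoted: 0.04667, 0.09, 0.47333 and 4.49333 seconds.
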


