Hi all,

(Maybe a monospaced font makes it more readable)
There seem to be some questions about which tokenizer to use (for me
too), so I thought it would be nice to have some statistics.

First about the setup I've chosen:
It's a clean install with no training done at all. I've made 6
directories containing spam and 6 containing ham. I thought I'd read
somewhere to train 2 ham against 1 spam, so the number of files in
those directories is:
ham-01:     500
ham-02:    1000
ham-03:    1000
ham-04:    1000
ham-05:     500
ham-final:  200
spam-01:    250
spam-02:    500
spam-03:    500
spam-04:    500
spam-05:    250
spam-final: 100
Totaling: 6300 messages, 2100 spam and 4200 ham.

Some other info: the Graham and Burton algorithms, and a MySQL
database as backend. There were only 55 'recent' spam messages; they
came from my gmail-account spambox. All other mails were training
mails found somewhere on the internet, dating from 2003, 2004 and
2005. In the final batch there were 10 of the recent spams; the other
45 were spread over the other batches.
This was all done on a KVM virtual machine with 1 CPU and 1 GB of
memory. There were no other VMs running.
After that I trained using the word, chain, osb and sbph tokenizers.
I hope this gives me the insight I want.

So, now for the real deal:

Token / batch:     01     02     03     04     05  final  total  tokens@db
word:   FP:         0      0      2      2      0      1      5     205234
        FN:       100     94     31     28     26      3    282
  Time (sec):      37     58     63     70     34     14    276

chain:  FP:         0      0      3      2      0      1      6     825549
        FN:        77     59     10     10     14      3    173
  Time (sec):      46     79     90    111     46     27    399

osb:    FP:         1      1      3      3      0      0      8    2741757
        FN:        74     73     18     11     13      4    193
  Time (sec):      80    126    218    469    397    142   1432

sbph:   FP:         1      1      2      6      0      0     10   13904366
        FN:        65     60     10      6     10      3    154
  Time (sec):     544   3272   6843   8936   3532   1348   6h47m55s

Using osb my database grew to 299 MB; using sbph it grew to 741 MB.
The last column shows the number of tokens produced.
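To get a feel for why the token counts (and the database) grow so fast
from word to sbph, here is my own rough Python approximation of the
four tokenization schemes — not DSPAM's actual code, and the window of
5 for osb/sbph is an assumption on my part:

```python
def word_tokens(words):
    # Graham-style: one token per word
    return list(words)

def chain_tokens(words):
    # two-word chains: each word paired with its successor
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]

def osb_tokens(words, window=5):
    # Orthogonal Sparse Bigrams: each word paired with each of the
    # next window-1 words, tagged with the number of words skipped
    toks = []
    for i in range(len(words)):
        for d in range(1, window):
            if i + d < len(words):
                toks.append("%s <skip:%d> %s" % (words[i], d - 1, words[i + d]))
    return toks

def sbph_tokens(words, window=5):
    # Sparse Binary Polynomial Hashing: for every position, every
    # subset of the preceding window-1 words combined with the
    # current word -> up to 2**(window-1) tokens per position
    toks = []
    for i in range(len(words)):
        prior = list(range(max(0, i - window + 1), i))
        for mask in range(2 ** len(prior)):
            parts = [words[j] if (mask >> k) & 1 else "<skip>"
                     for k, j in enumerate(prior)]
            toks.append(" ".join(parts + [words[i]]))
    return toks

sample = "get cheap replica watches at unbeatable prices".split()
for name, fn in (("word", word_tokens), ("chain", chain_tokens),
                 ("osb", osb_tokens), ("sbph", sbph_tokens)):
    print("%-5s -> %3d tokens" % (name, len(fn(sample))))
```

Per message this emits roughly 1x (word), 1x (chain), 4x (osb) and
16x (sbph) tokens per word, which is at least in the same ballpark as
the tokens@db column above (the db counts unique tokens, so the ratios
there are smaller).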

What is this all telling me...

So, I'm a little disappointed. osb gave me more FPs and FNs than the
chain tokenizer did. Luckily, in the final batch there were no FPs;
that's the one thing people can't live with. This also means you will
probably need around 7500 ham messages and 3500+ (recent) spam
messages to get a proper training. The best training will (duh) be
done with real mail.

What more have I learned: if you are using the preferred tokenizer
(osb, if I've followed the mailing list right), the time needed to
process a message increases considerably. Looking at the final batch,
there is an increase from chain to osb of 115 seconds, from 27 to 142.
The biggest wait was with sbph; luckily I went out to jam with the
band ;). Stupid as I am, I didn't automatically start the final batch,
so at this moment I am waiting again.

My personal experience training with this mail batch (I've got
another 3K+ training messages) is not too good :(. Using sbph my db
filled up my disk; I did not expect it to grow that big. So I started
using osb, but then I got too many FPs, and as stated before: a lot
of unhappy faces.

Well, in the final batch there are only two tokenizers with no FPs:
osb and sbph. sbph is 'the best', but not really suited for busy
systems. Or you just buy more power... Maybe I will just train a lot
more and return to chain. One thing I noticed: the average time per
message in the final batch.
word:  0.04667 seconds
chain: 0.09    seconds
osb:   0.47333 seconds
sbph:  4.49333 seconds
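Those averages are just the final-batch times from the table divided
by the 300 messages (200 ham + 100 spam) in that batch; a quick
sanity check:

```python
# Final-batch processing times in seconds, taken from the table above.
# The final batch contains 200 ham + 100 spam = 300 messages.
final_secs = {"word": 14, "chain": 27, "osb": 142, "sbph": 1348}
messages = 300

for name in ("word", "chain", "osb", "sbph"):
    print("%-5s %.5f seconds" % (name, final_secs[name] / messages))
```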

Hope it helps someone.

Greetings,

Ed van der Salm

The Netherlands
Amstelveen