Hi all,
(Maybe a monospaced font makes it more readable)
There seem to be some questions about which tokenizer to use (I had
them too), so I thought it would be nice to have some statistics.
First about the setup I've chosen:
It's a clean install with no training done at all. I've made 6
directories containing spam and 6 containing ham. I thought I'd read
somewhere to train 2 ham against 1 spam, so the numbers of files in
those directories are:
ham-01: 500
ham-02: 1000
ham-03: 1000
ham-04: 1000
ham-05: 500
ham-final: 200
spam-01: 250
spam-02: 500
spam-03: 500
spam-04: 500
spam-05: 250
spam-final: 100
Totaling: 6300 messages, 2100 spam and 4200 ham.
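
For anyone who wants to reproduce a run like this: it can be driven
with something along the lines of the sketch below. This is only a
rough Python sketch, not the script I used; the dspam flags are from
memory and may differ per version (check dspam --help first), and
"bench" is just a made-up user name.

#!/usr/bin/env python
# Rough sketch: classify each message first (to count FP/FN per
# batch), then train it with its true class via corpus training.
# Flag names may differ between DSPAM versions; "bench" is made up.
import subprocess
from pathlib import Path

USER = "bench"

def run_dspam(path, extra_args):
    with open(path, "rb") as fh:
        return subprocess.run(["dspam", "--user", USER] + extra_args,
                              stdin=fh, capture_output=True, text=True)

for batch in ("01", "02", "03", "04", "05", "final"):
    fp = fn = 0
    for directory, truth in ((f"ham-{batch}", "innocent"),
                             (f"spam-{batch}", "spam")):
        for msg in sorted(Path(directory).iterdir()):
            # --classify prints a summary line instead of delivering
            verdict = run_dspam(msg, ["--classify"]).stdout
            is_spam = 'class="Spam"' in verdict
            if truth == "innocent" and is_spam:
                fp += 1
            elif truth == "spam" and not is_spam:
                fn += 1
            run_dspam(msg, [f"--class={truth}", "--source=corpus"])
    print(f"batch {batch}: FP={fp} FN={fn}")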
Some other info: the algorithm was graham burton, with a MySQL
database as backend. There were only 55 'recent' spam messages; they
came from my Gmail account's spam box. All the other mails were
training mails found somewhere on the internet, dating from 2003,
2004 and 2005. In the final batch there were 10 of the recent spams;
the other 45 were spread across the other batches.
This was all done on a KVM virtual machine with 1 CPU and 1 GB of
memory. There were no other VMs running.
After that I trained using the word, chain, osb and sbph tokenizers.
I hope this gives me the insight I want.
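
As background on why the token counts differ so much: word stores
single tokens, chain stores two-word chains, and osb/sbph work on a
sliding 5-token window, with sbph emitting every sparse combination
in that window. Here is a rough Python sketch of the feature counts
per tokenizer; it is simplified and NOT DSPAM's actual code, just an
illustration.

# Simplified illustration of the four tokenizer families.
from itertools import combinations

WINDOW = 5  # osb/sbph slide a 5-token window over the message

def word_tokens(words):
    # one feature per word
    return list(words)

def chain_tokens(words):
    # adjacent two-word chains
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def osb_tokens(words):
    # orthogonal sparse bigrams: each word paired with each of the
    # next WINDOW-1 words, keeping the skip distance in the feature
    return [f"{words[i]} <skip {d-1}> {words[i+d]}"
            for i in range(len(words))
            for d in range(1, WINDOW) if i + d < len(words)]

def sbph_tokens(words):
    # sparse binary polynomial hashing: the leading word combined
    # with every subset of the next WINDOW-1 words
    feats = []
    for i in range(len(words)):
        tail = words[i + 1:i + WINDOW]
        for r in range(len(tail) + 1):
            for combo in combinations(tail, r):
                feats.append(" ".join((words[i],) + combo))
    return feats

msg = "buy cheap meds online now limited offer".split()
for fn in (word_tokens, chain_tokens, osb_tokens, sbph_tokens):
    print(fn.__name__, len(fn(msg)))

On a 7-word sample this prints 7, 6, 18 and 63 features, which is
roughly the kind of blow-up you will see below in the tokens@db
column.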
So, now for the real deal:
Tokenizer / batch:    01    02    03    04    05  final  total  tokens@db
word   FP:             0     0     2     2     0      1      5     205234
       FN:           100    94    31    28    26      3    282
       Time (s):      37    58    63    70    34     14    276
chain  FP:             0     0     3     2     0      1      6     825549
       FN:            77    59    10    10    14      3    173
       Time (s):      46    79    90   111    46     27    399
osb    FP:             1     1     3     3     0      0      8    2741757
       FN:            74    73    18    11    13      4    193
       Time (s):      80   126   218   469   397    142   1432
sbph   FP:             1     1     2     6     0      0     10   13904366
       FN:            65    60    10     6    10      3    154
       Time (s):     544  3272  6843  8936  3532   1348  24475 (6h47m55s)
Using osb my database grew to 299 MB; using sbph it grew to 741 MB.
The last column shows the number of tokens produced.
What is this all telling me...
That I'm a little disappointed. osb gave me more FPs and FNs than the
chain tokenizer did. Luckily, in the final batch there were no FPs;
false positives are the one thing people can't live with. This also
means you will probably need around 7500 ham messages and 3500+
(recent) spam messages to get a proper training. The best training
will (duh) be done with real mail.
What more have I learned: if you are using the preferred tokenizer
(osb, if I've followed the mailing list right), the time needed to
process a message increases quite a lot. Looking at the final batch,
there is an increase from chain to osb of 115 seconds, from 27 to 142.
The biggest wait was with sbph; luckily I went out to jam with the
band ;). Stupid as I am, I didn't automatically start the final
batch, so at this moment I am waiting again.
My personal experience training with this mail batch (I've got
another 3K+ training messages) is not too good :(. Using sbph, my DB
filled up my disk; I did not expect it to grow that big. So I started
using osb, but then I got too many FPs, and as stated before: a lot
of unhappy faces.
Well, in the final batch there are only two with no FPs: osb and
sbph. sbph is 'the best', but not really suited for busy systems. Or
you just buy more power... Maybe I will just train a lot more and
return to chain. One thing I noticed: the average time per message in
the final batch.
word: 0.04667 seconds
chain: 0.09 seconds
osb: 0.47333 seconds
sbph: 4.49333 seconds
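
These are simply the final-batch times from the table divided by the
300 messages (200 ham + 100 spam) in that batch; a quick check:

# final-batch times (seconds) from the table, divided by 300 messages
for tok, secs in {"word": 14, "chain": 27, "osb": 142, "sbph": 1348}.items():
    print(f"{tok}: {secs / 300:.5f} seconds/message")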
Hope it helps someone.
Greetings,
Ed van der Salm
The Netherlands
Amstelveen