------------------------ Original Message -------------------------
Subject: Some tokenizer statistics
From: "Ed van der Salm"
Date: Thu, 5 May 2011, 01:02
To: "dspam-user@lists.sourceforge.net"
--------------------------------------------------------------------------
Hi all,
(A monospaced font probably makes this more readable.)
There seem to be some questions about which tokenizer to use (for me
too), so I thought it would be nice to have some statistics.
First, about the setup I've chosen:
It's a clean install with no training done at all. I've made 6
directories containing spam and 6 containing ham. I thought I'd read
somewhere that you should train 2 ham against 1 spam, so the number of
files in those directories is:
ham-01:     500
ham-02:     1000
ham-03:     1000
ham-04:     1000
ham-05:     500
ham-final:  200
spam-01:    250
spam-02:    500
spam-03:    500
spam-04:    500
spam-05:    250
spam-final: 100
Totaling: 6300 messages, 2100 spam and 4200 ham.
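As a quick sanity check on those counts, here is a minimal Python sketch that sums the batches and confirms the 2:1 ham-to-spam ratio in every batch. The dictionary and variable names are illustrative only and have nothing to do with dspam itself.

```python
# Batch sizes exactly as listed in the mail (number of files per directory).
ham = {"01": 500, "02": 1000, "03": 1000, "04": 1000, "05": 500, "final": 200}
spam = {"01": 250, "02": 500, "03": 500, "04": 500, "05": 250, "final": 100}

ham_total = sum(ham.values())
spam_total = sum(spam.values())
total = ham_total + spam_total

# Ham-to-spam ratio for each batch; the setup aims for 2 ham per 1 spam.
ratio_per_batch = {name: ham[name] / spam[name] for name in ham}

print(f"ham={ham_total} spam={spam_total} total={total}")
for name, ratio in ratio_per_batch.items():
    print(f"batch {name}: {ratio:.1f} ham per spam")
```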
Some other info: Algorithm graham burton, and a MySQL database as
backend. There were only 55 'recent' spam messages. They came from the
spam box of my Gmail account. All other mails were training mails found
somewhere on the internet, dating from 2003, 2004 and 2005. In the final
batch there were 10 of the recent spams; the other 45 were spread over
the other batches.
This was all done on a KVM virtual machine with 1 CPU and 1 GB of
memory. No other VMs were running.
After that I trained using the word, chain, osb and sbph tokenizers. I
hope this gives me the insight I want.
So, now for the real deal:
Token / batch      01    02    03    04    05 final      total  tokens@db
word   FP           0     0     2     2     0     1          5     205234
       FN         100    94    31    28    26     3        282
       Time (s)    37    58    63    70    34    14    276 sec
chain  FP           0     0     3     2     0     1          6     825549
       FN          77    59    10    10    14     3        173
       Time (s)    46    79    90   111    46    27    399 sec
osb    FP           1     1     3     3     0     0          8    2741757
       FN          74    73    18    11    13     4        193
       Time (s)    80   126   218   469   397   142   1432 sec
sbph   FP           1     1     2     6     0     0         10   13904366
       FN          65    60    10     6    10     3        154
       Time (s)   544  3272  6843  8936  3532  1348   6h47m55s
Using osb, my database grew to 299 MB. Using sbph, it grew to 741 MB.
The last column shows the number of tokens produced.
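To put the FP and FN counts in proportion, the sketch below turns the per-batch numbers from the table into overall rates. It assumes the usual meaning of the terms (FP = ham flagged as spam, FN = spam let through) and divides by the 4200 ham and 2100 spam messages from the setup. The rates are derived here and are not figures from the original run; the names `results` and `rates` are illustrative.

```python
HAM_TOTAL = 4200   # all ham messages across the six batches
SPAM_TOTAL = 2100  # all spam messages across the six batches

# Per-batch counts (01, 02, 03, 04, 05, final) copied from the table.
results = {
    "word":  {"fp": [0, 0, 2, 2, 0, 1], "fn": [100, 94, 31, 28, 26, 3], "tokens": 205234},
    "chain": {"fp": [0, 0, 3, 2, 0, 1], "fn": [77, 59, 10, 10, 14, 3], "tokens": 825549},
    "osb":   {"fp": [1, 1, 3, 3, 0, 0], "fn": [74, 73, 18, 11, 13, 4], "tokens": 2741757},
    "sbph":  {"fp": [1, 1, 2, 6, 0, 0], "fn": [65, 60, 10, 6, 10, 3], "tokens": 13904366},
}

def rates(name):
    """Return (fp_total, fn_total, fp_rate, fn_rate) for one tokenizer."""
    row = results[name]
    fp_total = sum(row["fp"])
    fn_total = sum(row["fn"])
    # Assumption: FP is measured against ham, FN against spam.
    return fp_total, fn_total, fp_total / HAM_TOTAL, fn_total / SPAM_TOTAL

for name, row in results.items():
    fp_total, fn_total, fp_rate, fn_rate = rates(name)
    print(f"{name:6s} FP {fp_total:3d} ({fp_rate:.2%})  "
          f"FN {fn_total:3d} ({fn_rate:.2%})  tokens {row['tokens']}")
```

Read this way, chain and sbph miss the least spam overall, while the FP count creeps up as the tokenizers produce more tokens.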
What is this all telling me...
That I'm a little disappointed. osb gave me more FPs and FNs than the
chain tokenizer did. Luckily, osb had no FPs in the final batch. That's
the one thing people can't live with. This also means you will probably
need around 7500 ham messages and 3500+ (recent) spam messages to get
proper training. The best training will (duh) be done using real mail.
What else have I learned: if you are using the preferred tokenizer (osb,
if I followed the mailing list correctly), the time needed to process a
message increases quite a bit. Looking at the final batch, going from
chain to osb adds 115 seconds (27 to 142).
The biggest wait was using sbph; luckily I went out to jam with the
band ;). Stupid as I am, I didn't start the final batch automatically,
so at this moment I am waiting again.
My personal experience training with this mail batch (I've got another
3K+ training messages) is not too good :(. Using sbph, my db filled up
my disk. I did not expect it to grow that big. So I started using osb,
but then I got too many FPs and, as stated before, a lot of unhappy
faces.
Well, in the final batch only two tokenizers had no FPs: osb and sbph.
sbph is 'the best' but not really suited for busy systems. Or you just
buy more power... Maybe I will just train a lot more and return to
chain. One thing I noticed: the average time per message in the final
batch.
word:  0.04667 seconds
chain: 0.09 seconds
osb:   0.47333 seconds
sbph:  4.49333 seconds
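Those averages are simply the final-batch times from the table divided by the 300 messages in that batch (200 ham + 100 spam). A minimal sketch of the arithmetic, with illustrative names, which also shows the slowdown relative to the word tokenizer:

```python
FINAL_BATCH_MESSAGES = 200 + 100  # ham-final + spam-final

# Wall-clock seconds for the final batch, taken from the table.
final_batch_seconds = {"word": 14, "chain": 27, "osb": 142, "sbph": 1348}

per_message = {
    name: seconds / FINAL_BATCH_MESSAGES
    for name, seconds in final_batch_seconds.items()
}

for name, seconds in per_message.items():
    slowdown = seconds / per_message["word"]
    print(f"{name:6s} {seconds:.5f} s/message ({slowdown:.1f}x word)")
```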
Hope it helps someone.
Greetings,
Ed van der Salm
The Netherlands
Amstelveen
_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user