On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
> Hi all,
>
> (Maybe a monospaced font makes this more readable.)
> There seem to be some questions about which tokenizer to use (for me
> too), so I thought it would be nice to have some statistics.
>
> First, about the setup I've chosen:
> It's a clean install with no training done at all. I've made 6
> directories containing spam and 6 containing ham. I think I read
> somewhere to train 2 ham against 1 spam, so the number of files in
> those directories is:
>
> ham-01:     500
> ham-02:    1000
> ham-03:    1000
> ham-04:    1000
> ham-05:     500
> ham-final:  200
> spam-01:    250
> spam-02:    500
> spam-03:    500
> spam-04:    500
> spam-05:    250
> spam-final: 100
>
> Totaling: 6300 messages, 2100 spam and 4200 ham.
>
> Some other info: algorithm "graham burton", and a MySQL database as
> backend. There were only 55 'recent' spam messages; they came from
> the spam box of my Gmail account. All the other mails were training
> mails found somewhere on the internet, dating from 2003, 2004 and
> 2005. In the final batch there were 10 of the recent spams; the
> other 45 were spread over the other batches.
> This was all done on a KVM virtual machine with 1 CPU and 1 GB of
> memory. No other VMs were running.
> After that I trained using the word, chain, osb and sbph tokenizers.
> I hope this gives me the insight I want.
>
> So, now for the real deal:
>
> Tokenizer / batch:  01     02     03     04     05  final  total  tokens@db
> word:  FP:           0      0      2      2      0      1      5     205234
>        FN:         100     94     31     28     26      3    282
>        Time (s):    37     58     63     70     34     14    276
>
> chain: FP:           0      0      3      2      0      1      6     825549
>        FN:          77     59     10     10     14      3    173
>        Time (s):    46     79     90    111     46     27    399
>
> osb:   FP:           1      1      3      3      0      0      8    2741757
>        FN:          74     73     18     11     13      4    193
>        Time (s):    80    126    218    469    397    142   1432
>
> sbph:  FP:           1      1      2      6      0      0     10   13904366
>        FN:          65     60     10      6     10      3    154
>        Time (s):   544   3272   6843   8936   3532   1348  24475 (6h47m55s)
>
> Using osb my database grew to 299 MB; using sbph it grew to 741 MB.
> The last column shows the number of tokens produced.

Some questions:
1) Which learning method did you use? TEFT? TOE? TUM?
2) Are those batches cumulative, or did you wipe the data after each
   training batch?
3) What preferences do you have in place?
4) Did you do anything in between the training batches (things like
   running dspam_clean etc.)?
5) Did you use DNSBL within DSPAM?
6) Are the spam and ham messages in the same language?
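To make questions 1 and 3 concrete, these are the knobs I mean. A
minimal sketch, assuming a DSPAM 3.9-style dspam.conf (directive names
from memory, so check the comments in your own dspam.conf):

  # Tokenizer and combination algorithm (what you already set):
  Algorithm graham burton
  Tokenizer osb                       # word | chain | osb | sbph

  # Default per-user preferences; trainingMode is what question 1
  # is about:
  Preference "trainingMode=teft"      # teft | toe | tum | notrain
  Preference "spamAction=quarantine"  # quarantine | tag | deliver

With TEFT every single message is trained into the database, which
inflates both the token count and the per-message time; with TOE only
misclassifications are trained, which would change your timing numbers
considerably.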
> What is this all telling me?
>
> That I'm a little disappointed. osb gave me more FPs and FNs than
> the chain tokenizer did.

Usually OSB/SBPH need less corrective training (fewer FPs/FNs) in the
long run, while WORD will constantly require training and CHAIN sits
somewhere between WORD and OSB/SBPH.

> Luckily, in the final batch there were no FPs. That's the one thing
> people can't live with. This also means you will probably need
> around 7500 ham messages and 3500+ (recent) spam messages to get
> proper training. The best training will be (duh) using real mail.

I would not subscribe to that statement. You cannot conclude from your
test that, in general, one needs 7.5K messages to get a decent result.
It all depends on what you train and how you train.

> What more have I learned: if you are using the preferred tokenizer
> (osb, if I followed the mailing list correctly), the time needed to
> process a message increases quite a lot. Looking at the final batch,
> there is an increase from chain to osb of 115 seconds, from 27 to
> 142.

This strongly depends on the training method used. I think you used
TEFT for both tokenizers. Right?

> The biggest wait was using sbph; luckily I went out to jam with the
> band ;). Stupid as I am, I didn't automatically start the final
> batch, so at this moment I am waiting again.
>
> My personal experience training with this mail batch (I've got
> another 3K+ training messages) is not too good :(. Using sbph my db
> filled up my disk; I did not expect it to grow that big. So I
> started using osb, but then I got too many FPs, and as stated
> before: a lot of unhappy faces.
>
> Well, in the final batch there are only two with no FPs: osb and
> sbph. sbph is 'the best', but not really suited for busy systems. Or
> you just buy more power... Maybe I will just train a lot more and
> return to chain. One thing I noticed: the average time per message
> in the final batch (300 messages, so e.g. 1348 s / 300 = 4.49333 s):

SBPH is usually best done using the Hash storage driver. Using an
RDBMS for SBPH is suboptimal.

> word:  0.04667 seconds
> chain: 0.09 seconds
> osb:   0.47333 seconds
> sbph:  4.49333 seconds
>
> Hope it helps someone.
>
> Greetings,
>
> Ed van der Salm
>
> The Netherlands
> Amstelveen

--
Kind Regards from Switzerland,
Stevan Bajić
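P.S. For anyone wondering why the token counts differ so drastically
between the tokenizers (205234 for word vs. 13904366 for sbph): below
is a rough Python sketch of the feature-generation idea behind OSB and
SBPH over a 5-token sliding window. It only illustrates the principle;
DSPAM's actual implementation (hashing, header tokens, exact token
formatting) differs, and the '#' skip marker is just my notation. WORD
emits one token per word and CHAIN adds adjacent two-word chains,
hence the lower counts for those two.

  from itertools import combinations

  WINDOW = 5  # current token plus up to 4 preceding tokens

  def osb_features(tokens):
      # Orthogonal Sparse Bigrams: pair each token with each of the
      # up-to-4 preceding tokens; '#' marks the skipped positions so
      # the distance stays part of the feature.
      feats = []
      for i, tok in enumerate(tokens):
          for d in range(1, WINDOW):
              if i - d >= 0:
                  feats.append(tokens[i - d] + " " + "# " * (d - 1) + tok)
      return feats

  def sbph_features(tokens):
      # Sparse Binary Polynomial Hashing: for every position, emit
      # every subset of the up-to-4 preceding positions combined with
      # the current token -- up to 2**4 = 16 features per position
      # instead of OSB's 4.
      feats = []
      for i, tok in enumerate(tokens):
          prev = list(range(max(0, i - WINDOW + 1), i))
          for r in range(len(prev) + 1):
              for combo in combinations(prev, r):
                  parts = [tokens[j] if j in combo else "#" for j in prev]
                  feats.append(" ".join(parts + [tok]))
      return feats

  words = "buy cheap meds online now".split()
  print(len(words), len(osb_features(words)), len(sbph_features(words)))
  # -> 5 10 31; on longer texts this approaches 4 (OSB) and 16 (SBPH)
  #    features per word, which is why the sbph database explodes.

The database stores unique tokens, so it does not grow by exactly
those factors, but the gap between your osb and sbph token counts
comes from this combinatorial difference.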