Since my previous message really looked like a mess, here's a repost
with the extra info added:

>I installed a virtual machine which I restored after each run.
>'each run' = after 5 batches + the final run
>OR
>'each run' = after switching tokenizer?

I ran all 6300 messages through the training, then I restored the
machine, switched to another tokenizer, and did it again. So the start
was always a clean machine.
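
(Side note for anyone reproducing this: reverting to a snapshot is one
way to get that clean start. A sketch assuming libvirt/KVM; the domain
and snapshot names here are made up:

    virsh snapshot-create-as dspam-test clean   # take the baseline once
    # ... run a full training pass with one tokenizer ...
    virsh snapshot-revert dspam-test clean      # back to the clean machine
)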

>TEFT is well suited for the dull tokenizers like WORD and CHAIN
>(default). When using one of the more intelligent tokenizers
>(i.e. OSB/SBPH), then using TOE is better (TUM would work too).
Should I do the training using TOE?
Ah well, if I am home in time, I will change to TOE and rerun them
all. More numbers are always good!
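
For reference, the learning method is a one-line change in dspam.conf
(this assumes a stock 3.9.x config; TEFT is the shipped default):

    # dspam.conf -- switch the learning method before a rerun
    #TrainingMode teft    # train on everything (the default)
    TrainingMode toe      # train on error only, per the suggestion above
    #TrainingMode tum     # train until mature would work too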

Greetings

@

--- Original message follows ---
SUBJECT: Re: [Dspam-user] Some tokenizer statistics
FROM:  Ed van der Salm
TO:   "Stevan Bajić"
DATE: 05-05-2011 14:14

Since I'm not at the machine (it's at home), for now only the info I
know off-hand.

I installed a virtual machine which I restored after each run. I left
as many settings as possible at their defaults, so the learning method
was TEFT (that's the default, right?) and all other settings were
untouched (apart from using a MySQL db). Since I wanted to check the
dspam tokenizers, I left out all other stuff like DNSBL and AV.
I have not really checked all the spams/hams myself; they came from a
file which could be used for MailScanner training. Looking at the web
interface, it looks like they are all English. It also looked like they
were mail from mailing lists etcetera. If someone has a nice batch of
'real' mail to throw at it, just send me the zip... :)

(It seems like my webmail client doesn't allow 'nice' inline inserts,
so I've put my comments between ------------------------)

On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:

On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:

> Hi all,
>
> (Maybe a monospaced font makes it more readable)
> There seem to be some questions about which tokenizer to use (for me
> too), so I thought it would be nice to have some statistics.
>
> First about the setup I've chosen:
> It's a clean install with no training done at all. I've made 6
> directories containing spam and 6 containing ham. I thought I'd read
> somewhere to train 2 ham against 1 spam, so the number of files in
> those directories is:
> ham-01: 500
> ham-02: 1000
> ham-03: 1000
> ham-04: 1000
> ham-05: 500
> ham-final: 200
> spam-01: 250
> spam-02: 500
> spam-03: 500
> spam-04: 500
> spam-05: 250
> spam-final: 100
> Totaling: 6300 messages, 2100 spam and 4200 ham.
>
> Some other info: the algorithm was graham + burton, with a MySQL
> database as backend. There were only 55 'recent' spam messages; they
> came from my gmail-account spambox. All other mails were training
> mails found somewhere on the internet, dating from 2003, 2004 and
> 2005. In the final batch there were 10 of the recent spams; the other
> 45 were spread over the other batches.
> This was all done on a KVM virtual machine, 1 CPU and 1 GB mem. There
> were no other VMs running.
> After that I trained using word, chain, osb and sbph. I hope this
> gives me the insight I want.
>
> So, now for the real deal:
>
> Tokenizer / batch:  01    02    03    04    05  final  total  tokens@db
> word   FP:           0     0     2     2     0      1      5     205234
>        FN:         100    94    31    28    26      3    282
>        Time (s):    37    58    63    70    34     14    276
>
> chain  FP:           0     0     3     2     0      1      6     825549
>        FN:          77    59    10    10    14      3    173
>        Time (s):    46    79    90   111    46     27    399
>
> osb    FP:           1     1     3     3     0      0      8    2741757
>        FN:          74    73    18    11    13      4    193
>        Time (s):    80   126   218   469   397    142   1432
>
> sbph   FP:           1     1     2     6     0      0     10   13904366
>        FN:          65    60    10     6    10      3    154
>        Time (s):   544  3272  6843  8936  3532   1348  24475 (6h47m55s)
>
> Using osb my database grew to 299 MB; using sbph it grew to 741 MB.
> The last column shows the number of tokens produced.
>
Some questions:
1) What learning method have you used? TEFT? TOE? TUM?
2) Are those batches cumulative or have you wiped the data after each 
training batch?
3) What preferences do you have in place?
4) Have you done anything in between the training batches? (stuff like
using dspam_clean etc.)
5) Have you used DNSBL within DSPAM?
6) Are the SPAM and HAM messages in the same language?

------------------------
I think I answered these questions in my intro above; if you miss
something, tell me and I will look into it. But that will be sometime
late tonight.
(So, for 6: I haven't really looked...)
------------------------

> What is this all telling me...
>
> That I'm a little disappointed: osb gave me more FPs and FNs than the
> chain tokenizer did.
>
Usually OSB/SBPH result in less training (FP/FN) in the long run, while
WORD will constantly require training and CHAIN is somewhere in between
WORD and OSB/SBPH.
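
To make the difference concrete, here is a rough sketch (in Python, not
DSPAM's actual C code; token formats are simplified) of what each
tokenizer emits per message:

    # word:  one token per word
    # chain: words plus adjacent word pairs
    # osb:   sparse pairs inside a 5-token window, skip distance encoded;
    #        SBPH additionally emits every word-combination pattern in
    #        the window, which is why it produced 13.9M tokens vs. word's
    #        205K in the table above
    def word_tok(words):
        return list(words)

    def chain_tok(words):
        return list(words) + [a + " " + b for a, b in zip(words, words[1:])]

    def osb_tok(words, window=5):
        out = []
        for i in range(len(words)):
            for d in range(1, window):
                if i + d < len(words):
                    out.append("%s skip%d %s" % (words[i], d - 1, words[i + d]))
        return out

    msg = "buy cheap meds online now today".split()
    for f in (word_tok, chain_tok, osb_tok):
        print(f.__name__, len(f(msg)))   # 6, 11, 14 tokens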

------------------------
I would agree with that.
The reason for splitting all the messages into batches was to see how
the ongoing training would change the results, and I think the figures
say the same as you do. Just as I expected, btw (though not as big a
difference as I expected).
------------------------

> Luckily in the final batch there were no FPs. That's the one thing
> people can't live with. This also means you will probably need around
> 7500 ham messages and 3500+ (recent) spam messages to get a proper
> training. The best training will be (duh) using real mail.
>
I would not subscribe to that statement. You cannot conclude from your
test that in general one needs 7.5K messages to get a decent result. It
all depends on what you train and how you train.

------------------------
OK, true... As far as I could see, 4.2K ham and 2.1K spam just wasn't
enough. And since I saw better results after more training, I thought
let's just throw out some numbers... If somebody wants all the messages
I used, I can zip them and post them somewhere.
------------------------

> What more have I learned: if you are using the preferred tokenizer
> (osb, if I followed the mailing list right), the time needed to
> process a message increases quite a lot. Looking at the final batch,
> there is an increase from chain to osb of 115 seconds, from 27 to 142.
>
This strongly depends on the training method used. I think you used
TEFT in both cases, right?
------------------------
Yep, I only used TEFT (that's the default, right?). Oh, one thing:
dspam 3.9.1rc1 :)
------------------------

> The biggest wait was using sbph; luckily I went out to jam with the
> band ;). Stupid as I am, I didn't automatically start the final
> batch, so at this moment I am waiting again.
>
> My personal experience training with this mail batch (I've got
> another 3K+ training messages) is not too good :(. Using sbph my db
> filled up my disk; I did not expect it to grow that big. So I started
> using osb, but then I got too many FPs, and as stated before: a lot
> of unhappy faces.
>
> Well, in the final batch there are only two with no FPs: osb and
> sbph. sbph is 'the best', but not really suited for busy systems. Or
> you just buy more power... Maybe I will just train a lot more and
> return to chain. One thing I noticed: the average time per message in
> the final batch (the batch time divided by its 300 messages).
>
>
SBPH is usually best done using the Hash storage driver. Using an RDBMS
for SBPH is suboptimal.

------------------------
I kind of figured that out. But is the hash storage driver always
faster? I used a DB because I wanted to see how many tokens would be
created, so that was mainly for more statistics. If the hash storage
driver is always faster, then a DB is not really useful for standalone
servers, I suppose.
------------------------
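
(For reference, switching to the hash driver is also just a dspam.conf
change; the module path below is an assumption and varies per
distro/build:

    # dspam.conf -- storage backend selection
    #StorageDriver /usr/lib/dspam/libmysql_drv.so
    StorageDriver /usr/lib/dspam/libhash_drv.so
)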

> word: 0.04667 seconds
> chain: 0.09 seconds
> osb: 0.47333 seconds
> sbph: 4.49333 seconds
>
> Hope it helps someone.
>
> Greetings,
>
> Ed van der Salm
>
> The Netherlands
> Amstelveen

-- 
Kind Regards from Switzerland,

Stevan Bajić

------------------------
Greetings!
@
------------------------
