On Thu, 05 May 2011 14:14:30 +0200, Ed van der Salm wrote:

> Since I'm not behind the machine (it's at home), for now only the info
> I know.
>
 Okay.


> I installed a virtual machine, which I restored after each run.
>
 'each run' = after 5 batches + the final run

 OR

 'each run' = after switching tokenizer?


> I left as many settings as possible at default, so the learning method
> was TEFT (that's the default, right?)
>
 Yes. TEFT is the default. TEFT is well suited for the dull tokenizers
 like WORD and CHAIN (the default). When using one of the more
 intelligent tokenizers (i.e. OSB/SBPH), TOE is the better choice (TUM
 would work too).
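
 If it helps for the next run: both knobs live in dspam.conf (in 3.9.x).
 This is only a minimal sketch of the idea, not a recommendation; the
 values are just an illustration, see the comments in your shipped
 dspam.conf:

   Tokenizer osb                    # word | chain | osb | sbph
   Preference "trainingMode=TOE"    # TOE | TUM | TEFT | NOTRAIN

 With TEFT every message updates the token data; with TOE only the
 misclassified ones do, which is also why the training mode has such a
 big impact on per-message time.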


> and all other settings were untouched (apart from using a MySQL DB).
> Since I wanted to check the dspam tokenizers, I left out all other
> stuff like DNSBL and AV.
>
 Okay.


> I have not really checked (myself, I mean) all the spams/hams; they
> came from a file which could be used for MailScanner training. Looking
> at the web interface, it looks like they are all English. It also
> looked like they were mail from mailing lists etcetera. If someone has
> a nice batch of 'real' mail to throw at it, just send me the zip... :)
>
 Then addressing this message to the DSPAM mailing list would be
 helpful. I think you wrote the message just to me by mistake.


> (It seems like my webmail client doesn't allow 'nice' inserts, so I've
> put my comments between ------------------------)
>
 Okay.



> On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:
>
>> On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
>>
>>> Hi all,
>>>
>>> (Maybe a monospaced font makes it more readable.)
>>> There seem to be some questions about which tokenizer to use (for me
>>> too), so I thought it would be nice to have some statistics.
>>>
>>> First, about the setup I've chosen:
>>> It's a clean install with no training done at all. I've made 6
>>> directories containing spam and 6 containing ham. I thought I'd read
>>> somewhere to train 2 ham against 1 spam, so in those directories the
>>> numbers of files are:
>>> ham-01: 500
>>> ham-02: 1000
>>> ham-03: 1000
>>> ham-04: 1000
>>> ham-05: 500
>>> ham-final: 200
>>> spam-01: 250
>>> spam-02: 500
>>> spam-03: 500
>>> spam-04: 500
>>> spam-05: 250
>>> spam-final: 100
>>> Totaling: 6300 messages, 2100 spam and 4200 ham.
>>>
>>> Some other info: Algorithm graham burton, and a MySQL database as
>>> backend. There were only 55 'recent' spam messages; they came from my
>>> Gmail account's spam box. All other mails were training mails found
>>> somewhere on the internet, dating from 2003, 2004 and 2005. In the
>>> final batch there were 10 of the recent spams; the other 45 were
>>> spread across the other batches.
>>> This was all done on a KVM virtual machine, 1 CPU and 1 GB of memory.
>>> There were no other VMs running.
>>> After that I trained using word, chain, osb and sbph. I hope this
>>> gives me the insight I want.
>>>
>>> So, now for the real deal:
>>>
>>> Tokenizer / batch:    01    02    03    04    05  final  total  tokens@db
>>> word   FP:             0     0     2     2     0      1      5     205234
>>>        FN:           100    94    31    28    26      3    282
>>>        Time (sec):    37    58    63    70    34     14    276
>>>
>>> chain  FP:             0     0     3     2     0      1      6     825549
>>>        FN:            77    59    10    10    14      3    173
>>>        Time (sec):    46    79    90   111    46     27    399
>>>
>>> osb    FP:             1     1     3     3     0      0      8    2741757
>>>        FN:            74    73    18    11    13      4    193
>>>        Time (sec):    80   126   218   469   397    142   1432
>>>
>>> sbph   FP:             1     1     2     6     0      0     10   13904366
>>>        FN:            65    60    10     6    10      3    154
>>>        Time (sec):   544  3272  6843  8936  3532   1348  24475 (6h47m55s)
>>>
>>> Using osb my database grew to 299 MB. Using sbph my database grew to
>>> 741 MB. The last column shows the number of tokens produced.
>>>
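 Just to put your token counts and database sizes next to each other
 (rough figures only, reading Mb as MiB and ignoring index and row
 overhead):

   osb:  299 MiB / 2,741,757 tokens  = ~114 bytes per token
   sbph: 741 MiB / 13,904,366 tokens = ~56 bytes per token
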
>> Some questions:
>> 1) What learning method have you used? TEFT? TOE? TUM?
>> 2) Are those batches cumulative, or have you wiped the data after
>> each training batch?
>> 3) What preferences do you have in place?
>> 4) Have you done anything in between the training batches? (stuff
>> like running dspam_clean etc.)
>> 5) Have you used DNSBL within DSPAM?
>> 6) Are the SPAM and HAM messages in the same language?
>>
>> ------------------------
>> I think I answered these questions in my intro above; if you're
>> missing something, tell me, then I will look into that. But that will
>> be sometime late tonight.
>> (So, for 6: I haven't really looked...)
>> ------------------------
>>
 Yes. You have done that in the above response.


>>> What is this all telling me...
>>>
>>> That I'm a little disappointed. osb gave me more FPs and FNs than
>>> the chain tokenizer did.
>>>
>> Usually OSB/SBPH result in less training (FP/FN) in the long run,
>> while WORD will constantly require training and CHAIN is somewhere in
>> between WORD and OSB/SBPH.
>>
>> ------------------------
>> I would agree with that.
>> The reason for splitting all messages into batches was to see how the
>> ongoing training would change the results. And I think the figures say
>> the same thing you do. Just like I expected, btw (but not as big a
>> difference as I expected).
>> ------------------------
>>
>>> Luckily, in the final batch there were no FPs. That's the one thing
>>> people can't live with. This also means you will probably need around
>>> 7500 ham messages and 3500+ (recent) spam messages to get proper
>>> training. The best training will be (duh) using real mail.
>>>
>> I would not subscribe to that statement. You cannot conclude from
>> your test that in general one needs 7.5K messages to get a decent
>> result. It all depends on what you train and how you train.
>>
>> ------------------------
>> OK, true... As far as I could see, 4.2K ham and 2.1K spam just wasn't
>> enough. And since I saw better results after more training, I thought
>> let's just throw out some numbers... If somebody wants all the
>> messages I used, I can zip them and post them somewhere.
>> ------------------------
>>
>>> What more have I learned: if you are using the preferred tokenizer
>>> (osb, if I followed the mailing list right), the time needed to
>>> process a message increases quite a bit. Looking at the final batch,
>>> there is an increase from chain to osb of 115 seconds, from 27 to 142.
>>>
>> This strongly depends on the training method used. I think you used
>> TEFT with both tokenizers. Right?
>> ------------------------
>> Yep, I only used TEFT (that's the default, right?). Oh, one thing:
>> dspam 3.9.1rc1 :)
>> ------------------------
>>
 Yes. TEFT is the default.


>>> The biggest wait was using sbph; luckily I went out to jam with the
>>> band ;). Stupid as I am, I didn't automatically start the final
>>> batch, so at this moment I am waiting again.
>>>
>>> My personal experience training with this mail batch (I've got
>>> another 3K+ training messages) is not too good :(. Using sbph my DB
>>> filled up my disk. I did not expect it to grow that big. So, I
>>> started using osb, but then I got too many FPs, and as stated before:
>>> a lot of unhappy faces.
>>>
>>> Well, in the final batch there are only two with no FPs, the osb and
>>> sbph. sbph is 'the best' but not really suited for busy systems. Or
>>> you just buy more power... Maybe I will just train a lot more and
>>> return to chain. One thing I noticed: average time per message in the
>>> final batch.
>>>
>> SBPH is usually best done using the Hash storage driver. Using an
>> RDBMS for SBPH is suboptimal.
>>
>> ------------------------
>> I kind of figured that out. But is the hash storage driver always
>> faster? I used a DB because I wanted to see how many tokens would be
>> created. So that was mainly for more statistics. If the hash storage
>> driver is always faster, then a DB is not really useful for standalone
>> servers, I suppose.
>> ------------------------
>>
 Hmm... technically the Hash driver is more or less nothing other than
 a memory-mapped file. An RDBMS is way more than just that. So you can
 draw your own conclusion as to which one should technically be faster.
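
 If you want to repeat the SBPH run with the Hash driver, the switch is
 basically the StorageDriver line in dspam.conf. This is only a rough
 sketch: the library path depends on your install prefix, and the Hash*
 values below are simply what I recall as the shipped defaults, so check
 your own dspam.conf:

   # what the test used:
   #StorageDriver /usr/lib/dspam/libmysql_drv.so

   # memory-mapped hash backend instead:
   StorageDriver  /usr/lib/dspam/libhash_drv.so
   HashRecMax     98317    # initial hash table size (records)
   HashAutoExtend on       # grow the file instead of failing when full

 The trade-off is that you lose the easy SQL queries you used for the
 token statistics.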


>>> word:  0.04667 seconds
>>> chain: 0.09    seconds
>>> osb:   0.47333 seconds
>>> sbph:  4.49333 seconds
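
 (For reference: those averages are simply the final-batch totals
 divided by the 300 messages in that batch, 200 ham + 100 spam:
 14/300 = 0.04667 s, 27/300 = 0.09 s, 142/300 = 0.47333 s and
 1348/300 = 4.49333 s per message.)
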
>>>
>>> Hope it helps someone.
>>>
>>> Greetings,
>>>
>>> Ed van der Salm
>>>
>>> The Netherlands
>>> Amstelveen
>>
>> --
>> Kind Regards from Switzerland,
>>
>> Stevan Bajić
>>
>> ------------------------
>> Greetings!
>> @
>> ------------------------
>>
>>
>

-- 
 Kind Regards from Switzerland,

 Stevan Bajić
 

