On Thu, 05 May 2011 15:14:56 +0200, Ed van der Salm wrote:
> Since it really looked like a mess, a repost, with the extra info
> added:
>
>> I installed a virtual machine which I restored after each run.
>> 'each run' = after 5 batches + the final run
>> OR
>> 'each run' = after switching tokenizer?
>
> I ran all 6300 messages through the training, then I restored the
> machine to use another tokenizer and did it again. So the start was
> always a clean machine.
>
Thanks for the clarification.

>> TEFT is well suited for the dull tokenizers like WORD and CHAIN
>> (default). When using one of the more intelligent tokenizers
>> (aka: OSB/SBPH), then using TOE is better (TUM would work too).
>
> Should I do the training using TOE?
>
For OSB and SBPH you could try TOE. That should deliver better results
than TEFT in the long run.

> Ah well, if I am home in time, I will change to TOE and rerun them
> all.
>
More numbers are always good!

If you want, you could send me the training data and I will do the
tests with my own training method and then post the results.

One thing you could do as well: after you have done the whole training,
run a classification pass over all 5 sets, including the final set, and
record how many FP/FN you get. This would show how well the training
did in regard to classifying the same message set AFTER the whole
training.

Another test you could do is the same training as now, and after you
are finished with the training you switch the two classes. So AFTER the
whole training you declare every SPAM message to be HAM and every HAM
message to be SPAM. Then you do the training again and look at how
quickly the tokenizer is able to switch the tokens the other way
around. You run that learning until ALL messages are correctly
classified (aka: 0 FP and 0 FN). A good learning algorithm will not
need much time/training to switch, while a bad algorithm will need a
lot of retraining. This kind of test is where TOE shines (compared to
TEFT). TOE will need much less training, while some messages trained
with TEFT will need an insane amount of re-training until they switch
their class. (A rough sketch of this test follows below.)

> Greetings
>
> @
>
--
Kind Regards from Switzerland,

Stevan Bajić
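A minimal sketch (in Python) of the class-switching test described
above, assuming a hypothetical filter object with train() and
classify() methods; this is not DSPAM's actual API, only an
illustration of the procedure:

    # Class-switching test: after a normal training run, relabel every
    # SPAM as HAM and vice versa, then keep retraining on the errors
    # until the whole corpus is classified correctly again (0 FP, 0 FN).
    # The number of passes needed measures how quickly the tokenizer /
    # training mode can "unlearn" the old classes.
    # NOTE: `filter_` with train()/classify() is a hypothetical stand-in,
    # not the real DSPAM interface.

    def passes_needed_to_switch(filter_, corpus, max_passes=100):
        """corpus: list of (message, label) pairs, labels 'spam'/'ham'."""
        flipped = [(msg, 'ham' if label == 'spam' else 'spam')
                   for msg, label in corpus]
        for passes in range(1, max_passes + 1):
            errors = 0
            for msg, wanted in flipped:
                if filter_.classify(msg) != wanted:
                    filter_.train(msg, wanted)  # TOE-style: train on error only
                    errors += 1
            if errors == 0:                     # clean pass: fully switched
                return passes
        return None                             # did not converge

The loop retrains on errors only (TOE-style); the point made above is
that a corpus originally trained with TEFT will typically need many
more of these corrective passes before the last messages flip than one
trained with TOE.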
> --- Original message follows ---
> SUBJECT: Re: [Dspam-user] Some tokenizer statistics
> FROM: Ed van der Salm
> TO: "Stevan Bajić"
> DATE: 05-05-2011 14:14
>
> Since I'm not behind the machine (it's at home), for now only the
> info I know.
>
> I installed a virtual machine which I restored after each run. I left
> as many settings as possible at default, so the learning method was
> TEFT (that's the default, right?) and all other settings were
> untouched (apart from using a MySQL db). Since I wanted to check the
> dspam tokenizers, I left out all other stuff like DNSBL and AV.
>
> I have not really checked (myself, I mean) all the spams/hams; they
> came from a file which could be used for MailScanner training.
> Looking at the web interface, it looks like they are all English. It
> also looked like they were mail from mailing lists, et cetera. If
> someone has a nice batch of 'real' mail to throw at it, just send me
> the zip... :)
>
> (It seems like my webmail client doesn't allow 'nice' inserts, so
> I've put my comments between ------------------------)
>
> On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:
>
>> On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
>>
>>> Hi all,
>>>
>>> (Maybe a monospaced font makes it more readable.)
>>> There seem to be some questions about which tokenizer to use (for
>>> me too), so I thought it would be nice to have some statistics.
>>>
>>> First, about the setup I've chosen:
>>> It's a clean install with no training done at all. I've made 6
>>> directories containing spam and 6 containing ham. I thought I'd
>>> read somewhere to train 2 ham against 1 spam, so the number of
>>> files in those directories is:
>>>
>>> ham-01:      500
>>> ham-02:     1000
>>> ham-03:     1000
>>> ham-04:     1000
>>> ham-05:      500
>>> ham-final:   200
>>> spam-01:     250
>>> spam-02:     500
>>> spam-03:     500
>>> spam-04:     500
>>> spam-05:     250
>>> spam-final:  100
>>>
>>> Totaling: 6300 messages, 2100 spam and 4200 ham.
>>>
>>> Some other info: Algorithm graham burton, and a MySQL database as
>>> backend. There were only 55 'recent' spam messages; they came from
>>> my gmail account's spam box. All other mails were training mails
>>> found somewhere on the internet, dating from 2003, 2004 and 2005.
>>> In the final batch there were 10 of the recent spams; the other 45
>>> were spread over the other batches.
>>> This was all done on a KVM virtual machine, 1 CPU and 1 GB mem.
>>> There were no other VMs running.
>>> After that I trained using word, chain, osb and sbph. I hope this
>>> gives me the insight I want.
>>>
>>> So, now for the real deal:
>>>
>>> Token / batch:   01    02    03    04    05  final  total  tokens@db
>>> word:  FP:        0     0     2     2     0      1      5     205234
>>>        FN:      100    94    31    28    26      3    282
>>>        Time:     37    58    63    70    34     14    276 sec
>>>
>>> chain: FP:        0     0     3     2     0      1      6     825549
>>>        FN:       77    59    10    10    14      3    173
>>>        Time:     46    79    90   111    46     27    399 sec
>>>
>>> osb:   FP:        1     1     3     3     0      0      8    2741757
>>>        FN:       74    73    18    11    13      4    193
>>>        Time:     80   126   218   469   397    142   1432 sec
>>>
>>> sbph:  FP:        1     1     2     6     0      0     10   13904366
>>>        FN:       65    60    10     6    10      3    154
>>>        Time:    544  3272  6843  8936  3532   1348   6h47m55s
>>>
>>> Using osb my database grew to 299 MB. Using sbph my database grew
>>> to 741 MB. The last column shows the number of tokens produced.
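As an aside, the totals above are easier to compare when normalized
against the corpus composition (4200 ham, 2100 spam). A small
illustrative Python snippet, using only the numbers posted in the
table:

    # Total FP/FN per tokenizer, copied from the table above.
    # FP is counted against the 4200 ham messages, FN against the 2100 spam.
    totals = {'word': (5, 282), 'chain': (6, 173),
              'osb': (8, 193), 'sbph': (10, 154)}
    HAM, SPAM = 4200, 2100

    for tokenizer, (fp, fn) in totals.items():
        print(f"{tokenizer:5s}  FP rate: {fp / HAM:.2%}   FN rate: {fn / SPAM:.2%}")

On this corpus that works out to FP rates between roughly 0.1% and
0.25% and FN rates between roughly 7% and 13.5%.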
>>>
>> Some questions:
>> 1) What learning method have you used? TEFT? TOE? TUM?
>> 2) Are those batches cumulative or have you wiped the data after
>>    each training batch?
>> 3) What preferences do you have in place?
>> 4) Have you done anything in between the training batches? (stuff
>>    like using dspam_clean etc.)
>> 5) Have you used DNSBL within DSPAM?
>> 6) Are the SPAM and HAM messages in the same language?
>>
>> ------------------------
>> I think I answered these questions in my intro above; if you miss
>> something, tell me, then I will look into that. But that will be
>> sometime late tonight.
>> (So, for 6: I haven't really looked...)
>> ------------------------
>>
>>> What is this all telling me...
>>>
>>> That I'm a little disappointed. osb gave me more FPs and FNs than
>>> the chain tokenizer did.
>>>
>> Usually OSB/SBPH result in less training (FP/FN) in the long run,
>> while WORD will constantly require training and CHAIN is somewhere
>> in between WORD and OSB/SBPH.
>>
>> ------------------------
>> I would agree with that.
>> The reason for splitting all messages into batches was to see how
>> the ongoing training would change the results. And I think the
>> figures say the same as you. Just like I expected, btw (but not as
>> big a difference as I expected).
>> ------------------------
>>
>>> Luckily in the final batch there were no FPs. That's the one thing
>>> people can't live with. This also means you will probably need
>>> around 7500 ham messages and 3500+ (recent) spam messages to get a
>>> proper training. The best training will be (duh) using real mail.
>>>
>> I would not underwrite that statement. You cannot conclude from your
>> test that in general one needs 7.5K messages to get a decent result.
>> It all depends on what you train and how you train.
>>
>> ------------------------
>> OK, true... As far as I could see, 4.2K ham and 2.1K spam just
>> wasn't enough. And since I saw better results after more training, I
>> thought let's just shout some numbers... If somebody wants all the
>> messages I used, I can zip them and post them somewhere.
>> ------------------------
>>
>>> What more have I learned: if you are using the preferred tokenizer
>>> (osb, if I followed the mailing list right), the time needed to
>>> process a message increases quite a lot. Looking at the final batch
>>> there is an increase from chain to osb of 115 seconds, 27 to 142.
>>>
>> This strongly depends on the training method used. I think you used
>> TEFT on both algorithms. Right?
>>
>> ------------------------
>> Yep, I only used TEFT (that's the default, right?). Oh, one thing:
>> dspam 3.9.1rc1 :)
>> ------------------------
>>
>>> The biggest wait was using sbph; luckily I went out to jam with the
>>> band ;). Stupid as I am, I didn't automatically start the final
>>> batch, so at this moment I am waiting again.
>>>
>>> My personal experience training with this mail batch (I've got
>>> another 3K+ training messages) is not too good :(. Using sbph my db
>>> filled up my disk. I did not expect it to grow that big. So I
>>> started using osb, but then I got too many FPs, and as stated
>>> before: a lot of unhappy faces.
>>>
>>> Well, in the final batch there are only two with no FPs: osb and
>>> sbph. sbph is 'the best' but not really suited for busy systems. Or
>>> you just buy more power... Maybe I will just train a lot more and
>>> return to chain. One thing I noticed: average time per message in
>>> the final batch.
>>>
>> SBPH is usually best done using the Hash storage driver. Using an
>> RDBMS for SBPH is sub-optimal.
>>
>> ------------------------
>> I kind of figured that out. But is the hash storage driver always
>> faster? I used a DB because I wanted to see how many tokens would be
>> created, so that was mainly for more statistics. If the hash storage
>> driver is always faster, then a DB is not really useful for
>> standalone servers, I suppose.
>> ------------------------
>>
>>> word:  0.04667 seconds
>>> chain: 0.09 seconds
>>> osb:   0.47333 seconds
>>> sbph:  4.49333 seconds
>>>
>>> Hope it helps someone.
>>>
>>> Greetings,
>>>
>>> Ed van der Salm
>>>
>>> The Netherlands
>>> Amstelveen
>>
>> --
>> Kind Regards from Switzerland,
>>
>> Stevan Bajić
>>
>> ------------------------
>> Greetings!
>> @
>> ------------------------
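A small footnote to the per-message averages quoted above: they follow
from the final batch containing 300 messages (200 ham + 100 spam), and
the same table gives a rough tokens-per-message figure that bears on
the remark about SBPH and an RDBMS backend. An illustrative Python
calculation, using only numbers posted earlier in the thread:

    # Final-batch wall-clock times (seconds) and total tokens in the DB,
    # taken from the table earlier in the thread. The final batch holds
    # 300 messages (200 ham + 100 spam); the whole corpus holds 6300.
    final_secs   = {'word': 14, 'chain': 27, 'osb': 142, 'sbph': 1348}
    tokens_in_db = {'word': 205234, 'chain': 825549,
                    'osb': 2741757, 'sbph': 13904366}

    for tok in final_secs:
        per_msg_time   = final_secs[tok] / 300     # matches the averages above
        tokens_per_msg = tokens_in_db[tok] / 6300  # rough token load per message
        print(f"{tok:5s}  {per_msg_time:.5f} s/msg   ~{tokens_per_msg:.0f} tokens/msg")

That works out to roughly 33 tokens per message for word versus roughly
2200 for sbph, which goes some way toward explaining the remark that an
RDBMS is a sub-optimal backend for SBPH.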