Hi,

The files I used: http://www.vander-salm.nl/downloads/retrain-all.tar.bz2
I added my latest dspam.conf as well. The logfile contains the latest results.

I have run the osb tokenizer using TOE, and the results of the first pass were
identical to the earlier TEFT run, even the number of tokens in the database.
First run:
Token / part:   01   02   03   04   05  final  total  tokens@db
osb  FP:         1    1    3    3    0      0      8    2741757
     FN:        74   73   18   11   13      4    193
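
For anyone who doesn't want to dig through the tarball: a pass like the one
above boils down to roughly the sketch below. It is only an outline, not the
actual script from the tar.bz2; the user name and corpus paths are
placeholders, and the parsing of dspam's summary line may need adjusting for
your dspam version.

#!/usr/bin/env python3
# Rough outline of one classify-then-retrain pass over the batches.
# "testuser" and /srv/corpus are placeholders; the real script is in
# retrain-all.tar.bz2 and may differ in the details.
import glob
import subprocess

USER = "testuser"
CORPUS = "/srv/corpus"
BATCHES = ["01", "02", "03", "04", "05", "final"]

def classify(path):
    # Classification only, no training; read the verdict from the summary.
    # The exact X-DSPAM-Result format may vary per dspam version.
    with open(path, "rb") as msg:
        out = subprocess.run(["dspam", "--user", USER, "--classify",
                              "--deliver=summary"],
                             stdin=msg, capture_output=True, text=True).stdout
    return "spam" if 'result="Spam"' in out else "innocent"

def train(path, cls):
    # Feed the message back in as corpus training for the wanted class.
    with open(path, "rb") as msg:
        subprocess.run(["dspam", "--user", USER, "--class=" + cls,
                        "--source=corpus"], stdin=msg, capture_output=True)

for batch in BATCHES:
    fp = fn = 0
    for wanted, prefix in (("spam", "spam-"), ("innocent", "ham-")):
        for path in sorted(glob.glob(f"{CORPUS}/{prefix}{batch}/*")):
            if classify(path) != wanted:
                if wanted == "innocent":
                    fp += 1          # ham classified as spam
                else:
                    fn += 1          # spam classified as ham
                train(path, wanted)  # only the misses get (re)trained
    print(f"batch {batch}: FP={fp} FN={fn}")

Each further pass (the second and third runs below) is just the same loop
again, without clearing the database in between.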

After creating the tarball above, I retrained using the same script, and this
time the results were different:
Second run osb, not clearing the database:
Token / part:   01   02   03   04   05  final  total  tokens@db
osb  FP:         1    0    0    1    0      0      2    2741981
     FN:         2    4    2    1    1      1     11
This is better; still one FN in the final 100 spams, though.

Third run osb, not clearing the database:
Token / part:   01   02   03   04   05  final  total  tokens@db
osb  FP:         1    0    0    1    0      0      2    2742036
     FN:         1    2    0    0    1      0      4

OK, 6 misses (2 FP, 4 FN) out of 6300 mails. Makes me wonder which mails are so
problematic. And yes, the final batch had no misses.
I will upload the last logfile: http://www.vander-salm.nl/downloads/logfile.txt.

The time needed for the last run was about 30% less than for the second run.

More tomorrow! I think...

Greetings,
Ed.


On Thursday, 05-05-2011 at 15:32, Stevan Bajić wrote:
> On Thu, 05 May 2011 15:14:56 +0200, Ed van der Salm wrote:
> 
> > Since it really looked like a mess, a repost, with the extra info 
> > added:
> >
> >> I installed a virtual machine which I restored after each run.
> >> 'each run' = after 5 batches + the final run
> >> OR
> >> 'each run' = after switching tokenizer?
> >
> > I ran all 6300 messages through the training, then I restored the
> > machine to use another tokenizer and did it again. So the start was
> > always a clean machine.
> >
>  Thanks for the clarification.
> 
> 
> >> TEFT is well suited for the dull tokenizers like WORD and CHAIN
> >> (the default). When using one of the more intelligent tokenizers
> >> (i.e. OSB/SBPH), TOE is better (TUM would work too).
> > Should I do the training using TOE?
> >
>  For OSB and SBPH you could try TOE. In the long run that should
>  deliver better results than TEFT.
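------------------------
For reference, the switch boils down to a couple of dspam.conf directives
(directive names as in 3.9.x; check your own config if you run something
else):

Tokenizer osb
TrainingMode toe
Algorithm graham burton
------------------------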
> 
>  
> > Ah well, if I am home in time, I will change to TOE and rerun them
> > all. More numbers are always good!
> >
>  If you want you could send me the training data and I will do the tests 
>  with my own training method and then post the results.
> 
>  One thing you could do as well: after you have done the whole
>  training, run a classification pass over all 5 sets (include the
>  final set too) and record how many FP/FN you get. This would show
>  how good the training was in regard to classifying the same message
>  set AFTER the whole training.
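------------------------
That post-training pass should be easy to script; something like the sketch
below is what I have in mind (same placeholder user/paths as in the sketch
near the top of this mail, and again the parsing of the summary line is an
assumption on my side):

import glob, subprocess

USER, CORPUS = "testuser", "/srv/corpus"

def classify(path):
    # classification only, no training
    with open(path, "rb") as msg:
        out = subprocess.run(["dspam", "--user", USER, "--classify",
                              "--deliver=summary"],
                             stdin=msg, capture_output=True, text=True).stdout
    return "spam" if 'result="Spam"' in out else "innocent"

for batch in ["01", "02", "03", "04", "05", "final"]:
    fp = sum(classify(p) == "spam"
             for p in glob.glob(f"{CORPUS}/ham-{batch}/*"))
    fn = sum(classify(p) == "innocent"
             for p in glob.glob(f"{CORPUS}/spam-{batch}/*"))
    print(f"post-training {batch}: FP={fp} FN={fn}")
------------------------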
> 
>  Another test you could do: do the same training as now, and after
>  you are finished with the training you switch the two classes.
>  So AFTER the whole training you declare every SPAM message to be HAM
>  and every HAM message to be SPAM. Then you do the training again and
>  look at how quickly the tokenizer is able to switch the tokens the
>  other way around. You run that learning until ALL messages are
>  correctly classified (i.e. 0 FP and 0 FN). A good learning algorithm
>  will not need much time/training to switch, while a bad algorithm
>  will need a lot of training to switch. This kind of test is where TOE
>  shines (compared to TEFT). TOE will need much less training, while
>  some messages trained with TEFT will need an insane amount of
>  re-training until they switch their class.
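------------------------
If I get that far, the class-switch test would look roughly like this
(again with placeholder user/paths; whether --source=corpus or
--source=error with the stored signature is the proper way to correct a
miss here is something I still have to check):

import glob, subprocess

USER, CORPUS = "testuser", "/srv/corpus"

def dspam(extra_args, path):
    with open(path, "rb") as msg:
        return subprocess.run(["dspam", "--user", USER] + extra_args,
                              stdin=msg, capture_output=True, text=True).stdout

# After the normal training, swap the classes: every former spam message
# should now end up as ham and vice versa. Keep retraining the misses until
# a whole pass comes back clean; the number of passes/retrains needed is
# the figure of merit.
swapped = [(p, "innocent") for p in glob.glob(f"{CORPUS}/spam-*/*")] + \
          [(p, "spam") for p in glob.glob(f"{CORPUS}/ham-*/*")]

passno = 0
while True:
    passno += 1
    misses = 0
    for path, wanted in swapped:
        out = dspam(["--classify", "--deliver=summary"], path)
        got = "spam" if 'result="Spam"' in out else "innocent"
        if got != wanted:
            misses += 1
            dspam(["--class=" + wanted, "--source=corpus"], path)
    print(f"switch pass {passno}: {misses} messages still on the old class")
    if misses == 0:
        break
------------------------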
> 
> 
> > Greetings
> >
> > @
> >
> -- 
>  Kind Regards from Switzerland,
> 
>  Stevan Bajić
> 
> 
> > --- Original message follows ---
> > SUBJECT: Re: [Dspam-user] Some tokenizer statistics
> > FROM: Ed van der Salm
> > TO: "Stevan Bajić"
> > DATE: 05-05-2011 14:14
> >
> > Since I'm not behind the machine (it's at home), for now only the
> > info I know.
> >
> > I installed a virtual machine which I restored after each run. I left
> > as many settings as possible at their defaults, so the learning method
> > was TEFT (that's the default, right?) and all other settings were
> > untouched (apart from using a MySQL db). Since I wanted to check the
> > dspam tokenizers, I left out all other stuff like DNSBL and AV.
> > I have not really checked all spams/hams myself; they came from a
> > file which could be used for MailScanner training. Looking at the
> > web interface, it looks like they are all English. It also looked
> > like they were mail from mailing lists et cetera. If someone has a
> > nice batch of 'real' mail to throw at it, just send me the zip... :)
> >
> > (It seems like my webmail-client doesn't allow 'nice' inserts, so 
> > I've
> > put my comments between ------------------------)
> >
> > On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:
> >
> >> On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
> >>
> >>> Hi all,
> >>>
> >>> (Maybe a monospaced font makes it more readable)
> >>> There seem to be some questions about which tokenizer to use (for
> >>> me too), so I thought it would be nice to have some statistics.
> >>>
> >>> First about the setup I've chosen:
> >>> It's a clean install with no training done at all. I've made 6
> >>> directories containing spam and 6 containing ham. I thought I'd read
> >>> somewhere to train 2 ham against 1 spam, so the numbers of files in
> >>> those directories are:
> >>> ham-01: 500
> >>> ham-02: 1000
> >>> ham-03: 1000
> >>> ham-04: 1000
> >>> ham-05: 500
> >>> ham-final: 200
> >>> spam-01: 250
> >>> spam-02: 500
> >>> spam-03: 500
> >>> spam-04: 500
> >>> spam-05: 250
> >>> spam-final: 100
> >>> Totaling: 6300 messages, 2100 spam and 4200 ham.
> >>>
> >>> Some other info: Algorithm graham burton, and a MySQL database as
> >>> backend. There were only 55 'recent' spam messages; they came from
> >>> my Gmail account's spam box. All other mails were training mails
> >>> found somewhere on the internet, dating from 2003, 2004 and 2005.
> >>> In the final batch there were 10 of the recent spams; the other 45
> >>> were spread over the other batches.
> >>> This was all done on a KVM virtual machine, 1 CPU and 1 GB mem.
> >>> There were no other VMs running.
> >>> After that I trained using word, chain, osb and sbph. I hope this
> >>> gives me the insight I want.
> >>>
> >>> So, now for the real deal:
> >>>
> >>> Token / batch:    01    02    03    04    05  final  total  tokens@db
> >>> word   FP:         0     0     2     2     0      1      5     205234
> >>>        FN:       100    94    31    28    26      3    282
> >>>        Time:      37    58    63    70    34     14    276 sec
> >>>
> >>> chain  FP:         0     0     3     2     0      1      6     825549
> >>>        FN:        77    59    10    10    14      3    173
> >>>        Time:      46    79    90   111    46     27    399 sec
> >>>
> >>> osb    FP:         1     1     3     3     0      0      8    2741757
> >>>        FN:        74    73    18    11    13      4    193
> >>>        Time:      80   126   218   469   397    142   1432 sec
> >>>
> >>> sbph   FP:         1     1     2     6     0      0     10   13904366
> >>>        FN:        65    60    10     6    10      3    154
> >>>        Time:     544  3272  6843  8936  3532   1348  24475 sec (6h47m55s)
> >>>
> >>> Using osb my database grew to 299 MB. Using sbph it grew to 741 MB.
> >>> The last column shows the number of tokens produced.
> >>>
> >> Some questions:
> >> 1) What learning method have you used? TEFT? TOE? TUM?
> >> 2) Are those batches cumulative or have you wiped the data after 
> >> each
> >> training batch?
> >> 3) What preferences have you in place?
> >> 4) Have you done anything in between the training batches? (stuff 
> >> like
> >> using dspam_clean etc)
> >> 5) Have you used DNSBL within DSPAM?
> >> 6) Are the SPAM and HAM messages in the same language?
> >>
> >> ------------------------
> >> I think I answered these questions in my intro above; if something is
> >> missing, tell me and I will look into it. But that will be sometime
> >> late tonight.
> >> (So, for 6: I haven't really looked...)
> >> ------------------------
> >>
> >>> What is this all telling me...
> >>>
> >>> That I'm a little disappointed. osb gave me more FPs and FNs than
> >>> the chain tokenizer did.
> >>>
> >> Usually OSB/SBPH result in less training (FP/FN) in the long run,
> >> while WORD will constantly require training and CHAIN is somewhere
> >> in between WORD and OSB/SBPH.
> >>
> >> ------------------------
> >> I would agree with that.
> >> The reason for splitting all messages into batches was to see how the
> >> ongoing training would change the results. And I think the figures say
> >> the same as you do. Just like I expected, btw (but not as big a
> >> difference as I expected).
> >> ------------------------
> >>
> >>> Luckily in the final batch there were no FPs. That's the one thing
> >>> people can't live with. This also means you will probably need
> >>> around 7500 ham messages and 3500+ (recent) spam messages to get a
> >>> proper training. The best training will be (duh) using real mail.
> >>>
> >> I would not subscribe to that statement. You cannot conclude from
> >> your test that in general one needs 7.5K messages to get a decent
> >> result. It all depends on what you train and how you train.
> >>
> >> ------------------------
> >> OK, true... As far as I could see, 4.2K ham and 2.1K spam just wasn't
> >> enough. And since I saw better results after more training, I thought
> >> let's just throw out some numbers... If somebody wants all the
> >> messages I used, I can zip them and post them somewhere.
> >> ------------------------
> >>
> >>> What more have I learned: if you are using the preferred tokenizer
> >>> (osb, if I followed the mailing list right), the time needed to
> >>> process a message increases quite a lot. Looking at the final batch,
> >>> there is an increase from chain to osb of 115 seconds, 27 to 142.
> >>>
> >> This strongly depends on the training method used. I think you used
> >> TEFT for both. Right?
> >> ------------------------
> >> Yep, I only used TEFT (that's the default, right?). Oh, one thing:
> >> dspam 3.9.1rc1 :)
> >> ------------------------
> >>
> >>> The biggest wait was using sbph; luckily I went out to jam with the
> >>> band ;). Stupid as I am, I didn't automatically start the final
> >>> batch, so at this moment I am waiting again.
> >>>
> >>> My personal experience training with this mail batch (I've got
> >>> another 3K+ training messages) is not too good :(. Using sbph my db
> >>> filled up my disk. I did not expect it to grow that big. So I
> >>> started using osb, but then I got too many FPs, and as stated
> >>> before: a lot of unhappy faces.
> >>>
> >>> Well, in the final batch there are only two with no FPs: osb and
> >>> sbph. sbph is 'the best' but not really suited for busy systems. Or
> >>> you just buy more power... Maybe I will just train a lot more and
> >>> return to chain. One thing I noticed: average time per message in
> >>> the final batch.
> >>>
> >> SBPH is usually best done using the Hash storage driver. Using an
> >> RDBMS for SBPH is suboptimal.
> >>
> >> ------------------------
> >> I kind of figured that out. But is the hash storage driver always
> >> faster? I used a DB because I wanted to see how many tokens would be
> >> created, so that was mainly for more statistics. If the hash storage
> >> driver is always faster, then a DB is not really useful for
> >> standalone servers, I suppose.
> >> ------------------------
> >>
> >>> word: 0.04667 seconds
> >>> chain: 0.09 seconds
> >>> osb: 0.47333 seconds
> >>> sbph: 4.49333 seconds
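------------------------
(These are just the final-batch times divided by its 300 messages, e.g.
14/300 = 0.04667 s for word and 1348/300 = 4.49333 s for sbph.)
------------------------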
> >>>
> >>> Hope it helps someone.
> >>>
> >>> Greetings,
> >>>
> >>> Ed van der Salm
> >>>
> >>> The Netherlands
> >>> Amstelveen
> >>
> >> --
> >> Kind Regards from Switzerland,
> >>
> >> Stevan Bajić
> >>
> >> ------------------------
> >> Greetings!
> >> @
> >> ------------------------
> >>
> >>
> >
>  
> 
> 

