Re: [Dspam-user] Some tokenizer statistics

Stevan Bajić Thu, 05 May 2011 06:19:43 -0700

  

I know, I know... I have resent the message with correct formatting.



On Thu, 05 May 2011 09:02:16 -0400, Julien Vehent wrote: 

> HTML MESSAGES ARE EVIL !
>
> use plain text please !
>
> (and no I'm not using mutt, that html blob below was displayed on
> roundcube)
> 
> On Thu, 05 May 2011 14:43:30 +0200, Stevan Bajić wrote:
>> On Thu, 05 May 2011 14:14:30 +0200, Ed van der Salm wrote:
>> 
>>> Since I'm not behind the machine (it's at home) for now only the info
>>> I know.
>>
>> Okay.
>>
>>> I installed a virtual machine which i restored after each run.
>>
>> 'each run' = after 5 batches + the final run
>>
>> OR
>>
>> 'each run' = after switching tokenizer?
>> 
>> I left as much settings as possible at default, so the learning method was
>>
>>> ll tokenizers like WORD and CHAIN (default). When using one of the more
>>> intelligent tokenizers (aka: OSB/SBPH) then using TOE is better (TUM
>>> would w
>>>
>>
>> and all other settings were untouched (apart from using MySQL db). Since
>> i wanted to check dspam tokenizers, I left out all other stuff like DNSBL
>> and AV.
>> 
>> Okay. 
>> 
>> I have not really checked (myself I mean) all spams/hams, they came from
>> a file which could be used for Mailscanner training. Looking at the
>> webinterface, it looks like the
>> sh. It also looked like they were mail from mailing lists et cetera. If
>> someone has a nice batch of 'real' mail to throw at it, just send
>> 
>>> be helpful. I think you by mistake just wrote the message to me.
>>>
>>>> (It seems like my webmail-client doesn't allow 'nice' inserts, so I've
>>>> put my comments between ------------------------)
>>>
>>> Okay.
>> On Thursday, 05-05-2011 at 11:43, Stevan Bajić wrote:
>> On Thu, 05 May 2011 10:22:12 +0200, Ed van der Salm wrote:
>> 
>>> Hi all,
>>>
>>> (Maybe a monospaced font makes it more readable)
>>> There seem to be some questions about the tokenizer to use (for me
>>> too), so I thought it would be nice to have some statistics.
>> 
>>> t al. I've made 6 directories containing s
>> taining ham. I thought I've read somewhere to train 2 ham against 1 spam
>> so in those di
>> 
>>> 1000
>>>> ham-04: 1000
>>>> ham-05: 500
>>>> ham-final: 200
>>>> spam-01: 250
>>>> spam-02: 500
>>>> spam-03: 500
>>>> spam-04: 500
>>>> spam-05: 250
>>>> spam-final: 100
>>>> Totaling: 6300 messages, 2100 spam and 4200 ham.
>>>> 
>>>> Some other info: Algorithm graham burton, and a MySQL database as
>>>> backend. There were only 55 'recent' spam messages. They came from my
>>>> gmail-account spambox. All other mails were training mails found
>>>> somewhere on the internet dating 2003, 2004 and 2005. In the final
>>>> batch there were 10 of the recent spams, the other 45 were spread in
>>>> the other batches.
>>>> This all was done on a KVM virtual machine, 1 cpu and 1Gb mem. There
>>>> were no other VM's running.
>>>> After that I trained using the word, chain, osb and sbph. I hope this
>>>> gives me the insight I want.
>>>> 
>>>> So, now for the real deal:
>>>> 
>>>> Token / batch:       01    02    03    04    05 final     total  tokens@db
>>>> word:  FP:            0     0     2     2     0     1         5     205234
>>>>        FN:          100    94    31    28    26     3       282
>>>>        Time sec:     37    58    63    70    34    14   276 sec
>>>>
>>>> chain: FP:            0     0     3     2     0     1         6     825549
>>>>        FN:           77    59    10    10    14     3       173
>>>>        Time:         46    79    90   111    46    27   399 sec
>>>>
>>>> osb:   FP:            1     1     3     3     0     0         8    2741757
>>>>        FN:           74    73    18    11    13     4       193
>>>>        Time:         80   126   218   469   397   142  1432 sec
>>>>
>>>> sbph:  FP:            1     1     2     6     0     0        10   13904366
>>>>        FN:           65    60    10     6    10     3       154
>>>>        Time:        544  3272  6843  8936  3532  1348  6h47m55s
>>>> 
>>>> Using osb my database grew up to 299Mb. Using sbph my database grew
>>>> up to 741Mb. The last column shows the number of tokens produced.
>>>> 
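The totals and error rates in the table above can be recomputed with a short script. This is only a sketch: it assumes FP is measured against the 4200 ham messages and FN against the 2100 spam messages, which the post does not state explicitly, and all variable and function names are made up for illustration.

```python
# Per-batch FP/FN counts copied from the table (batches 01..05, final).
HAM_TOTAL = 4200   # from "Totaling: 6300 messages, 2100 spam and 4200 ham"
SPAM_TOTAL = 2100

results = {
    "word":  {"fp": [0, 0, 2, 2, 0, 1], "fn": [100, 94, 31, 28, 26, 3]},
    "chain": {"fp": [0, 0, 3, 2, 0, 1], "fn": [77, 59, 10, 10, 14, 3]},
    "osb":   {"fp": [1, 1, 3, 3, 0, 0], "fn": [74, 73, 18, 11, 13, 4]},
    "sbph":  {"fp": [1, 1, 2, 6, 0, 0], "fn": [65, 60, 10, 6, 10, 3]},
}


def summarize(results, ham_total=HAM_TOTAL, spam_total=SPAM_TOTAL):
    """Sum FP/FN per tokenizer and derive rates.

    Assumption: FP rate = FP / ham, FN rate = FN / spam.
    """
    summary = {}
    for name, r in results.items():
        fp, fn = sum(r["fp"]), sum(r["fn"])
        summary[name] = {
            "fp": fp,
            "fn": fn,
            "fp_rate": fp / ham_total,
            "fn_rate": fn / spam_total,
            "accuracy": 1 - (fp + fn) / (ham_total + spam_total),
        }
    return summary


summary = summarize(results)
for name, s in summary.items():
    print(f"{name:5s} FP={s['fp']:3d} ({s['fp_rate']:.2%})  "
          f"FN={s['fn']:3d} ({s['fn_rate']:.2%})  "
          f"accuracy={s['accuracy']:.2%}")
```

The summed FP and FN values match the "total" column of the table; the rates are derived numbers and not part of the original post.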
>>> Some questions:
>>> 1) What learning method have you used? TEFT? TOE? TUM?
>>> 2) Are those batches cumulative or have you wiped the data after each
>>>    training batch?
>>> 3) What preferences have you in place?
>>> 4) Have you done anything in between the training batches? (stuff like
>>>    using dspam_clean etc)
>>> 5) Have you used DNSBL within DSPAM?
>>> 6) Are the SPAM and HAM messages in the same language?
>>> 
>>> ------------------------
>>> I think I answered these questions in my intro above, if you miss
>>> something, tell me, then I will look into that. But that will be
>>> sometime late tonight.
>>> (So, for 6: I haven't really looked...)
>>> ------------------------
>>>
>>> Yes. You have done that in the above response.
>>> 
>>>> What is this all telling me...
>>>>
>>>> That I'm a little disappointed. osb gave me more FP and FN than the
>>>> chain tokenizer did.
>>>> 
>>> Usually OSB/SBPH result in less training (FP/FN) in the long run while
>>> WORD will constantly require training and CHAIN is somewhere in between
>>> WORD and OSB/SBPH.
>>> 
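To see why the tokens@db column grows so quickly from word to sbph, here is a rough sketch of the four tokenizer ideas as they are commonly described. It is not DSPAM's actual code: the window size of 5, the `+` and `#` token notation and all function names are assumptions for illustration only.

```python
WINDOW = 5  # assumed sliding-window size for the sparse tokenizers


def word_tokens(words):
    """WORD idea: every single word is one token."""
    return list(words)


def chain_tokens(words):
    """CHAIN idea: single words plus every pair of adjacent words."""
    pairs = [f"{a}+{b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs


def osb_tokens(words, window=WINDOW):
    """OSB idea: pair each word with each of the previous window-1 words,
    marking skipped positions with '#'."""
    out = []
    for i, w in enumerate(words):
        for d in range(1, window):
            if i - d >= 0:
                out.append(f"{words[i - d]}+" + "#+" * (d - 1) + w)
    return out


def sbph_tokens(words, window=WINDOW):
    """SBPH idea: combine the current word with every subset of the
    previous window-1 words (skipped words become '#')."""
    out = []
    for i, w in enumerate(words):
        prev = words[max(0, i - window + 1):i]
        for mask in range(1 << len(prev)):
            parts = [p if (mask >> k) & 1 else "#" for k, p in enumerate(prev)]
            out.append("+".join(parts + [w]))
    return out


sample = "the quick brown fox jumps over the lazy".split()
tokenizers = {"word": word_tokens, "chain": chain_tokens,
              "osb": osb_tokens, "sbph": sbph_tokens}
counts = {name: len(fn(sample)) for name, fn in tokenizers.items()}
print(counts)
```

In this sketch each word past the start of a message yields 1 token for word, about 2 for chain, up to 4 for osb and up to 16 for sbph, which is the same ordering as the token counts reported in the table, although the real ratios depend on how many distinct tokens the corpus contains.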
>>> ------------------------
>>> I would agree with that.
>>> The reason for splitting
>> es in batches was to see how the ongoing training would change the
>> results. And I think the figures say the same as you. Just like I
>> expected btw (but not as big of a difference as I expected)
>> ------------------------
>> 
>>> Luckily in the f
>>
>>>> will probably need around 7500 ham messages and 3500+ (recent) spam
>>>> messages to get a proper training. The best training will be (dûh)
>>>> using real mail.
>>>> 
>>> I would not underwrite that statement. You can not conclude from your
>>> test that in general one needs 7.5K messages to get a decent result. It
>>> all depends what you train and how you train.
>>> 
>>> ------------------------
>>> OK, true... As far as I could see 4.2K ham and 2.1K spam just wasn't
>>> enough. And since I saw better results after more training I thought
>>> let's just shout some numbers... If somebody wants all messages I used,
>>> I can zip them and post it somewhere.
>>> ------------------------
>>> 
>>>> What more have I learned: If you are using the preferred tokenizer
>>>> (osb, if I followed the mailing list right) the time needed to process
>>>> a message increases pretty much. Looking at the final batch there is
>>>> an increase from chain to osb by 115 seconds, 27 to 142.
>>>> 
>>> This strongly depends on the used training method. I think you used
>>> TEFT on both algorithms. Right?
>>> ------------------------
>>> Yep, I only used TEFT (That's the default, right?) Ow, one thing: dspam
>>> 3.9.1rc1 :)
>>> ------------------------
>>>
>>> Yes. TEFT is the default.
>>> 
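The three training modes named in this thread (TEFT, TOE, TUM) differ in when a message is written back to the token database, which is why the training mode affects processing time. A simplified per-message sketch follows. It assumes the usual expansions (train everything, train on error, train until mature); the `mature` flag is a hypothetical stand-in, since DSPAM's real TUM logic works on token statistics and is not reproduced here.

```python
def should_train(mode, misclassified, mature=False):
    """Decide whether a processed message triggers a training write.

    mode          -- "TEFT", "TOE" or "TUM"
    misclassified -- True if the user reported the result as wrong
    mature        -- hypothetical flag: the data is considered well trained
    """
    if mode == "TEFT":   # train everything
        return True
    if mode == "TOE":    # train only on error
        return misclassified
    if mode == "TUM":    # train until mature, afterwards only on error
        return misclassified or not mature
    raise ValueError(f"unknown training mode: {mode}")


# Every message is trained under TEFT, only corrections under TOE.
for mode in ("TEFT", "TOE", "TUM"):
    print(mode, should_train(mode, misclassified=False, mature=True))
```

Under this simplification TEFT updates the database for all 6300 messages of the test, while TOE would only do so for the misclassified ones.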
>>>> The biggest wait was using sbph, luckily I went out to jam with the
>>>> band ;). Stupid as I am, I didn't automatically start the final batch,
>>>> so at this moment I am waiting again.
>>>> 
>>>> My personal experience training with this mail batch (I've got
>>>> another 3K+ training messages) is not too good :(. Using sbph my db
>>>> filled up my disc. I did not expect it to grow that big. So, I started
>>>> using osb, but then I got too many FPs, and as stated before: a lot of
>>>> unhappy faces.
>>>> 
>>>> Well, in the final batch there are o
>> no FP's, the osb and
>>> sbph. sbph is 'the best' but not really suited for busy systems. Or you
>>> just buy more power... Maybe I will just train a lot more and return to
>>> the chain. One thing I noticed: average time per message in the final
>>> batch.
>>> 
>> SBPH is usually best done using the Hash storage driver. Using a RDBMS
>> for SBPH is sub optimal.
>> 
>> ------------------------
>> I kind of figured that out. But is the hash storage driver always
>> faster? I used a DB because I wanted to see how many tokens would be
>> created. So that was mainly for more statistics. If the hash storage
>> driver is always faster, then a DB is not really useful for standalone
>> ser
>> 
>>> Hmm... technically the Hash driver is +/- nothing other than a memory
>>> mapped file. A RDBMS is way more than just that. So you can make your
>>> own conclusion which one should be technically faster.
>>> 
>>>> word: 0.04667 seconds
>>>> chain: 0.09 seconds
>>>> osb: 0.47333 seconds
>>>> sbph: 4.49333 seconds
>>>> 
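The per-message figures above follow from the "final" column of the timing table divided by the size of the final batch (200 ham + 100 spam). A small sketch that reproduces them, and also converts the summed sbph batch times into the 6h47m55s shown in the table; the names used here are illustrative only.

```python
FINAL_BATCH_MESSAGES = 200 + 100  # ham-final + spam-final

# Seconds spent on the final batch, taken from the table.
final_seconds = {"word": 14, "chain": 27, "osb": 142, "sbph": 1348}

# Average processing time per message in the final batch.
per_message = {name: secs / FINAL_BATCH_MESSAGES
               for name, secs in final_seconds.items()}


def hms(total_seconds):
    """Format a number of seconds as e.g. '6h47m55s'."""
    hours, rest = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{seconds:02d}s"


# sbph time per batch (01..05, final), taken from the table.
sbph_times = [544, 3272, 6843, 8936, 3532, 1348]

for name, avg in per_message.items():
    print(f"{name}: {avg:.5f} seconds")
print("sbph total:", hms(sum(sbph_times)))
```

The same dictionary also gives the chain-to-osb increase mentioned in the post: 142 - 27 = 115 seconds on the final batch.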
>>>> Hope it helps someone.
>>>> 
>>>> Greetings,
>>>>
>>>> Ed van der Salm
>>>> 
>>>> The Netherlands
>>>> Amstelveen
>>>
>>> -- 
>>> Kind Regards from Switzerland,
>>>
>>> Stevan Bajić
>>>
>>> ------------------------
>>> Greetings!
>>> @
>>> ------------------------
>>>
>>> ------------------------------------------------------------------------------
>>> WhatsUp Gold - Download Free Network Management Software
>>> The most intuitive, comprehensive, and cost-effective network
>>> management toolset available today. Delivers lowest initial
>>> acquisition cost and overall TCO of any competing solution.
>>> http://p.sf.net/sfu/whatsupgold-sd [1]
>>> _______________________________________________
>>> Dspam-user mailing list
>>> Dspam-user@lists.sourceforge.net [4]
>>> https://lists.sourceforge.net/lists/listinfo/dspam-user [5]
>>

>> -- 
>> Kind Regards from Switzerland,
>> 
>> Stevan Bajić
>> 

-- 
Kind Regards from Switzerland,

Stevan Bajić
  

Links:
------
[1] http://p.sf.net/sfu/whatsupgold-sd
[2] mailto:Dspam-user@lists.sourceforge.net
[3] https://lists.sourceforge.net/lists/listinfo/dspam-user
[4] mailto:Dspam-user@lists.sourceforge.net
[5] https://lists.sourceforge.net/lists/listinfo/dspam-user
[6] http://p.sf.net/sfu/whatsupgold-sd
[7] mailto:Dspam-user@lists.sourceforge.net
[8] https://lists.sourceforge.net/lists/listinfo/dspam-user

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
