On Sunday 14 February 2010 12:25:40 Stevan Bajić wrote:
> On Sun, 14 Feb 2010 11:49:20 +0000
> 
> Kārlis Repsons <[email protected]> wrote:
> > I know it depends on quite many factors in total, but anyway, could we
> > make a small list of values and info in here like this:
> 
> What do you mean? Should we all submit our values here?
Yes, presuming my list of variables is complete and significant enough to 
understand how much disk space dspam takes up in which case. Otherwise, 
please correct it...

> I start with the values I use:
> > 1. storage type,
> 
> MySQL
> 
> > 2. total size,
> 
> Of the storage used? This depends. My current setup uses 334.8 MiB INNODB
>  data. I have clustered my DSPAM installation. Right now my main
>  installation runs MySQL 5.1.43 in Master/Master mode. I do however have
>  other installations that use PostgreSQL and other databases.
> 
> > 3. number of people, who contribute to create dspam data,
> 
> On my setup: A few 100 domains.
But what does "domain" mean there?
Well, if 100 people add tokens under the same username, 334.8 MiB is not much, 
considering that the comments on [1] mention several GiB... I guess they 
used individual tokens for each user...

> * The data from the merged group is coming from roughly 5 million spam
>  mails and 3 million ham mails.
So all users share and extend the same ... well, database username, I'd say 
(effectively sharing tokens)? And that database doesn't seem to grow much 
beyond 300M?

>  * Training of the merged group is done with a honeypot that captures
>    spam, by feeding some outbound mail as ham, and by processing other
>    sources for ham (news groups, etc.). Training data is normalized using
>    "boosting" techniques; only messages that are in a certain threshold
>    are trained, and all are checked first by me before DSPAM is allowed
>    to train them. The training corpus does not have any newsletters and
>    such things (I don't train newsletters).
>  * Training is done using a custom-made script that uses the TONE (train
>    on error or near error) technique, uses an asymmetric thickness for
>    ham/spam, and does double-sided training (a technique used to boost
>    accuracy).
Am I right in thinking that database size is greatly affected by the training 
mode used (teft, toe, tum), with teft producing the largest database? Maybe in 
practice, since there has to be some serious initial training, the difference 
is not that large..?
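
To make the question concrete, here is a toy sketch of why I would expect teft 
to store more than toe. It assumes teft learns from every message and toe only 
from misclassified ones; tum is not modelled. The function name, the fixed 
"misclassified" flags and the sample messages are all made up for illustration 
and this is not how dspam itself is implemented:

```python
def stored_tokens(messages, mode):
    """Return the set of tokens a toy store would hold after training.

    messages: iterable of (tokens, misclassified) pairs (hypothetical data).
    mode: "teft" learns from every message, "toe" only from errors.
    """
    if mode not in ("teft", "toe"):
        raise ValueError("only teft and toe are modelled in this sketch")
    db = set()
    for tokens, misclassified in messages:
        if mode == "teft" or misclassified:
            db.update(tokens)
    return db


# Made-up sample: four messages, two of them misclassified.
sample = [
    ({"a", "b"}, True),
    ({"b", "c"}, False),
    ({"d"}, False),
    ({"a", "e"}, True),
]

for mode in ("teft", "toe"):
    print(mode, len(stored_tokens(sample, mode)))
```

In a real setup the error rate changes as the database fills up and stale 
tokens get purged, so this only shows the direction of the effect, not its 
size.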

> > 7. Tokenizer?
> 
> OSB
Can the relative database size per tokenizer be easily estimated? Say, if 
"word" is 1, then chain is 10, but sbph 200?
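
As a starting point, here is a rough sketch that counts how many tokens each 
tokenizer style would generate for one message. It assumes the textbook 
definitions (word = single words, chain = single words plus adjacent pairs, 
osb and sbph over a 5-word window); the function names and token formats are 
my own and dspam's actual implementation may well differ:

```python
def word_tokens(words):
    """One token per word."""
    return list(words)


def chain_tokens(words):
    """Single words plus every pair of adjacent words."""
    pairs = [f"{a}+{b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs


def osb_tokens(words, window=5):
    """Pair each word with each of the next window-1 words, keeping the gap."""
    out = []
    for i, w in enumerate(words):
        for gap in range(1, window):
            j = i + gap
            if j < len(words):
                out.append(f"{w}+<skip {gap - 1}>+{words[j]}")
    return out


def sbph_tokens(words, window=5):
    """Combine each word with every subset of the next window-1 words."""
    out = []
    for i, w in enumerate(words):
        followers = words[i + 1:i + window]
        for mask in range(2 ** len(followers)):
            parts = [w]
            for bit, f in enumerate(followers):
                parts.append(f if mask & (1 << bit) else "<skip>")
            out.append("+".join(parts))
    return out


# Hypothetical message of 20 distinct words.
words = [f"w{i}" for i in range(20)]
base = len(word_tokens(words))
for name, fn in [("word", word_tokens), ("chain", chain_tokens),
                 ("osb", osb_tokens), ("sbph", sbph_tokens)]:
    print(name, len(fn(words)), round(len(fn(words)) / base, 2))
```

For long messages the counts per word approach 1, 2, 4 and 16 under these 
assumptions. That is only tokens generated per message; how much the database 
grows depends on how many of them are new, so the real size ratios could be 
quite different.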


[1] http://www.howtoforge.com/optimizing_dspam_mysql4.1
