> On Sunday 14 February 2010 12:25:40 Stevan Bajić wrote:
>> On Sun, 14 Feb 2010 11:49:20 +0000
>>
>> Kārlis Repsons <[email protected]> wrote:
>> > I know it depends on quite many factors in total, but anyway, could we
>> > make a small list of values and info in here like this:
>>
>> what do you mean? We all here should submit our values?
> Presuming that my variables list was sufficiently complete and significant
> to understand what total disk space dspam can take up in what case: yes!
> Otherwise correct it...
>
>> I start with the values I use:
>> > 1. storage type,
>>
>> MySQL
>>
>> > 2. total size,
>>
>> Of the storage used? This depends. My current setup uses 334.8 MiB InnoDB
>> data. I have clustered my DSPAM installation. Right now my main
>> installation runs MySQL 5.1.43 in Master/Master mode. I do however have
>> other installations that use PostgreSQL and other databases.
>>
>> > 3. number of people, who contribute to create dspam data,
>>
>> On my setup: A few 100 domains.
> But what does "domain" mean there?
>
On that setup I process mail for a few hundred domains, meaning Internet
mail domains.


> Well, if 100 people with the same username add tokens, 334.8 MiB is not much,
>
On my setup it's way, way, way more than just 100 users.


> considering I've read in comments of [1] about several GiB... I guess they
> used individual tokens for each user...
>
They probably used TEFT and had a setup that led to a lot of tokens being
added daily. And they used an older version of DSPAM.

On my setup over 95% of users almost never need to retrain, maybe once or
twice a week or month per user. My growth rate in new tokens is very low.
Normally the DSPAM database is very small in the morning (due to the cleanup
at 4am). During the day the table "dspam_signature_data" grows, since DSPAM
saves the degenerated messages it processes into this table. Then at 4am the
purge script runs again and the table shrinks back to a lower size.

The database has stayed in roughly the same size range for a long time
(growing very slowly), regardless of how many users I add to the setup.

But I do daily housekeeping. I clean, purge, etc. I have no dead entries in
the database. As soon as a domain owner deletes a user from his domain, I
delete that user's DSPAM data as well. I like my database to be clean.

And I don't just blindly force DSPAM to learn (with stuff like TEFT, etc.).
I train DSPAM only on errors, and only on errors or near errors when doing
honeypot training. So I don't do any artificially forced learning.
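
To make "errors or near errors" concrete, here is a minimal sketch of such a
training decision. It is not my training script and not DSPAM code: the
function name, the 0.5 decision boundary and the two thickness values are
assumptions picked only for illustration. The asymmetry is simply that ham
and spam get different margins.

```python
def should_train(is_spam: bool, p_spam: float,
                 spam_thickness: float = 0.10,
                 ham_thickness: float = 0.20) -> bool:
    """Decide whether a message of known class should be trained.

    p_spam is the classifier's spam probability in [0, 1]. The decision
    boundary (0.5) and both thickness defaults are hypothetical values.
    """
    if is_spam:
        # Error: scored below 0.5. Near error: scored as spam, but within
        # spam_thickness of the boundary.
        return p_spam < 0.5 + spam_thickness
    # Error: scored above 0.5. Near error: scored as ham, but within
    # ham_thickness of the boundary.
    return p_spam > 0.5 - ham_thickness


if __name__ == "__main__":
    for known_spam, score in [(True, 0.30), (True, 0.55), (True, 0.95),
                              (False, 0.35), (False, 0.05)]:
        print(known_spam, score, should_train(known_spam, score))
```

A message that is already classified correctly and confidently is skipped,
which is what keeps the number of newly written tokens low.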


>> * The data from the merged group comes from roughly 5 million spam
>>  mails and 3 million ham mails.
> So, all users are sharing and extending the same ...
>
No! This is not how merged groups work. Each user's tokens go into that
user's own data pool. The merged group cannot be updated or extended by the
users. Read the DSPAM documentation for more info.


> well, database username, I'd say (share tokens effectively)? And that
> database doesn't seem to grow any larger than a bit over 300M?
>
There are many reasons why the database is not much over 300 MB in my case.


>>  * Training of the merged group is done with a honeypot that captures
>>    spam, by feeding some outbound mail as ham, and by processing other
>>    sources for ham (news groups, etc.). Training data is normalized
>>    using "boosting" techniques; only messages within a certain
>>    threshold are trained, and all are checked by me before DSPAM is
>>    allowed to train on them. The training corpora do not contain
>>    newsletters and such things (I don't train newsletters).
>>  * Training is done using a custom-made script that uses the TONE
>>    (train on error or near error) technique, uses an asymmetric
>>    thickness for ham/spam, and does double-sided training (a technique
>>    used to boost accuracy).
> Am I right in thinking, that database size is greatly affected by the used
> training mode (teft, toe, tum),
>
No. You are not right. The effective database size is greatly affected by
what gets saved in the DSPAM database. You are right that TEFT loads the
database with a lot of tokens. But so can TOE and TUM. The main reason for
big databases is that DSPAM admins out there set their maximum message size
in dspam.conf to a very high number and then use something insane like TEFT.
This leads to DSPAM saving every message in the database, and then those
admins don't run the purge script frequently enough to clean unneeded data
out of the database.
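
As a rough illustration of why the training mode matters on a busy system,
the sketch below counts how many messages per day would have their tokens
written under each mode. The message volume and error rate are made-up
numbers, the function is hypothetical, and TUM is left out because its
behaviour depends on how mature the tokens already are.

```python
def trained_per_day(messages_per_day: int, error_rate: float, mode: str) -> int:
    """Number of messages per day that get trained (tokens written).

    Simplified assumption: TEFT trains every message, TOE trains only the
    misclassified ones, NOTRAIN trains nothing.
    """
    if mode == "teft":
        return messages_per_day
    if mode == "toe":
        return round(messages_per_day * error_rate)
    if mode == "notrain":
        return 0
    raise ValueError(f"unsupported mode: {mode}")


if __name__ == "__main__":
    # Hypothetical load: 10,000 messages a day with a 1% error rate.
    for mode in ("teft", "toe", "notrain"):
        print(mode, trained_per_day(10_000, 0.01, mode))
```

With those assumed numbers TEFT trains 10,000 messages a day where TOE
trains 100, before purging or message size even enter the picture.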


> teft making the largest database?
>
Yes. TEFT by definition leads to more data in the database.


> Maybe in practice, since there has to be some serious initial training, the
> difference is not that large..?
>
Compared to others like TUM, TOE, NOTRAIN? Forget it. TEFT is insane for
anyone running a system with high inbound volume.


>> > 7. Tokenizer?
>>
>> OSB
> Can it be easily estimated what are the relative coefficients of database
> size per tokenizer? Say, if "word" is 1, then chain 10, but sbph 200?
>
Read here how the tokenizers work and how much data they produce:
http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Tokenizers
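
For a feeling of the relative token counts, here is a simplified sketch of
the four tokenizer styles. It is not DSPAM's implementation: the function
names, the "#" skip marker and the window of 5 words are assumptions for
illustration, and real token counts also depend on headers, delimiters and
how often tokens repeat.

```python
WINDOW = 5  # assumed sliding-window size for OSB and SBPH


def word_tokens(words):
    """One token per word."""
    return list(words)


def chain_tokens(words):
    """Single words plus every pair of adjacent words."""
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs


def osb_tokens(words, window=WINDOW):
    """Pair each word with each earlier word in the window, marking skips."""
    tokens = []
    for i, current in enumerate(words):
        for gap in range(1, window):
            if i - gap < 0:
                break
            parts = [words[i - gap]] + ["#"] * (gap - 1) + [current]
            tokens.append(" ".join(parts))
    return tokens


def sbph_tokens(words, window=WINDOW):
    """Combine each word with every subset of the earlier words in the window."""
    tokens = []
    for i, current in enumerate(words):
        earlier = words[max(0, i - window + 1):i]
        for mask in range(2 ** len(earlier)):
            parts = [w if (mask >> k) & 1 else "#" for k, w in enumerate(earlier)]
            parts.append(current)
            while parts[0] == "#":  # drop leading skip markers
                parts.pop(0)
            tokens.append(" ".join(parts))
    return tokens


if __name__ == "__main__":
    sample = "one two three four five six seven eight".split()
    for tokenize in (word_tokens, chain_tokens, osb_tokens, sbph_tokens):
        print(tokenize.__name__, len(tokenize(sample)))
```

In this sketch, once the window is full, each new word adds 1 token for
word, 2 for chain, 4 for OSB and 16 for SBPH.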


> [1] http://www.howtoforge.com/optimizing_dspam_mysql4.1
>
That script is from 2006! If you use DSPAM 3.9.0 then you will see that
out of the box DSPAM is already faster at purging than the above-mentioned
SQL purge script.


> ------------------------------------------------------------------------------
> SOLARIS 10 is the OS for Data Centers - provides features such as DTrace,
> Predictive Self Healing and Award Winning ZFS. Get Solaris 10 NOW
> http://p.sf.net/sfu/solaris-dev2dev
> _______________________________________________
> Dspam-user mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspam-user
>


