> On Sunday 14 February 2010 12:25:40 Stevan Bajić wrote:
>> On Sun, 14 Feb 2010 11:49:20 +0000
>>
>> Kārlis Repsons <[email protected]> wrote:
>> > I know it depends on quite many factors in total, but anyway, could we
>> > make a small list of values and info in here like this:
>>
>> What do you mean? We all here should submit our values?
> Presuming that my variables list was sufficiently complete + significant
> to understand what total disk space dspam can take up in what case -- yes!
> Otherwise correct it...
>
>> I start with the values I use:
>> > 1. storage type,
>>
>> MySQL
>>
>> > 2. total size,
>>
>> Of the storage used? This depends. My current setup uses 334.8 MiB of
>> InnoDB data. I have clustered my DSPAM installation. Right now my main
>> installation runs MySQL 5.1.43 in Master/Master mode. I do however have
>> other installations that use PostgreSQL and other databases.
>>
>> > 3. number of people, who contribute to create dspam data,
>>
>> On my setup: A few 100 domains.
> But what does "domain" mean there?
>
On that setup I process mail for a few 100 domains. Internet mail domains.

> Well, if 100 people with the same username add tokens, 334.8MiB is not
> much,
>
On my setup it's way, way, way more than just 100 users.

> considering I've read in comments of [1] about several GiB... I guess
> they used individual tokens for each user...
>
They probably used TEFT and had a setup that led to a lot of tokens being
added daily. And they used an older version of DSPAM.

On my setup over 95% of users almost never need to retrain, maybe once or
twice a week/month per user. My growth rate in new tokens is ultra low.

Normally the DSPAM database is very small in the morning (due to the
cleanup at 4am). Then during the day the table "dspam_signature_data"
grows, since DSPAM saves the degenerated messages it processes into this
table. And then again at 4am the table is reduced to a smaller size by the
purge script. The database has stayed in roughly the same size range for a
long time (growing very slowly), regardless of how many users I add to the
setup.

But I do daily housekeeping. I clean, purge, etc. I have no dead entries
in the database. As soon as a domain owner deletes a user from his domain,
I delete that user's DSPAM data as well. I like my database to be clean.

And I don't just blindly force DSPAM to learn (with stuff like TEFT, etc).
I only train DSPAM on errors, and only on errors or near errors when doing
honeypot training. So I don't do any artificially forced learning.

>> * The data from the merged group is coming from +/- 5 million spam
>> mails and 3 million ham mails.
> So, all users are sharing and extending the same ...
>
No! This is not how merged groups work. Individual user tokens go into
their own data pool. The merged group cannot be updated/extended by the
users. Read the DSPAM documentation for more info.

> well, database username, I'd say (share tokens effectively)? And that
> database doesn't seem to grow any larger than some over 300M?
>
There are many reasons why the database is not much over 300 MB in my case.

>> * Training of the merged group is done with a honeypot that captures
>> spam and by feeding some outbound mail as ham and processing other
>> sources for ham (news groups, etc.). Training data is normalized using
>> "boosting" techniques, and only messages within a certain threshold are
>> trained, all checked first by me before DSPAM is allowed to train them.
>> The training corpora do not contain any newsletters and such things
>> (I don't train newsletters).
>> * Training is done using a custom made script that uses the TONE (train
>> on error or near error) technique, an asymmetric thickness for ham/spam,
>> and double sided training (a technique used to boost accuracy).
> Am I right in thinking that database size is greatly affected by the
> used training mode (teft, toe, tum),
>
No. You are not right. The effective database size is greatly affected by
what gets saved in the DSPAM database. You are right that TEFT loads the
database with a lot of tokens. But so can TOE and TUM. The main reason for
big databases is that DSPAM admins out there set their maximum message size
in dspam.conf to a very high number and then use something insane like
TEFT, which leads to DSPAM saving every message in the database. And then
those admins don't run the purge script frequently enough to clean unneeded
data out of the database.

> teft making the largest database?
>
Yes. TEFT by definition leads to more data in the database.

> Maybe in practice, since there has to be some serious initial training,
> the difference is not that large..?
>
Compared to others like TUM, TOE, NOTRAIN? Forget it. TEFT is insane for
anyone running a system with high inbound volume.

>> > 7. Tokenizer?
>>
>> OSB
> Can it be easily estimated what are the relative coefficients of database
> size per tokenizer? Say, if "word" is 1, then chain 10, but sbph 200?
>
Read here how the tokenizers work and how much data they produce:
http://sourceforge.net/apps/mediawiki/dspam/index.php?title=Tokenizers

> [1] http://www.howtoforge.com/optimizing_dspam_mysql4.1
>
That script is from 2006! If you use DSPAM 3.9.0 you will see that, out of
the box, DSPAM already purges faster than the above mentioned SQL purge
script.

_______________________________________________
Dspam-user mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspam-user
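
On the question in this thread about relative data volume per tokenizer
(word, chain, OSB, SBPH): a rough feel can be had by counting how many
tokens each scheme emits for the same run of words. The following is only
a sketch of the generic textbook form of these tokenizers, not DSPAM's
actual code. The window size of 5, the function names and the token
formatting are assumptions made for illustration. It also counts tokens
generated per message, not distinct rows stored, so it says nothing
definite about real database size.

```python
WINDOW = 5  # assumed sliding-window size for the sparse tokenizers


def word_tokens(words):
    """One token per word (unigrams only)."""
    return list(words)


def chain_tokens(words):
    """Unigrams plus every pair of adjacent words."""
    pairs = [f"{a}+{b}" for a, b in zip(words, words[1:])]
    return list(words) + pairs


def osb_tokens(words, window=WINDOW):
    """Pair each word with each of the next window-1 words, keeping the gap."""
    out = []
    for i, anchor in enumerate(words):
        for gap in range(1, window):
            if i + gap < len(words):
                skipped = "#+" * (gap - 1)  # one '#' per skipped word
                out.append(f"{anchor}+{skipped}{words[i + gap]}")
    return out


def sbph_tokens(words, window=WINDOW):
    """Anchor word plus every keep/skip combination of the following words."""
    out = []
    for i, anchor in enumerate(words):
        tail = words[i + 1:i + window]
        for mask in range(2 ** len(tail)):
            parts = [anchor]
            for k, word in enumerate(tail):
                # bit k of mask decides whether the k-th following word is kept
                parts.append(word if (mask >> k) & 1 else "#")
            out.append("+".join(parts))
    return out


def tokens_per_message(words):
    """Number of tokens each scheme generates for one list of words."""
    return {
        "word": len(word_tokens(words)),
        "chain": len(chain_tokens(words)),
        "osb": len(osb_tokens(words)),
        "sbph": len(sbph_tokens(words)),
    }


if __name__ == "__main__":
    sample = "a b c d e f g h i j".split()
    print(tokens_per_message(sample))
```

Under these assumptions, a long message approaches roughly 1 : 2 : 4 : 16
tokens per word for word : chain : OSB : SBPH. How that maps onto stored
data depends on how many of those tokens are distinct and on purging, so
treat the ratio as an upper-level intuition only and check the Tokenizers
page above for what DSPAM really does.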
