On Tue, Mar 29, 2011 at 08:27:25PM +0300, Ibrahim Harrani wrote:
> Hi Kenneth,
>
> > Okay. I think that at the very least you will need to purge the
> > unuseful tokens from your training corpus. All of the tokens with
> > nearly equal ham/spam counts as well as the tokens with a very
> > small ham/spam count.
>
> Is there any way to purge unuseful tokens without deleting everything
> and starting from scratch?
> Why does dspam not purge them automatically?
> I also trained spamassassin with the same corpus. spamd handles all
> that stuff.
>
> I purge the database with the following purge sql shipped by dspam.
> Should I also use dspam_clean?
>
> /* $Id: purge.sql,v 1.52 2010/04/21 11:30:39 sbajic Exp $ */
>
> START TRANSACTION;
> DELETE FROM dspam_token_data
>   WHERE (innocent_hits*2) + spam_hits < 5
>   AND last_hit < CURRENT_DATE - 30;
> COMMIT;
>
> START TRANSACTION;
> DELETE FROM dspam_token_data
>   WHERE ((innocent_hits=1 AND spam_hits=0) OR (innocent_hits=0 AND spam_hits=1))
>   AND last_hit < CURRENT_DATE - 15;
> COMMIT;
>
> START TRANSACTION;
> DELETE FROM dspam_token_data
>   WHERE last_hit < CURRENT_DATE - 90;
> COMMIT;
>
> START TRANSACTION;
> DELETE FROM dspam_signature_data
>   WHERE created_on < CURRENT_DATE - 14;
> COMMIT;
>
> VACUUM ANALYSE dspam_token_data;
> VACUUM ANALYSE dspam_signature_data;
>
> REINDEX TABLE dspam_token_data;
>
> REINDEX TABLE dspam_signature_data;
>
> > How did you change the fillfactor? You would need to do an alter
> > table followed by a cluster to rewrite the table and include the
> > freespace required by the fillfactor. Is that how you performed
> > that operation? You could also do a full copy to a new table with
> > the correct fillfactor.
>
> I simply issued
> ALTER TABLE dspam_token_data SET ( FILLFACTOR = 90 );
> But I did not make a full copy of the table.
> Can you give me an example of how to issue cluster on the table?
>
> > One other thing to try would be to use Markov/OSB instead of CHAIN.
> > OSB generates a few more tokens than CHAIN, but it is much more
> > accurate and so you will need fewer tokens to actually identify the
> > ham/spam. Then, instead of simply loading all of your messages at
> > once, load them incrementally and only train if the existing corpus
> > fails to correctly identify the message.
>
> Is there any way to do this type of training automatically?
> If I train dspam mail by mail, will this take hours/days?
>
> I will try Markov/OSB. I guess I have to delete the old tokens and
> train dspam again.
> What is the recommended value for ham/spam training? Is 10K spam 10ham
> enough?
>
> > Using something like iostat while you are processing should give
> > you an idea of whether you are I/O bound or not.
> >
> > And lastly, make certain that you have "synchronous_commit = off"
> > in your postgresql.conf file.
> Yes
>
> > Cheers,
> > Ken
> >
>
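To make the quoted purge.sql easier to reason about, here is a minimal
Python sketch that restates its three token rules as one predicate. It
only mirrors the WHERE clauses quoted above. The function name
purge_sql_would_delete and its parameters are hypothetical (they are not
part of dspam), and it makes no claim about how dspam_clean decides
internally, which may differ.

```python
from datetime import date, timedelta


def purge_sql_would_delete(innocent_hits: int, spam_hits: int,
                           last_hit: date, today: date) -> bool:
    """Mirror the three DELETE rules on dspam_token_data in the quoted purge.sql.

    Hypothetical helper for illustration only; it is not part of dspam.
    """
    def older_than(days: int) -> bool:
        # SQL: last_hit < CURRENT_DATE - <days>
        return last_hit < today - timedelta(days=days)

    # Rule 1: low weighted hit count, last hit more than 30 days ago.
    low_weight = (innocent_hits * 2) + spam_hits < 5 and older_than(30)
    # Rule 2: seen exactly once (ham or spam), last hit more than 15 days ago.
    seen_once = ((innocent_hits == 1 and spam_hits == 0) or
                 (innocent_hits == 0 and spam_hits == 1)) and older_than(15)
    # Rule 3: any token whose last hit is more than 90 days ago.
    stale = older_than(90)
    return low_weight or seen_once or stale
```

With today = date(2011, 3, 29), a token with a single innocent hit last
seen 16 days earlier matches the second rule. A token with 100 ham and
100 spam hits that was seen yesterday matches none of them, so these
age-based rules by themselves leave recently hit, near-neutral tokens in
place.
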
You can use dspam_clean to perform most of the cleanup needed, but the
SQL is pretty simple too. dspam_clean performs the following operations:

1. Using the -s flag, dspam_clean will perform stale signature
   purging. If an age is specified, for example -s14, the age defined
   as the default will be overridden. Specifying an age of 0 will
   delete all signatures for the users processed.

2. Using the -p flag, dspam_clean will delete all tokens from a user's
   database whose probability is between 0.35 and 0.65 (fairly neutral,
   useless tokens) that fall beyond the default age. If an age is
   specified, for example -p30, the age defined as the default will be
   overridden. It is a good idea to use this type of clean with an age
   of 0 on users after a lot of corpus training.

3. Using the -u flag, dspam_clean will delete all unused tokens from a
   user's database. There are four different types of unused tokens:

   - Tokens which have not been used for a long time
   - Tokens which have a total hit count below 5
   - Tokens which have only one spam hit
   - Tokens which have only one innocent hit

Cheers,
Ken

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user