On Tue, Mar 29, 2011 at 08:27:25PM +0300, Ibrahim Harrani wrote:
> Hi Kenneth,
> 
> > Okay. I think that at the very least you will need to purge the
> > unuseful tokens from your training corpus. All of the tokens with
> > nearly equal ham/spam counts as well as the tokens with a very
> > small ham/spam count.
> 
> Is there any way to purge unuseful tokens without deleting everything
> and starting from scratch?
> Why does dspam not purge them automatically?
> I also trained spamassassin with the same corpus. spamd handles all
> of that stuff.
> 
> I purge the database with the following purge SQL shipped with dspam.
> Should I also use dspam_clean?
> 
> /* $Id: purge.sql,v 1.52 2010/04/21 11:30:39 sbajic Exp $ */
> 
> START TRANSACTION;
> DELETE FROM dspam_token_data
>   WHERE (innocent_hits*2) + spam_hits < 5
>   AND last_hit < CURRENT_DATE - 30;
> COMMIT;
> 
> START TRANSACTION;
> DELETE FROM dspam_token_data
>   WHERE ((innocent_hits=1 AND spam_hits=0) OR (innocent_hits=0 AND spam_hits=1))
>   AND last_hit < CURRENT_DATE - 15;
> COMMIT;
> 
> START TRANSACTION;
> DELETE FROM dspam_token_data
>   WHERE last_hit < CURRENT_DATE - 90;
> COMMIT;
> 
> START TRANSACTION;
> DELETE FROM dspam_signature_data
>   WHERE created_on < CURRENT_DATE - 14;
> COMMIT;
> 
> VACUUM ANALYSE dspam_token_data;
> VACUUM ANALYSE dspam_signature_data;
> 
> REINDEX TABLE dspam_token_data;
> 
> REINDEX TABLE dspam_signature_data;
> 
> > How did you change the fillfactor? You would need to do an alter
> > table followed by a cluster to rewrite the table and include the
> > free space required by the fillfactor. Is that how you performed
> > that operation? You could also do a full copy to a new table with
> > the correct fillfactor.
> 
> I simply issued
> ALTER TABLE dspam_token_data SET ( FILLFACTOR = 90 );
> But I did not make a full copy of the table.
> Can you give me an example of how to run CLUSTER on the table?
> 
> > One other thing to try would be to use Markov/OSB instead of CHAIN.
> > OSB generates a few more tokens than CHAIN, but it is much more
> > accurate and so you will need fewer tokens to actually identify the
> > ham/spam. Then, instead of simply loading all of your messages at
> > once, load them incrementally and only train if the existing corpus
> > fails to correctly identify the message.
> 
> Is there any way to do this type of training automatically?
> If I train dspam mail by mail, will this take hours or days?
> 
> I will try Markov/OSB. I guess I have to delete the old tokens and train
> dspam again.
> What is the recommended amount of ham/spam for training? Is 10K spam 10ham
> enough?
> 
> > Using something like iostat while you are processing should give
> > you an idea of whether you are I/O bound or not.
> >
> > And lastly, make certain that you have "synchronous_commit = off"
> > in your postgresql.conf file.
> Yes
> 
> > Cheers,
> > Ken
> >
> 

You can use dspam_clean to perform most of the cleanup needed, but
the SQL is pretty simple too:
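
For the neutral tokens, a minimal sketch of such a query might look like
the one below. It assumes the dspam_token_data columns that appear in the
purge.sql quoted above, and it approximates a token's probability as
spam_hits / (spam_hits + innocent_hits). That ratio is only a
simplification on my part; dspam_clean calculates the probability itself
and may not agree exactly, so run the SELECT first and sanity-check the
count before deleting anything.

```sql
-- ASSUMPTION: probability is approximated as spam_hits / total hits.
-- This is not necessarily the formula dspam_clean uses internally.

-- Preview: how many tokens fall in the neutral 0.35 - 0.65 band?
SELECT count(*)
  FROM dspam_token_data
 WHERE spam_hits::float8 / NULLIF(spam_hits + innocent_hits, 0)
       BETWEEN 0.35 AND 0.65;

-- Purge them. NULLIF turns a zero total into NULL, so rows with no
-- hits at all never match and no division by zero can occur.
START TRANSACTION;
DELETE FROM dspam_token_data
 WHERE spam_hits::float8 / NULLIF(spam_hits + innocent_hits, 0)
       BETWEEN 0.35 AND 0.65;
COMMIT;

-- Refresh planner statistics after a large delete.
VACUUM ANALYSE dspam_token_data;
```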

dspam_clean performs the following operations:

    1. Using the -s flag, dspam_clean will continue to perform stale signature
     purging.  If an age is specified, for example -s14, the age defined as the
     default will be overridden.  Specifying an age of 0 will delete all
     signatures for the users processed.

    2. Using the -p flag, dspam_clean will delete all tokens from a user's
     database whose probability is between 0.35 and 0.65 (fairly neutral,
     useless tokens) that fall beyond the default age.  If an age is specified,
     for example -p30, the age defined as the default will be overridden.  It
     is a good idea to use this type of clean with an age of 0 on users after
     a lot of corpus training.

    3. Using the -u flag, dspam_clean will delete all unused tokens from a
     user's database.  There are four different types of unused tokens:

     - Tokens which have not been used for a long time
     - Tokens which have a total hit count below 5
     - Tokens which have only one spam hit
     - Tokens which have only one innocent hit
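
On the fillfactor question quoted above: ALTER TABLE ... SET (FILLFACTOR)
only affects pages written afterwards, so the table has to be rewritten
before the free space actually exists. A rough sketch using CLUSTER
follows. The index name dspam_token_data_uid_key is a placeholder and not
taken from your schema; substitute whatever the first query returns. Keep
in mind that CLUSTER holds an exclusive lock on the table while it runs.

```sql
-- Find the indexes that exist on the token table.
SELECT indexname
  FROM pg_indexes
 WHERE tablename = 'dspam_token_data';

-- Record the desired fillfactor (this alone rewrites nothing).
ALTER TABLE dspam_token_data SET (FILLFACTOR = 90);

-- Rewrite the table in index order; the new pages honour the fillfactor.
-- HYPOTHETICAL index name, replace it with one listed by the query above.
CLUSTER dspam_token_data USING dspam_token_data_uid_key;

-- Refresh planner statistics after the rewrite.
ANALYSE dspam_token_data;
```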


Cheers,
Ken

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
