Hi Kenneth,

> Okay. I think that at the very least you will need to purge the
> unuseful tokens from your training corpus. All of the tokens with
> nearly equal ham/spam counts as well as the tokens with a very
> small ham/spam count.

Is there any way to purge unuseful tokens without deleting everything
and starting from scratch?
Why does dspam not purge them automatically?
I also trained spamassassin with the same corpus, and spamd handles all
of that by itself.
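
To see how many tokens that would affect, I assume I could first count
them with something like the query below. This is only a sketch: the 10%
margin for "nearly equal" and the minimum total of 5 hits are my own
guesses, not values recommended by dspam.

```sql
-- Sketch only: count tokens that look "unuseful" before deleting anything.
-- The 0.1 margin and the threshold of 5 are assumed values, not dspam defaults.
SELECT count(*)
  FROM dspam_token_data
 WHERE (innocent_hits + spam_hits) < 5              -- very small ham/spam count
    OR abs(innocent_hits - spam_hits)
       <= 0.1 * (innocent_hits + spam_hits);        -- nearly equal ham/spam counts
```

If the count looks sensible, the same WHERE clause could presumably be
reused in a DELETE.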

I purge the database with the following purge.sql script shipped with dspam.
Should I also use dspam_clean?

/* $Id: purge.sql,v 1.52 2010/04/21 11:30:39 sbajic Exp $ */

START TRANSACTION;
DELETE FROM dspam_token_data
  WHERE (innocent_hits*2) + spam_hits < 5
  AND last_hit < CURRENT_DATE - 30;
COMMIT;

START TRANSACTION;
DELETE FROM dspam_token_data
  WHERE ((innocent_hits=1 AND spam_hits=0) OR (innocent_hits=0 AND spam_hits=1))
  AND last_hit < CURRENT_DATE - 15;
COMMIT;

START TRANSACTION;
DELETE FROM dspam_token_data
  WHERE last_hit < CURRENT_DATE - 90;
COMMIT;

START TRANSACTION;
DELETE FROM dspam_signature_data
  WHERE created_on < CURRENT_DATE - 14;
COMMIT;

VACUUM ANALYSE dspam_token_data;
VACUUM ANALYSE dspam_signature_data;

REINDEX TABLE dspam_token_data;

REINDEX TABLE dspam_signature_data;

> How did you change the fillfactor? You would need to do an alter
> table followed by a cluster to rewrite the table and include the
> freespace required by the fillfactor. Is that how you performed
> that operation? You could also do a full copy to a new table with
> the correct fillfactor.

I simply issued:
ALTER TABLE dspam_token_data SET ( FILLFACTOR = 90 );
But I did not make a full copy of the table.
Can you give me an example of how to run CLUSTER on the table?
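
From the PostgreSQL documentation I assume it would be something like the
following, but please correct me if I am wrong. The index name is a
guess; the real one should show up in \d dspam_token_data.

```sql
-- Sketch only: rewrite the table so the new fillfactor takes effect.
-- dspam_token_data_token_key is an assumed index name; check \d dspam_token_data.
ALTER TABLE dspam_token_data SET (FILLFACTOR = 90);
CLUSTER dspam_token_data USING dspam_token_data_token_key;
ANALYZE dspam_token_data;  -- refresh planner statistics after the rewrite
```

As far as I understand, CLUSTER takes an exclusive lock on the table
while it runs, so I would do this while dspam is stopped.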

> One other thing to try would be to use Markov/OSB instead of CHAIN.
> OSB generates a few more tokens than CHAIN, but it is much more
> accurate and so you will need fewer tokens to actually identify the
> ham/spam. Then, instead of simply loading all of your messages at
> once, load them incrementally and only train if the existing corpus
> fails to correctly identify the message.

Is there any way to do this type of training automatically?
If I train dspam mail by mail, will this take hours or days?

I will try Markov/OSB. I guess I have to delete the old tokens and train dspam again.
What is the recommended number of ham/spam messages for training? Are 10K spam and 10K ham enough?

> Using something like iostat while you are processing should give
> you an idea of whether you are I/O bound or not.
>
> And lastly, make certain that you have "synchronous_commit = off"
> in your postgresql.conf file.

Yes, synchronous_commit is already set to off.

> Cheers,
> Ken
>

_______________________________________________
Dspam-user mailing list
Dspam-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-user
