On Sat, 19 Dec 2009 14:30:28 +0100 (CET)
"Nicolas Grekas" <nicolas.gre...@espci.org> wrote:

> > I would do that differently. I would query the default (uid 0)
> 
> You are right ! I've learnd that uid 0 is the default very recently
> and forgot to take it into account
> 
:)


> > According to dspam.conf:[...]
> 
> so, I've taken your updated script, and crafted it to follow as close as
> possible what dspam_clean does (at least as said in the man page).
> 
> So my modifs are :
> - add the same sql variables as dspam_clean manages
>
Great.

> - load uid 0 training pref (in a single query)
> 
Great.


> I've also added a query to delete old token whose probability is between 0.35
> and 0.65.
> 
Here I have to intercept. That 0.35 to 0.65 is not that easy to compute as you 
have done in the SQL clause. The problem is that to compute the probability you 
would need to query the stats table and use the totals from there and then you 
would need to read PValue and based on what is there do the computation. And 
this is just the basic stuff. You would still need to read the group file from 
DSPAM and look if the user is belonging to a shared/merged group and load the 
totals from there as well and and and... to make it short: I would not try to 
purge neutral tokens from within the SQL purge script. It will get to 
complicated.


> For the transaction parts, I've always though that a single SQL query is
> always atomic, so no need for a transaction for just one query.
> Am I wrong ?
> 
No. You are right. Single queries are atomic. It's just convinient to add the 
transaction parts into the script in case that one is going to extend that 
block and add other stuff there. Then the person modifying the block does not 
need to care about transactions and we could later even implement a roll-back 
if needed.


> >> -- Cleanup dictionnaries of passive users
> 
> > Such a query should run as one of the first queries. But why do you punish
> > users not having reclassified anything?
> 
> Then that may be too specific to my setup ... :)
> 
No, no. My error. I later realized that it is the signature table and not the 
token table. So it's not at all important when you purge them.


> So, how about this new proposition ?
> 
Looking good IMHO. Need to quickly test it and then push it to GIT :)
Thanks for the time to craft those SQL clauses. Now you should be nice and go 
on and install PostgreSQL and do the same for PostgreSQL :) :) :) ;)

btw: I have done some quick tests with MySQL 5.1.41 and the additional indices. 
On a InnoDB table adding those indices do not speed up the purging. They do 
speed up but it's so unsignificant that I ask my self if it is really that 
beneficial to index all fields from the dspam_token_data table (and double the 
size of the table)? Had you a big performance impact when enabling the index of 
all 3 additional fields? What engine are you using?


> Nicolas
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić


> 
> --
> -- Set some defaults
> --
> SET @TrainingMode    = 'TEFT';      -- Default training mode
> SET @PurgeSignatures = 14;          -- Stale signatures
> SET @PurgeNeutral    = 90;          -- Tokens with neutralish probabilities
> SET @PurgeUnused     = 90;          -- Unused tokens
> SET @PurgeHapaxes    = 30;          -- Tokens with less than 5 hits (hapaxes)
> SET @PurgeHits1S     = 15;          -- Tokens with only 1 spam hit
> SET @PurgeHits1I     = 15;          -- Tokens with only 1 innocent hit
> SET @today = to_days(current_date());
> 
> --
> -- Delete tokens with less than 5 hits (hapaxes)
> --
> START TRANSACTION;
> DELETE LOW_PRIORITY QUICK
>   FROM dspam_token_data
>     WHERE from_days(@tod...@purgehapaxes) > last_hit
>       AND (2*innocent_hits)+spam_hits < 5;
> COMMIT;
> 
> --
> -- Delete tokens with only 1 spam hit
> --
> START TRANSACTION;
> DELETE LOW_PRIORITY QUICK
>   FROM dspam_token_data
>     WHERE from_days(@tod...@purgehits1s) > last_hit
>       AND innocent_hits = 0 AND spam_hits = 1;
> COMMIT;
> 
> --
> -- Delete tokens with only 1 innocent hit
> --
> START TRANSACTION;
> DELETE LOW_PRIORITY QUICK
>   FROM dspam_token_data
>     WHERE from_days(@tod...@purgehits1i) > last_hit
>       AND innocent_hits = 1 AND spam_hits = 0;
> COMMIT;
> 
> --
> -- Delete tokens with neutralish probabilities
> --
> START TRANSACTION;
> DELETE LOW_PRIORITY QUICK
>   FROM dspam_token_data
>     WHERE from_days(@tod...@purgeneutral) > last_hit
>       AND spam_hits/(innocent_hits+spam_hits) BETWEEN 0.35 AND 0.65
> COMMIT;
> 
> --
> -- Delete unused tokens, except for TOE, TUM and NOTRAIN modes
> --
> START TRANSACTION;
> DELETE LOW_PRIORITY QUICK
>   FROM t USING dspam_token_data t
>     LEFT JOIN dspam_preferences p ON p.preference = 'trainingMode' AND p.uid =
> t.uid
>     LEFT JOIN dspam_preferences d ON d.preference = 'trainingMode' AND d.uid 
> = 0
>   WHERE from_days(@tod...@purgeunused) > last_hit
>     AND COALESCE(p.value,d.value,@TrainingMode) NOT IN 
> ('TOE','TUM','NOTRAIN');
> COMMIT;
> 
> --
> -- Delete TUM tokens seen no more than 50 times
> --
> START TRANSACTION;
> DELETE LOW_PRIORITY QUICK
>   FROM t USING dspam_token_data t
>     LEFT JOIN dspam_preferences p ON p.preference = 'trainingMode' AND p.uid =
> t.uid
>     LEFT JOIN dspam_preferences d ON d.preference = 'trainingMode' AND d.uid 
> = 0
>   WHERE from_days(@tod...@purgeunused) > last_hit
>     AND COALESCE(p.value,d.value,@TrainingMode) = 'TUM'
>     AND innocent_hits + spam_hits < 50;
> COMMIT;
> 
> --
> -- Delete stale signatures
> --
> START TRANSACTION;
> DELETE LOW_PRIORITY QUICK
>   FROM dspam_signature_data
>   WHERE from_days(@tod...@purgesignatures) > created_on;
> COMMIT;
> 
> 
> ------------------------------------------------------------------------------
> This SF.Net email is sponsored by the Verizon Developer Community
> Take advantage of Verizon's best-in-class app development support
> A streamlined, 14 day to market process makes app distribution fast and easy
> Join now and get one step closer to millions of Verizon customers
> http://p.sf.net/sfu/verizon-dev2dev 
> _______________________________________________
> Dspam-devel mailing list
> Dspam-devel@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/dspam-devel
> 

------------------------------------------------------------------------------
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel

Reply via email to