On Thu, 17 Dec 2009 16:57:45 +0100 (CET) "Nicolas Grekas" <nicolas.gre...@espci.org> wrote:
> Hello, > Hallo Nicolas, > first of all, thank you for having permitted dspam to survive after some month > of inactivity ! > :) > I've just checked the latest source for purge-4.1.sql script, > and I think I have a better one to contribute, see below. > It does exactly the same thing (well, 2 or 3 differences), but takes into > account the default trainingMode - something which is missing in the actual > script ! > > I've also added an index on 3 columns (last_hit, spam_hits, innocent_hits in > this order) on dspam_token_data ; rather than 3 indexes on 1 columns (the > same) as you recommand in the documentation. > > I also think that the last line in latest purge-4.1.sql script should not be > here (optimize table ...) : > optimizing can be very time consumming for large data sets ! > personnaly, I run my purge.sql every day, but optimize only once a month. > I run both every day. > hope it helps > It does. Allow me to add my comments inline. See below. > Regards > Nicolas > -- Kind Regards from Switzerland, Stevan Bajić > > > ---- my purge.sql script ---- > > SET @d='TEFT'; -- Default mode as defined in dspam.conf > I would do that differently. I would query the default (uid 0) and if that yields no result then fall back to TEFT: SET @trainingmode=(SELECT COALESCE((SELECT value FROM dspam_preferences WHERE uid = 0 AND preference = 'trainingMode' LIMIT 1),'TEFT')); > SET @today=to_days(current_date()); > SET @stale=30; > Default should be probably another number. According to dspam.conf: PurgeSignatures 14 # Stale signatures PurgeNeutral 90 # Tokens with neutralish probabilities PurgeUnused 90 # Unused tokens PurgeHapaxes 30 # Tokens with less than 5 hits (hapaxes) PurgeHits1S 15 # Tokens with only 1 spam hit PurgeHits1I 15 # Tokens with only 1 innocent hit So I would suggest 15 and then use factors to get to the other numbers except for signatures. > -- Delete fairly neutral tokens older than 2...@stale days > DELETE LOW_PRIORITY QUICK > FROM dspam_token_data > WHERE 2*innocent_hits + spam_hits < 5 > AND from_days(@today...@stale) > last_hit; > > -- Delete tokens seen only one time in the past @stale days > DELETE LOW_PRIORITY QUICK > FROM dspam_token_data > WHERE innocent_hits + spam_hits = 1 > AND innocent_hits BETWEEN 0 AND 1 > AND spam_hits BETWEEN 0 AND 1 > AND from_days(@tod...@stale) > last_hit; > Would that not be faster to explicitly look for 0 or 1 instead of using BETWEEN? WHERE ((innocent_hits = 1 AND spam_hits = 0) OR (innocent_hits = 0 AND spam_hits = 1)) > -- Delete all tokens never seen in the last 3...@stale days, except for TOE, > TUM > and NOTRAIN modes > DELETE LOW_PRIORITY QUICK > FROM t USING dspam_token_data t > LEFT JOIN dspam_preferences p > ON t.uid = p.uid AND p.preference = 'trainingMode' > WHERE from_days(@today...@stale) > last_hit > AND COALESCE(p.value,@d) NOT IN ('TOE','TUM','NOTRAIN'); > I don't think that this is working the proper way. The reason why I think this is not working as it should is the following. In my setup I have set preferences for the default user: mysql> select * from dspam_preferences where uid=0; +-----+------------------------+---------+ | uid | preference | value | +-----+------------------------+---------+ | 0 | dailyQuarantineSummary | off | | 0 | enableBNR | on | | 0 | enableWhitelist | on | | 0 | fallbackDomain | off | | 0 | ignoreGroups | off | | 0 | ignoreRBLLookups | off | | 0 | makeCorpus | on | | 0 | optIn | off | | 0 | optOut | on | | 0 | optOutClamAV | on | | 0 | processorBias | on | | 0 | showFactors | off | | 0 | signatureLocation | headers | | 0 | spamAction | tag | | 0 | spamSubject | [SPAM] | | 0 | statisticalSedation | 6 | | 0 | storeFragments | off | | 0 | tagNonspam | off | | 0 | tagSpam | off | | 0 | trainingMode | TOE | | 0 | trainPristine | off | | 0 | whitelistThreshold | 10 | +-----+------------------------+---------+ 22 rows in set (0.00 sec) mysql> And since I run DSPAM in opt-in mode, normal users only have the opt-in/out options set: mysql> select * from dspam_preferences where uid=2068; +------+------------+-------+ | uid | preference | value | +------+------------+-------+ | 2068 | optIn | on | | 2068 | optOut | off | +------+------------+-------+ 2 rows in set (0.00 sec) mysql> Now the problem with the above SQL clause is that it will not delete any tokens for that user since it will NOT find "preference = 'trainingMode'" for that user and beside that it will assume 'TEFT' for the user while the user is clearly using 'TOE': theia ~ # dspam_admin aggregate preference <user with uid 2068> dailyQuarantineSummary=off enableBNR=on enableWhitelist=on fallbackDomain=off ignoreGroups=off ignoreRBLLookups=off makeCorpus=on optIn=on optOut=off optOutClamAV=on processorBias=on showFactors=off signatureLocation=headers spamAction=tag spamSubject=[SPAM] statisticalSedation=6 storeFragments=off tagNonspam=off tagSpam=off trainingMode=TOE trainPristine=off whitelistThreshold=10 theia ~ # So better would be to mimic what DSPAM does by using the default (uid 0) and fall back to 'TEFT' in case there is no default AND the user has not set trainingMode. This would however require us to run the delete query twice. Once for users having set trainingMode and once for users NOT having set trainingMode. > -- Delete TUM tokens seen no more than 50 times in the last 3...@stale days > DELETE LOW_PRIORITY QUICK > FROM t USING dspam_token_data t > LEFT JOIN dspam_preferences p > ON t.uid = p.uid AND p.preference = 'trainingMode' > WHERE from_days(@today...@stale) > last_hit > AND innocent_hits + spam_hits < 50 > AND COALESCE(p.value,@d) = 'TUM'; > The same applies here as above. Users NOT having set trainingMode should have another delete run. > -- Delete signatures older than @stale days > DELETE LOW_PRIORITY QUICK > FROM dspam_signature_data > WHERE from_days(@tod...@stale) > created_on; > IMHO that should be the first SQL clause since that will remove more tokens then any other query and things that run on a global scale without uid should(/could) be runned before any other query. > -- Cleanup dictionnaries of passive users > DELETE LOW_PRIORITY QUICK > FROM t USING dspam_token_data t > JOIN dspam_stats s > ON t.uid=s.uid > WHERE from_days(@tod...@stale) > last_hit > AND s.spam_misclassified=0 > AND s.innocent_misclassified=0; > Such a query should run as one of the first queries. But why do you punish users not having reclassified anything? If a user is running TEFT or TUM an he/she never has any error to reclassify then this query will delete ALL tokens for that user. So you are punishing users for nothing. What is the purpose/benefit of doing that? If you would ask me then I would craft the purge script as follow: -- -- Set some defaults -- SET @trainingmode=(SELECT COALESCE((SELECT value FROM dspam_preferences WHERE uid = 0 AND preference = 'trainingMode' LIMIT 1),'TEFT')); SET @today=to_days(current_date()); SET @stale=15; -- -- Delete fairly neutral tokens older than 4...@stale days -- START TRANSACTION; DELETE LOW_PRIORITY QUICK FROM dspam_token_data WHERE from_days(@today-(4...@stale)) > last_hit AND (innocent_hits*2) + spam_hits < 5; COMMIT; -- -- Delete tokens seen only one time per class in the past @stale days -- START TRANSACTION; DELETE LOW_PRIORITY QUICK FROM dspam_token_data WHERE from_days(@tod...@stale) > last_hit AND ((innocent_hits = 1 AND spam_hits = 0) OR (innocent_hits = 0 AND spam_hits = 1)); COMMIT; -- -- Delete all tokens never seen in the last 4...@stale days, except for TOE, TUM and NOTRAIN modes -- -- For users having set 'trainingMode' in their preferences START TRANSACTION; DELETE LOW_PRIORITY QUICK FROM t USING dspam_token_data t LEFT JOIN dspam_preferences p ON (t.uid = p.uid AND p.preference = 'trainingMode' AND p.value NOT IN ('TOE','TUM','NOTRAIN')) WHERE from_days(@today-(4...@stale)) > last_hit; COMMIT; -- For users NOT having set 'trainingMode' in their preferences (therefor using default/uid 0) START TRANSACTION; DELETE LOW_PRIORITY QUICK FROM t USING dspam_token_data t LEFT JOIN dspam_preferences p ON (t.uid = p.uid) WHERE @trainingmode NOT IN ('TOE','TUM','NOTRAIN') AND NOT EXISTS (SELECT uid FROM dspam_preferences WHERE uid = t.uid AND preference = 'trainingMode' LIMIT 1) AND from_days(@today-(4...@stale)) > last_hit; COMMIT; -- -- Delete TUM tokens seen no more than 50 times in the last 6...@stale days -- -- For users having set 'trainingMode' in their preferences START TRANSACTION; DELETE LOW_PRIORITY QUICK FROM t USING dspam_token_data t LEFT JOIN dspam_preferences p ON (t.uid = p.uid AND p.preference = 'trainingMode' AND p.value = 'TUM') WHERE from_days(@today-(6...@stale)) > last_hit AND innocent_hits + spam_hits < 50; COMMIT; -- For users NOT having set 'trainingMode' in their preferences (therefor using default/uid 0) START TRANSACTION; DELETE LOW_PRIORITY QUICK FROM t USING dspam_token_data t LEFT JOIN dspam_preferences p ON (t.uid = p.uid) WHERE from_days(@today-(6...@stale)) > last_hit AND NOT EXISTS (SELECT uid FROM dspam_preferences WHERE uid = t.uid AND preference = 'trainingMode' LIMIT 1) AND innocent_hits + spam_hits < 50 AND @trainingmode = 'TUM'; COMMIT; -- -- Delete signatures older than @stale days -- START TRANSACTION; DELETE LOW_PRIORITY QUICK FROM dspam_signature_data WHERE from_days(@today-(@stale-1)) > created_on; COMMIT; ------------------------------------------------------------------------------ This SF.Net email is sponsored by the Verizon Developer Community Take advantage of Verizon's best-in-class app development support A streamlined, 14 day to market process makes app distribution fast and easy Join now and get one step closer to millions of Verizon customers http://p.sf.net/sfu/verizon-dev2dev _______________________________________________ Dspam-devel mailing list Dspam-devel@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/dspam-devel