On Thu, 17 Dec 2009 16:57:45 +0100 (CET)
"Nicolas Grekas" <nicolas.gre...@espci.org> wrote:

> Hello,
> 
Hallo Nicolas,


> first of all, thank you for having permitted dspam to survive after some month
> of inactivity !
> 
:)


> I've just checked the latest source for purge-4.1.sql script,
> and I think I have a better one to contribute, see below.
> It does exactly the same thing (well, 2 or 3 differences), but takes into
> account the default trainingMode - something which is missing in the actual
> script !
> 
> I've also added an index on 3 columns (last_hit, spam_hits, innocent_hits in
> this order) on dspam_token_data ; rather than 3 indexes on 1 columns (the
> same) as you recommand in the documentation.
> 
> I also think that the last line in latest purge-4.1.sql script should not be
> here (optimize table ...) :
> optimizing can be very time consumming for large data sets !
> personnaly, I run my purge.sql every day, but optimize only once a month.
> 
I run both every day.


> hope it helps
> 
It does. Allow me to add my comments inline. See below.


> Regards
> Nicolas
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić


> 
> 
> ---- my purge.sql script ----
> 
> SET @d='TEFT'; -- Default mode as defined in dspam.conf
>
I would do that differently. I would query the default (uid 0) and if that 
yields no result then fall back to TEFT:
SET @trainingmode=(SELECT COALESCE((SELECT value FROM dspam_preferences WHERE 
uid = 0 AND preference = 'trainingMode' LIMIT 1),'TEFT'));


> SET @today=to_days(current_date());
> SET @stale=30;
> 
Default should be probably another number. According to dspam.conf:
PurgeSignatures 14          # Stale signatures
PurgeNeutral    90          # Tokens with neutralish probabilities
PurgeUnused     90          # Unused tokens
PurgeHapaxes    30          # Tokens with less than 5 hits (hapaxes)
PurgeHits1S    15          # Tokens with only 1 spam hit
PurgeHits1I    15          # Tokens with only 1 innocent hit

So I would suggest 15 and then use factors to get to the other numbers except 
for signatures.


> -- Delete fairly neutral tokens older than 2...@stale days
> DELETE LOW_PRIORITY QUICK
> FROM dspam_token_data
> WHERE 2*innocent_hits + spam_hits < 5
>   AND from_days(@today...@stale) > last_hit;
> 
> -- Delete tokens seen only one time in the past @stale days
> DELETE LOW_PRIORITY QUICK
> FROM dspam_token_data
> WHERE innocent_hits + spam_hits = 1
>   AND innocent_hits BETWEEN 0 AND 1
>   AND spam_hits BETWEEN 0 AND 1
>   AND from_days(@tod...@stale) > last_hit;
> 
Would that not be faster to explicitly look for 0 or 1 instead of using BETWEEN?
WHERE ((innocent_hits = 1 AND spam_hits = 0) OR (innocent_hits = 0 AND 
spam_hits = 1))


> -- Delete all tokens never seen in the last 3...@stale days, except for TOE, 
> TUM
> and NOTRAIN modes
> DELETE LOW_PRIORITY QUICK
> FROM t USING dspam_token_data t
>   LEFT JOIN dspam_preferences p
>     ON t.uid = p.uid AND p.preference = 'trainingMode'
> WHERE from_days(@today...@stale) > last_hit
>   AND COALESCE(p.value,@d) NOT IN ('TOE','TUM','NOTRAIN');
> 
I don't think that this is working the proper way. The reason why I think this 
is not working as it should is the following. In my setup I have set 
preferences for the default user:
mysql> select * from dspam_preferences where uid=0;
+-----+------------------------+---------+
| uid | preference             | value   |
+-----+------------------------+---------+
|   0 | dailyQuarantineSummary | off     |
|   0 | enableBNR              | on      |
|   0 | enableWhitelist        | on      |
|   0 | fallbackDomain         | off     |
|   0 | ignoreGroups           | off     |
|   0 | ignoreRBLLookups       | off     |
|   0 | makeCorpus             | on      |
|   0 | optIn                  | off     |
|   0 | optOut                 | on      |
|   0 | optOutClamAV           | on      |
|   0 | processorBias          | on      |
|   0 | showFactors            | off     |
|   0 | signatureLocation      | headers |
|   0 | spamAction             | tag     |
|   0 | spamSubject            | [SPAM]  |
|   0 | statisticalSedation    | 6       |
|   0 | storeFragments         | off     |
|   0 | tagNonspam             | off     |
|   0 | tagSpam                | off     |
|   0 | trainingMode           | TOE     |
|   0 | trainPristine          | off     |
|   0 | whitelistThreshold     | 10      |
+-----+------------------------+---------+
22 rows in set (0.00 sec)

mysql>


And since I run DSPAM in opt-in mode, normal users only have the opt-in/out 
options set:
mysql> select * from dspam_preferences where uid=2068;
+------+------------+-------+
| uid  | preference | value |
+------+------------+-------+
| 2068 | optIn      | on    |
| 2068 | optOut     | off   |
+------+------------+-------+
2 rows in set (0.00 sec)

mysql>


Now the problem with the above SQL clause is that it will not delete any tokens 
for that user since it will NOT find "preference = 'trainingMode'" for that 
user and beside that it will assume 'TEFT' for the user while the user is 
clearly using 'TOE':
theia ~ # dspam_admin aggregate preference <user with uid 2068>
dailyQuarantineSummary=off
enableBNR=on
enableWhitelist=on
fallbackDomain=off
ignoreGroups=off
ignoreRBLLookups=off
makeCorpus=on
optIn=on
optOut=off
optOutClamAV=on
processorBias=on
showFactors=off
signatureLocation=headers
spamAction=tag
spamSubject=[SPAM]
statisticalSedation=6
storeFragments=off
tagNonspam=off
tagSpam=off
trainingMode=TOE
trainPristine=off
whitelistThreshold=10
theia ~ # 


So better would be to mimic what DSPAM does by using the default (uid 0) and 
fall back to 'TEFT' in case there is no default AND the user has not set 
trainingMode. This would however require us to run the delete query twice. Once 
for users having set trainingMode and once for users NOT having set 
trainingMode.


> -- Delete TUM tokens seen no more than 50 times in the last 3...@stale days
> DELETE LOW_PRIORITY QUICK
> FROM t USING dspam_token_data t
>   LEFT JOIN dspam_preferences p
>     ON t.uid = p.uid AND p.preference = 'trainingMode'
> WHERE from_days(@today...@stale) > last_hit
>   AND innocent_hits + spam_hits < 50
>   AND COALESCE(p.value,@d) = 'TUM';
> 
The same applies here as above. Users NOT having set trainingMode should have 
another delete run.


> -- Delete signatures older than @stale days
> DELETE LOW_PRIORITY QUICK
> FROM dspam_signature_data
> WHERE from_days(@tod...@stale) > created_on;
> 
IMHO that should be the first SQL clause since that will remove more tokens 
then any other query and things that run on a global scale without uid 
should(/could) be runned before any other query.


> -- Cleanup dictionnaries of passive users
> DELETE LOW_PRIORITY QUICK
> FROM t USING dspam_token_data t
>   JOIN dspam_stats s
>     ON t.uid=s.uid
> WHERE from_days(@tod...@stale) > last_hit
>   AND s.spam_misclassified=0
>   AND s.innocent_misclassified=0;
> 
Such a query should run as one of the first queries. But why do you punish 
users not having reclassified anything? If a user is running TEFT or TUM an 
he/she never has any error to reclassify then this query will delete ALL tokens 
for that user. So you are punishing users for nothing. What is the 
purpose/benefit of doing that?


If you would ask me then I would craft the purge script as follow:
--
-- Set some defaults
--
SET @trainingmode=(SELECT COALESCE((SELECT value FROM dspam_preferences WHERE 
uid = 0 AND preference = 'trainingMode' LIMIT 1),'TEFT'));
SET @today=to_days(current_date());
SET @stale=15;

--
-- Delete fairly neutral tokens older than 4...@stale days
--
START TRANSACTION;
DELETE LOW_PRIORITY QUICK
  FROM dspam_token_data
    WHERE from_days(@today-(4...@stale)) > last_hit
    AND (innocent_hits*2) + spam_hits < 5;
COMMIT;

--
-- Delete tokens seen only one time per class in the past @stale days
--
START TRANSACTION;
DELETE LOW_PRIORITY QUICK
  FROM dspam_token_data
    WHERE from_days(@tod...@stale) > last_hit
      AND ((innocent_hits = 1 AND spam_hits = 0) OR (innocent_hits = 0 AND 
spam_hits = 1));
COMMIT;

--
-- Delete all tokens never seen in the last 4...@stale days, except for TOE, 
TUM and NOTRAIN modes
--

-- For users having set 'trainingMode' in their preferences
START TRANSACTION;
DELETE LOW_PRIORITY QUICK
  FROM t USING dspam_token_data t
    LEFT JOIN dspam_preferences p
      ON (t.uid = p.uid AND p.preference = 'trainingMode' AND p.value NOT IN 
('TOE','TUM','NOTRAIN'))
  WHERE from_days(@today-(4...@stale)) > last_hit;
COMMIT;

-- For users NOT having set 'trainingMode' in their preferences (therefor using 
default/uid 0)
START TRANSACTION;
DELETE LOW_PRIORITY QUICK
  FROM t USING dspam_token_data t
    LEFT JOIN dspam_preferences p
      ON (t.uid = p.uid)
  WHERE @trainingmode NOT IN ('TOE','TUM','NOTRAIN')
    AND NOT EXISTS (SELECT uid FROM dspam_preferences WHERE uid = t.uid AND 
preference = 'trainingMode' LIMIT 1)
    AND from_days(@today-(4...@stale)) > last_hit;
COMMIT;

--
-- Delete TUM tokens seen no more than 50 times in the last 6...@stale days
--

-- For users having set 'trainingMode' in their preferences
START TRANSACTION;
  DELETE LOW_PRIORITY QUICK
    FROM t USING dspam_token_data t
      LEFT JOIN dspam_preferences p
        ON (t.uid = p.uid AND p.preference = 'trainingMode' AND p.value = 'TUM')
    WHERE from_days(@today-(6...@stale)) > last_hit
      AND innocent_hits + spam_hits < 50;
COMMIT;

-- For users NOT having set 'trainingMode' in their preferences (therefor using 
default/uid 0)
START TRANSACTION;
  DELETE LOW_PRIORITY QUICK
    FROM t USING dspam_token_data t
      LEFT JOIN dspam_preferences p
        ON (t.uid = p.uid)
    WHERE from_days(@today-(6...@stale)) > last_hit
      AND NOT EXISTS (SELECT uid FROM dspam_preferences WHERE uid = t.uid AND 
preference = 'trainingMode' LIMIT 1)
      AND innocent_hits + spam_hits < 50
      AND @trainingmode = 'TUM';
COMMIT;

--
-- Delete signatures older than @stale days
--
START TRANSACTION;
DELETE LOW_PRIORITY QUICK
  FROM dspam_signature_data
  WHERE from_days(@today-(@stale-1)) > created_on;
COMMIT;

------------------------------------------------------------------------------
This SF.Net email is sponsored by the Verizon Developer Community
Take advantage of Verizon's best-in-class app development support
A streamlined, 14 day to market process makes app distribution fast and easy
Join now and get one step closer to millions of Verizon customers
http://p.sf.net/sfu/verizon-dev2dev 
_______________________________________________
Dspam-devel mailing list
Dspam-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspam-devel

Reply via email to