Hi,
I'm using DSPAM with an SQL backend (PostgreSQL).
I'm having two issues :
- on the SQL purge script :
I've been wondering if this script is correct :
The provided script is doing this :
DELETE FROM dspam_token_data
WHERE (innocent_hits*2) + spam_hits < 5
AND last_hit < CURRENT_DATE - 60;
DELETE FROM dspam_token_data
WHERE innocent_hits = 1 AND spam_hits = 0
AND last_hit < CURRENT_DATE - 15;
DELETE FROM dspam_token_data
WHERE innocent_hits = 0 AND spam_hits = 1
AND last_hit < CURRENT_DATE - 15;
DELETE FROM dspam_token_data
WHERE last_hit < CURRENT_DATE - 90;
DELETE FROM dspam_signature_data
WHERE created_on < CURRENT_DATE - 14;
So OK, it deletes token records based on the spam/innocent hits value and age
of last_hit.
But I've got a couple of questions :
- what happens of the tokens of TOE users ? I guess the tokens are deleted the
same...
- The 4 first queries will most probably do severeal sequential scans of the
token_data table. Why not do all the deletes in one pass ?
So I've rewritten the purge script in order to do what I think is right :
DELETE FROM dspam_token_data USING dspam_preferences
WHERE dspam_token_data.uid=dspam_preferences.uid
AND dspam_preferences.preference='trainingMode'
AND dspam_preferences.value='TEFT'
AND ( ( (innocent_hits*2) + spam_hits < 5 AND last_hit < CURRENT_DATE - 60 )
OR ( innocent_hits = 1 AND spam_hits = 0 AND last_hit <
CURRENT_DATE - 15 )
OR ( innocent_hits = 0 AND spam_hits = 1 AND last_hit <
CURRENT_DATE - 15 )
OR ( last_hit < CURRENT_DATE - 90 )
)
;
DELETE FROM dspam_signature_data
WHERE created_on < CURRENT_DATE - 14;
Could I get other people's point of view on this ? I'm not sure I've
understood correctly the schema. But TOE and TUM aren't touched now, and the
script runs twice as fast.
- On dspam_merge :
This thing is awfully slow (at least with postgresql) and a memory hog.
I'm trying to do it's work via SQL queries to be more efficient :
Let's say I'm trying to merge uids 3 and 10 into uid 2 (3 and 10 are my 'good'
users, 2 is my globaluser)
BEGIN;
DELETE FROM dspam_token_data WHERE uid=2;
INSERT INTO dspam_token_data (uid,token,spam_hits,innocent_hits,last_hit)
SELECT 2,token,SUM(spam_hits),SUM(innocent_hits),MAX(last_hit)
FROM dspam_token_data WHERE uid IN (3,10)
GROUP BY token;
COMMIT;
This takes 2 minutes, instead of several hours, doesn't hog either the dspam
or the database server, and globaluser remains usable during all this (the
other users still see old committed data).
On this too I'd like someone to tell me if I'm wrong.
Basically, I assumed that to merge several users tokens into one, you'd just
have, for each token, to sum the spam and innocent hits of these users, and
take the max of last_hit. It seems logical, but I'd like to be sure it's
right.
I think there are some issues here, and I'd like to help address them, as
there are huge benefits in doing things this way.