Re: [Dspam-devel] Better purge.sql for MySQL

Stevan Bajić Sat, 19 Dec 2009 08:21:12 -0800

On Sat, 19 Dec 2009 09:50:12 -0600
Kenneth Marshall <[email protected]> wrote:


> On Sat, Dec 19, 2009 at 03:28:01PM +0100, Stevan Bajić wrote:
> > On Sat, 19 Dec 2009 14:30:28 +0100 (CET)
> > "Nicolas Grekas" <[email protected]> wrote:
> > 
> > > > I would do that differently. I would query the default (uid 0)
> > > 
> > > You are right! I learned only very recently that uid 0 is the default
> > > and forgot to take it into account.
> > > 
> > :)
> > 
> > 
> > > > According to dspam.conf:[...]
> > > 
> > > so, I've taken your updated script, and crafted it to follow as closely as
> > > possible what dspam_clean does (at least as described in the man page).
> > > 
> > > So my modifications are:
> > > - add the same sql variables as dspam_clean manages
> > >
> > Great.
> > 
> > > - load uid 0 training pref (in a single query)
> > > 
> > Great.
> > 
> > 
> > > I've also added a query to delete old tokens whose probability is between
> > > 0.35 and 0.65.
> > > 
> > Here I have to object. That 0.35 to 0.65 range is not as easy to compute as
> > you have done in the SQL clause. To compute the probability you would need
> > to query the stats table and use the totals from there, then read PValue
> > and do the computation according to its setting. And that is just the basic
> > stuff. You would still need to read the DSPAM group file, check whether the
> > user belongs to a shared/merged group, load the totals from there as well,
> > and so on. To make it short: I would not try to purge neutral tokens from
> > within the SQL purge script. It would get too complicated.
> > 
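To illustrate why the token row alone is not enough, here is a minimal sketch. It assumes a simplified Graham-style ratio, which is only one possible formula; what DSPAM actually computes depends on PValue and on group membership and is not reproduced here. The function names and the sample totals are made up for the example.

```python
def token_probability(spam_hits, innocent_hits, total_spam, total_innocent):
    """Simplified Graham-style spam probability of one token.

    ASSUMPTION: this is an illustrative formula, not DSPAM's own code.
    The totals would have to come from the stats table (or from the
    group's totals for shared/merged groups).
    """
    if total_spam == 0 or total_innocent == 0:
        return 0.5  # nothing to compare against: treat as neutral
    spam_ratio = spam_hits / total_spam
    innocent_ratio = innocent_hits / total_innocent
    if spam_ratio + innocent_ratio == 0:
        return 0.5
    return spam_ratio / (spam_ratio + innocent_ratio)


def is_neutral(probability, low=0.35, high=0.65):
    """True if the probability falls into the 'neutral' band."""
    return low <= probability <= high


# The same token row (10 spam hits, 10 innocent hits) for two users
# whose corpus totals differ (hypothetical numbers).
balanced = token_probability(10, 10, total_spam=1000, total_innocent=1000)
skewed = token_probability(10, 10, total_spam=1000, total_innocent=100)

print(balanced, is_neutral(balanced))
print(skewed, is_neutral(skewed))
```

With identical hit counts the token is neutral for the first user but not for the second, so a WHERE clause that looks only at spam_hits and innocent_hits cannot decide it.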
> > 
> > > For the transaction parts, I've always thought that a single SQL query is
> > > always atomic, so no need for a transaction for just one query.
> > > Am I wrong?
> > > 
> > No. You are right. Single queries are atomic. It's just convenient to have
> > the transaction parts in the script in case someone is going to extend
> > that block and add other stuff there. Then the person modifying the block
> > does not need to care about transactions and we could later even implement
> > a roll-back if needed.
> > 
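A minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in for MySQL so it runs anywhere. The signature table, its created_on column, and the helper names are assumptions for the example and are not taken from purge.sql.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory database with one old row per table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dspam_token_data (uid INTEGER, token INTEGER, last_hit TEXT)"
    )
    # ASSUMPTION: table layout simplified for the example.
    conn.execute(
        "CREATE TABLE dspam_signature_data (uid INTEGER, signature TEXT, created_on TEXT)"
    )
    conn.execute("INSERT INTO dspam_token_data VALUES (1, 42, '2009-01-01')")
    conn.execute("INSERT INTO dspam_signature_data VALUES (1, 'abc', '2009-01-01')")
    conn.commit()
    return conn


def count_rows(conn, table):
    """Row count of one of the two demo tables."""
    if table not in ("dspam_token_data", "dspam_signature_data"):
        raise ValueError("unknown table")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


def purge(conn, cutoff, fail_midway=False):
    """Run all purge statements as one unit; undo all of them on error."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM dspam_token_data WHERE last_hit < ?", (cutoff,))
            if fail_midway:
                raise RuntimeError("simulated failure between statements")
            conn.execute(
                "DELETE FROM dspam_signature_data WHERE created_on < ?", (cutoff,)
            )
    except RuntimeError:
        return False
    return True
```

Anyone adding a third DELETE inside the `with conn:` block gets the all-or-nothing behaviour for free, which is the convenience argued for above.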
> > 
> > > >> -- Cleanup dictionnaries of passive users
> > > 
> > > > Such a query should run as one of the first queries. But why do you
> > > > punish users not having reclassified anything?
> > > 
> > > Then that may be too specific to my setup ... :)
> > > 
> > No, no. My error. I later realized that it is the signature table and not 
> > the token table. So it's not at all important when you purge them.
> > 
> > 
> > > So, how about this new proposition ?
> > > 
> > Looking good IMHO. Need to quickly test it and then push it to GIT :)
> > Thanks for the time to craft those SQL clauses. Now you should be nice and 
> > go on and install PostgreSQL and do the same for PostgreSQL :) :) :) ;)
> > 
> > btw: I have done some quick tests with MySQL 5.1.41 and the additional
> > indices. On an InnoDB table, adding those indices does not noticeably
> > speed up the purging. They do speed it up, but so insignificantly that I
> > ask myself whether it is really beneficial to index all fields of the
> > dspam_token_data table (and double the size of the table). Did you see a
> > big performance impact when enabling the index on all 3 additional fields?
> > What engine are you using?
> 
> Stevan,
> 
Hallo Ken,


> I am not a MySQL expert, but no index updates are free
>
They are not free in MySQL. First of all they use space. And then they use CPU
time on every insert/delete/update.

Most users could easily live with the additional space. Some of my
installations are in the GB range, and it does not bother me that they use more
space. But the CPU time does bother me. And to be honest:

The table in MySQL is:
+---------------+---------------------+------+-----+---------+-------+
| Field         | Type                | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| uid           | int(10) unsigned    | NO   | PRI | NULL    |       |
| token         | bigint(20) unsigned | NO   | PRI | NULL    |       |
| spam_hits     | bigint(20) unsigned | NO   |     | NULL    |       |
| innocent_hits | bigint(20) unsigned | NO   |     | NULL    |       |
| last_hit      | date                | NO   |     | NULL    |       |
+---------------+---------------------+------+-----+---------+-------+


We have an index on (uid,token). That's fine. Makes lookups faster and without 
it things would be ultra slow.

But then adding an additional index for (spam_hits,innocent_hits,last_hit) is 
insane. All fields would be indexed. That does not sound right to me.
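
The space side of that cost can be made visible with a small sketch. It uses Python's built-in sqlite3 module as a stand-in, so the absolute numbers say nothing about MySQL or InnoDB; the index name, the row count, and the generated data are made up for the example.

```python
import sqlite3


def pages_used(with_extra_index, rows=20000):
    """Fill a token table and return how many database pages it occupies."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dspam_token_data ("
        " uid INTEGER NOT NULL,"
        " token INTEGER NOT NULL,"
        " spam_hits INTEGER NOT NULL,"
        " innocent_hits INTEGER NOT NULL,"
        " last_hit TEXT NOT NULL,"
        " PRIMARY KEY (uid, token))"
    )
    if with_extra_index:
        # The additional index under discussion (name is hypothetical).
        conn.execute(
            "CREATE INDEX extra_idx ON dspam_token_data"
            " (spam_hits, innocent_hits, last_hit)"
        )
    conn.executemany(
        "INSERT INTO dspam_token_data VALUES (?, ?, ?, ?, ?)",
        ((i % 50, i, i % 7, i % 11, "2009-12-19") for i in range(rows)),
    )
    conn.commit()
    pages = conn.execute("PRAGMA page_count").fetchone()[0]
    conn.close()
    return pages


without_index = pages_used(with_extra_index=False)
with_index = pages_used(with_extra_index=True)
print("pages without extra index:", without_index)
print("pages with extra index:   ", with_index)
```

Every one of those extra pages also has to be maintained on each insert, delete and update, which is where the CPU time goes.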

In the old days (aka: <3.9.0) that "hack" helped to get dspam_clean to work
faster. But we changed the MySQL and PostgreSQL drivers in 3.9.0 and now they
are fast even if you don't index anything other than just (uid,token). On one
of my clusters, running dspam_clean WITHOUT having indexed any of the
additional fields is reasonably fast:
theia dspam # time dspam_clean -s14 -p30 -u90,30,15,15

real    0m25.259s
user    0m18.220s
sys     0m0.720s
theia dspam #

And the token table is not that small:
theia dspam # mysql --user=root --password=$(</root/mysql.pwd) \
    --socket=/var/run/mysqld/mysqld.sock \
    -e "select count(uid) from sysdb_dspam.dspam_token_data where 1;"
+------------+
| count(uid) |
+------------+
|   13798848 |
+------------+
theia dspam #

25 seconds is nothing. Looking at those 25 seconds from an economic viewpoint,
I can already tell you that I save much more time in total on the messages I
process because I don't have the additional indices on dspam_token_data. Let's
say (just as an example) I would save only 0.1 second per message. Then after
just 250 messages I have already made up for the 25 seconds that dspam_clean
takes. I don't think the additional indices would make dspam_clean much faster
anyway. And even if it took just 1 second instead of 25 seconds, I would still
gain more in mail processing by not using the additional indices. 250 messages
is nothing. I process way, way, way more.


> so less is more, especially
> when there is not a significant performance advantage.
>
I don't really see any additional benefit. Normally one would run the purging
at times when most users are not using DSPAM. So even if the purging took
60 minutes I would still find that acceptable.


> I would think that a full
> table scan would be the cheapest way to purge tokens in a DB, unless you are
> running a very small DSPAM instance and all of your token data is in memory.
> 
A good RDBMS is intelligent enough to have a good caching mechanism without 
additional indices if the database is small.


> Regards,
> Ken
> 
-- 
Kind Regards from Switzerland,

Stevan Bajić

