Re: [Mimedefang] Spamassassin 3.1 and improved bayes/sql.

Matthew Schumacher Wed, 17 Aug 2005 12:42:28 -0700

David F. Skoll wrote:
> Matthew Schumacher wrote:
> 
> Have you thought about simply doing:
> 
>      SELECT * FROM bayes WHERE token in ('tok1', 'tok2', ..., 'tokN')
> 
> It seems to me that should be just as fast, and not rely on PostgreSQL
> features or stored procedures.
> 
> You have to be careful with messages that have extremely large numbers
> of tokens; you might need to split the query into chunks of 1000
> tokens each or something like that.
>


Yes, we tried that.  I attached another version of the proc that gets
rid of the looping altogether, but believe it or not, it's slower.

The reason why we rely on procs is because the SA code doesn't have
transactions yet.  By passing the tokens in as an array we get
transactions because pgsql treats each run of a proc as a trans.

> 
>>Here is the new benchmark:
>>http://wiki.apache.org/spamassassin/BayesBenchmarkResults
> 
> 
> Those results are extremely surprising.  Our CanIt benchmarks show
> Berkeley DB outperforming PostgreSQL by a factor of 6 to 10, but your
> benchmarks show them about equal.  Something is funny there...  I
> wonder if it could be that CanIt never locks the BDB files, whereas
> SpamAssassin does?  If that's the case, then there's still tremendous
> room for improvement on the BDB side.

Yea the SA bdb code is a locking headache.  I have never really coded
against bdb so I have no idea if it's good, bad, or otherwise.

> 
> Also, I don't think the "fsync=false" column should even be presented.
> Nobody who cares about his/her data runs PostgreSQL like that, so the
> timings in that column are unachievable in real-world situations.

I agree that it's useless in the real world, but it is interesting to
see how much time the sync takes.  Before grouping tokens into procs
(read transactions) sync was 25x slower.  Now that they are pretty close
speed wise, that tells you that we are not spending to much time doing a
sync.

> 
> Ironically, just as SpamAssassin is making strides with a centralized
> SQL database, in CanIt, we've revised our thinking and started moving
> to distributed BDB databases. :-)

You mean having a separate bdb for each key/val pair?  I proposed this
to the sa people for the AWL and they shot it down.  Like I said, I'm
ignorant when it comes to bdb.

As much as I love pgsql and think it's light years beyond mysql in
features, stability, flexibility, and even performance (when doing
complex queries against large tables), I am now testing mysql for my
bayes store.  Mysql has one good thing going for it, raw speed on simple
queries, which is really the only requirement for bayes.

create index bayes_token_id_idx on bayes_token (id);

CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
                                     intokenary BYTEA[],
                                     inspam_count INTEGER,
                                     inham_count INTEGER,
                                     inatime INTEGER)
RETURNS VOID AS ' 
BEGIN

  UPDATE
    bayes_token
  SET
    spam_count = greatest_int(spam_count + inspam_count, 0),
    ham_count = greatest_int(ham_count + inham_count , 0),
    atime = greatest_int(atime, inatime)
  WHERE
    id = inuserid
  AND
    token IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));

  UPDATE
    bayes_vars
  SET
    token_count = token_count + (
      SELECT 
        count(bayes_token_tmp) 
      FROM 
        bayes_token_tmp(intokenary) 
      WHERE 
        NOT EXISTS (SELECT token FROM bayes_token WHERE token = bayes_token_tmp)),
    newest_token_age = greatest_int(newest_token_age, inatime),
    oldest_token_age = least_int(oldest_token_age, inatime)
  WHERE
    id = inuserid;

  INSERT INTO 
    bayes_token 
    SELECT 
      inuserid, 
      bayes_token_tmp,
      inspam_count,
      inham_count,
      inatime
    FROM 
      bayes_token_tmp(intokenary)
    WHERE
      (inspam_count > 0 OR inham_count > 0)
    AND
      NOT EXISTS (SELECT token FROM bayes_token WHERE token = bayes_token_tmp);

  RETURN;
END; 
' LANGUAGE 'plpgsql'; 

CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS SETOF bytea AS
'
BEGIN
  for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
  LOOP
    return next intokenary[i];
  END LOOP;
  RETURN;
end
'
language 'plpgsql';


CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
 LANGUAGE SQL;

CREATE OR REPLACE FUNCTION least_int (integer, integer)
 RETURNS INTEGER
 IMMUTABLE STRICT
 AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
 LANGUAGE SQL;

_______________________________________________
Visit http://www.mimedefang.org and http://www.roaringpenguin.com
MIMEDefang mailing list
[email protected]
http://lists.roaringpenguin.com/mailman/listinfo/mimedefang

Re: [Mimedefang] Spamassassin 3.1 and improved bayes/sql.

Reply via email to