David F. Skoll wrote:
> Matthew Schumacher wrote:
>
> Have you thought about simply doing:
>
> SELECT * FROM bayes WHERE token in ('tok1', 'tok2', ..., 'tokN')
>
> It seems to me that should be just as fast, and not rely on PostgreSQL
> features or stored procedures.
>
> You have to be careful with messages that have extremely large numbers
> of tokens; you might need to split the query into chunks of 1000
> tokens each or something like that.
>
Yes, we tried that. I attached another version of the proc that gets
rid of the looping altogether, but believe it or not, it's slower.
The reason why we rely on procs is because the SA code doesn't have
transactions yet. By passing the tokens in as an array we get
transactions because pgsql treats each run of a proc as a trans.
>
>>Here is the new benchmark:
>>http://wiki.apache.org/spamassassin/BayesBenchmarkResults
>
>
> Those results are extremely surprising. Our CanIt benchmarks show
> Berkeley DB outperforming PostgreSQL by a factor of 6 to 10, but your
> benchmarks show them about equal. Something is funny there... I
> wonder if it could be that CanIt never locks the BDB files, whereas
> SpamAssassin does? If that's the case, then there's still tremendous
> room for improvement on the BDB side.
Yea the SA bdb code is a locking headache. I have never really coded
against bdb so I have no idea if it's good, bad, or otherwise.
>
> Also, I don't think the "fsync=false" column should even be presented.
> Nobody who cares about his/her data runs PostgreSQL like that, so the
> timings in that column are unachievable in real-world situations.
I agree that it's useless in the real world, but it is interesting to
see how much time the sync takes. Before grouping tokens into procs
(read transactions) sync was 25x slower. Now that they are pretty close
speed wise, that tells you that we are not spending to much time doing a
sync.
>
> Ironically, just as SpamAssassin is making strides with a centralized
> SQL database, in CanIt, we've revised our thinking and started moving
> to distributed BDB databases. :-)
You mean having a separate bdb for each key/val pair? I proposed this
to the sa people for the AWL and they shot it down. Like I said, I'm
ignorant when it comes to bdb.
As much as I love pgsql and think it's light years beyond mysql in
features, stability, flexibility, and even performance (when doing
complex queries against large tables), I am now testing mysql for my
bayes store. Mysql has one good thing going for it, raw speed on simple
queries, which is really the only requirement for bayes.
create index bayes_token_id_idx on bayes_token (id);
CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
intokenary BYTEA[],
inspam_count INTEGER,
inham_count INTEGER,
inatime INTEGER)
RETURNS VOID AS '
BEGIN
UPDATE
bayes_token
SET
spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count , 0),
atime = greatest_int(atime, inatime)
WHERE
id = inuserid
AND
token IN (SELECT bayes_token_tmp FROM bayes_token_tmp(intokenary));
UPDATE
bayes_vars
SET
token_count = token_count + (
SELECT
count(bayes_token_tmp)
FROM
bayes_token_tmp(intokenary)
WHERE
NOT EXISTS (SELECT token FROM bayes_token WHERE token = bayes_token_tmp)),
newest_token_age = greatest_int(newest_token_age, inatime),
oldest_token_age = least_int(oldest_token_age, inatime)
WHERE
id = inuserid;
INSERT INTO
bayes_token
SELECT
inuserid,
bayes_token_tmp,
inspam_count,
inham_count,
inatime
FROM
bayes_token_tmp(intokenary)
WHERE
(inspam_count > 0 OR inham_count > 0)
AND
NOT EXISTS (SELECT token FROM bayes_token WHERE token = bayes_token_tmp);
RETURN;
END;
' LANGUAGE 'plpgsql';
CREATE OR REPLACE FUNCTION bayes_token_tmp(intokenary BYTEA[]) RETURNS SETOF bytea AS
'
BEGIN
for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
LOOP
return next intokenary[i];
END LOOP;
RETURN;
end
'
language 'plpgsql';
CREATE OR REPLACE FUNCTION greatest_int (integer, integer)
RETURNS INTEGER
IMMUTABLE STRICT
AS 'SELECT CASE WHEN $1 < $2 THEN $2 ELSE $1 END;'
LANGUAGE SQL;
CREATE OR REPLACE FUNCTION least_int (integer, integer)
RETURNS INTEGER
IMMUTABLE STRICT
AS 'SELECT CASE WHEN $1 < $2 THEN $1 ELSE $2 END;'
LANGUAGE SQL;
_______________________________________________
Visit http://www.mimedefang.org and http://www.roaringpenguin.com
MIMEDefang mailing list
[email protected]
http://lists.roaringpenguin.com/mailman/listinfo/mimedefang