Hi All, As a SpamAssassin developer, who by my own admission has real problem getting PostgreSQL to work well, I must thank everyone for their feedback on this issue. Believe me when I say what is in the tree now is a far cry from what used to be there, orders of magnitude faster for sure. I think there are several good ideas that have come out of this thread and I've set about attempting to implement them.
Here is a version of the stored procedure, based in large part by the
one written by Tom Lane, that accepts and array of tokens and loops
over them to either update or insert them into the database (I'm not
including the greatest_int/least_int procedures but you've all seen
them before):
CREATE OR REPLACE FUNCTION put_tokens(inuserid INTEGER,
intokenary BYTEA[],
inspam_count INTEGER,
inham_count INTEGER,
inatime INTEGER)
RETURNS VOID AS '
DECLARE
_token BYTEA;
new_tokens INTEGER := 0;
BEGIN
for i in array_lower(intokenary, 1) .. array_upper(intokenary, 1)
LOOP
_token := intokenary[i];
UPDATE bayes_token
SET spam_count = greatest_int(spam_count + inspam_count, 0),
ham_count = greatest_int(ham_count + inham_count, 0),
atime = greatest_int(atime, inatime)
WHERE id = inuserid
AND token = _token;
IF NOT FOUND THEN
-- we do not insert negative counts, just return true
IF NOT (inspam_count < 0 OR inham_count < 0) THEN
INSERT INTO bayes_token (id, token, spam_count,
ham_count, atime)
VALUES (inuserid, _token, inspam_count, inham_count, inatime);
IF FOUND THEN
new_tokens := new_tokens + 1;
END IF;
END IF;
END IF;
END LOOP;
UPDATE bayes_vars
SET token_count = token_count + new_tokens,
newest_token_age = greatest_int(newest_token_age, inatime),
oldest_token_age = least_int(oldest_token_age, inatime)
WHERE id = inuserid;
RETURN;
END;
' LANGUAGE 'plpgsql';
This version is about 32x faster than the old version, with the
default fsync value and autovacuum running in the background.
The next hurdle, and I've just posted to the DBD::Pg list, is
escaping/quoting the token strings. They are true binary strings,
substrings of SHA1 hashes, I don't think the original data set
provided puts them in the right context. They have proved to be
tricky. I'm unable to call the stored procedure from perl because I
keep getting a malformed array litteral error.
Here is some example code that shows the issue:
#!/usr/bin/perl -w
# from a new db, do this first
# INSERT INTO bayes_vars VALUES (1,'nobody',0,0,0,0,0,0,2147483647,0);
use strict;
use DBI;
use DBD::Pg qw(:pg_types);
use Digest::SHA1 qw(sha1);
my $dbh = DBI->connect("DBI:Pg:dbname=spamassassin","postgres") || die;
my @dataary;
# Input is just a list of words (ie /usr/share/dict/words) stop after 150
while(<>) {
chomp;
push(@dataary, substr(sha1($_), -5));
# to see it work with normal string comment out above and uncomment below
# push(@dataary, $_);
last if scalar(@dataary) >= 150;
}
my $datastring = join(",", map { '"' . bytea_esc($_) . '"' }
@dataary);
my $sql = "select put_tokens(1, '{$datastring}', 1, 1, 10000)";
my $sth = $dbh->prepare($sql);
my $rc = $sth->execute();
unless ($rc) {
print "Error: " . $dbh->errstr() . "\n";
}
$sth->finish();
sub bytea_esc {
my ($str) = @_;
my $buf = "";
foreach my $char (split(//,$str)) {
if (ord($char) == 0) { $buf .= "\\\\000"; }
elsif (ord($char) == 39) { $buf .= "\\\\047"; }
elsif (ord($char) == 92) { $buf .= "\\\\134"; }
else { $buf .= $char; }
}
return $buf;
}
Any ideas? or thoughts on the revised procedure? I'd greatly
appriciate them.
Sorry for the length, but hopefully it give a good enough example.
Thanks
Michael Parker
pgpGUVHtkro1y.pgp
Description: PGP signature
