[Bug 3771] PostgreSQL Specific Bayes Storage Module

bugzilla-daemon 19 Nov 2004 07:07:18 -0000

http://bugzilla.spamassassin.org/show_bug.cgi?id=3771

------- Additional Comments From [EMAIL PROTECTED]  2004-11-18 22:48 -------
Subject: Re:  PostgreSQL Specific Bayes Storage Module

On Thu, Nov 18, 2004 at 10:36:45PM -0800, [EMAIL PROTECTED] wrote:
> > 
> >>Some questions:
> >>
> >>Is bytea really necessary?  If I follow the path of the patch, the bytea
> >>change was done prior to adding the index.  Since the tokens are binary
> >>data it is probably more correct though, especially if one has an
> >>encoding other than SQL_ASCII set for the DB...
> > 
> > 
> > Yes, as far as I can tell from the documentation.  The fact that we're
> > storing the binary value makes it necessary.  If I'm misinformed, then
> > feel free to point out where in the documentation.
> 
> My understanding is that bytea isn't strictly necessary, but a text
> column is more fragile (subject to the database encoding and the client
> encoding).  This was discussed recently on one of the postgres groups...
> Looking:
> 
> <http://groups.google.com/groups?hl=en&lr=&selm=cndnbc%24otp%241%40FreeBSD.csie.NCTU.edu.tw>
> Message-ID: <[EMAIL PROTECTED]>
> 
> ===
> From: Tom Lane ([EMAIL PROTECTED])
> Subject: Re: [ADMIN] evil characters #bfef cause dump failure
> Date: 2004-11-16 12:19:06 PST
> 
> [snip]
> BTW, SQL_ASCII is not so much an encoding as the absence of any encoding
> choice; it just passes 8-bit data with no interpretation.  So it's not
> *that* unreasonable a default.  You can store UTF8 data in it without
> any problem, you just won't have the niceties like detection of bad
> character sequences.
> 
>    regards, tom lane
> ===
> 
> Leave it as bytea...
> 

Interesting.  I think my main concern was that BYTEA was the only way
to make sure you kept any trailing whitespace (which we do get), so it
had to be used.  Like I said, I'm far from a PostgreSQL expert, so I'd
gladly be proven wrong.
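
A minimal sketch of why a binary-safe column matters here. It assumes
tokens are short raw digests (the last 5 bytes of a SHA-1 hash is a
hypothetical format chosen for illustration, as is the make_token
helper). It only counts how many such tokens are not valid UTF-8 and how
many end in a whitespace byte; it does not talk to PostgreSQL.

```python
import hashlib


def make_token(word: str) -> bytes:
    # Hypothetical token format: last 5 bytes of the SHA-1 digest.
    return hashlib.sha1(word.encode("utf-8")).digest()[-5:]


def is_valid_utf8(raw: bytes) -> bool:
    # True only if the bytes could be stored as-is in a UTF-8 text value.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


tokens = [make_token(f"word{i}") for i in range(2000)]

# Tokens that a UTF-8 validating text path would reject.
not_utf8 = [t for t in tokens if not is_valid_utf8(t)]

# Tokens whose last byte is whitespace, so trimming would change them.
ends_in_space = [t for t in tokens if t != t.rstrip(b" \t\r\n")]

print(f"{len(not_utf8)} of {len(tokens)} tokens are not valid UTF-8")
print(f"{len(ends_in_space)} of {len(tokens)} tokens end in a whitespace byte")
```

Either property is enough to make a text-typed path risky for this kind
of data, which is consistent with the "leave it as bytea" advice quoted
above.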

> 1) Why not a unique index that mimics the primary key (though do it in
> token,id order not id,token)?  Won't matter in my case (since I run as
> one user) and probably doesn't matter at all unless running with lots 'n
> lots of users...

Didn't realize it was necessary.

> 2) bayes_seen.msgid should be type 'text' -- sa-learn (and others) don't
> truncate to 200.

We should just truncate in the code.  Maybe the column needs to be a
little bigger, but add a hard substr to the code anyway.
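
A minimal sketch of the hard truncation idea.  The 200 limit is taken
from the number mentioned above; the names MAX_MSGID_LEN and clamp_msgid
are made up for illustration and are not from the patch.

```python
MAX_MSGID_LEN = 200  # assumed column width, from the "200" mentioned above


def clamp_msgid(msgid: str, limit: int = MAX_MSGID_LEN) -> str:
    # Hard truncation: two long ids that share the first `limit`
    # characters become indistinguishable after this call.
    return msgid[:limit]


long_id = "<" + "x" * 500 + "@example.com>"
print(len(clamp_msgid(long_id)))
print(clamp_msgid("<short@example.com>"))
```

For this to work the same clamp would have to be applied on both the
store and the lookup path, otherwise a long id stored truncated would
never match on a later query.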

> 3) I also get differences in the backup file.
> 
> -rw-r--r--  1 rupa users 13047214 Nov 18 13:23 backup_dbm.txt
> -rw-r--r--  1 rupa users 13047202 Nov 18 17:16 backup_new.txt
> 
> An actual diff is probably meaningless since I doubt order is guaranteed
> between a dbm and sql.  I did the diff and quickly gave up.  I suppose
> the data could be ordered from both sources and then compared?
> 

This is a problem; see the bug for a short discussion.  There are
definitely some differences in the output that should not be there.
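
The "order both sources and then compare" idea could be sketched like
this.  The function name compare_unordered and the sample lines are
invented for illustration; they are not real backup output.

```python
from collections import Counter


def compare_unordered(lines_a, lines_b):
    """Return (only_in_a, only_in_b) as sorted lists, ignoring line order."""
    # Counters keep duplicates, so a line that appears twice in one dump
    # and once in the other is still reported.
    count_a = Counter(line.rstrip("\n") for line in lines_a)
    count_b = Counter(line.rstrip("\n") for line in lines_b)
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


# Made-up dumps with the same records in a different order and one
# record that differs.
dbm_dump = ["alpha 1\n", "beta 2\n", "gamma 3\n"]
sql_dump = ["gamma 3\n", "alpha 1\n", "beta 3\n"]

only_dbm, only_sql = compare_unordered(dbm_dump, sql_dump)
print(only_dbm)
print(only_sql)
```

Any iterable of lines works as input, so the two backup files could be
passed in as open file objects.  What comes back is only the records
that differ, which should be far easier to read than a positional diff.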

> Ok, with my large set:
> 
> Token inserts start at around 1-2s per 1000 and rise to 7-8s per 1000.
> 
> Seen inserts start at around 1s per 1000 and stay there.
> 

I started running the auto analyzer to keep the statistics up to date;
this helps keep the insert rate from trailing off later in the run.

> I can think of ways to optimize sa-learn (do it all in one TX rather
> than 1 TX per insert, and assume an insert rather than using the generic
> query-then-insert path for _put_token()), but the restore is only done
> once anyway and the changes would be invasive rather than just re-using
> existing logic....  Not worth it.

Yeah, it would require a fairly large change all around.
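
For reference, the one-TX-versus-one-TX-per-insert difference looks
roughly like this.  This is only a sketch: it uses Python's built-in
sqlite3 as a stand-in for PostgreSQL, and the table layout and function
names are hypothetical, not the module's actual schema or code.

```python
import sqlite3


def make_db():
    # isolation_level=None puts the sqlite3 module in autocommit mode,
    # so transactions are controlled only by explicit BEGIN/COMMIT.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE tokens (id INTEGER, token BLOB, PRIMARY KEY (token, id))"
    )
    return conn


def restore_per_insert(conn, rows):
    # One transaction per row: every insert pays the commit cost.
    for row in rows:
        conn.execute("BEGIN")
        conn.execute("INSERT INTO tokens (id, token) VALUES (?, ?)", row)
        conn.execute("COMMIT")


def restore_single_tx(conn, rows):
    # One transaction for the whole restore: a single commit at the end,
    # and a rollback leaves the table untouched if anything fails.
    conn.execute("BEGIN")
    try:
        conn.executemany("INSERT INTO tokens (id, token) VALUES (?, ?)", rows)
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


rows = [(1, i.to_bytes(5, "big")) for i in range(1000)]
db_a, db_b = make_db(), make_db()
restore_per_insert(db_a, rows)
restore_single_tx(db_b, rows)

count_a = db_a.execute("SELECT COUNT(*) FROM tokens").fetchone()[0]
count_b = db_b.execute("SELECT COUNT(*) FROM tokens").fetchone()[0]
print(count_a, count_b)
```

Both paths end with the same rows; the difference is only in how many
commits are issued.  The sketch makes no claim about how much time that
would save against a real PostgreSQL server.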

Michael




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
