http://bugzilla.spamassassin.org/show_bug.cgi?id=3771
------- Additional Comments From [EMAIL PROTECTED] 2004-11-18 22:36 -------

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 11/18/2004 3:38 PM, Michael Parker wrote:
> On Thu, Nov 18, 2004 at 06:53:19AM -0800, Rupa Schomaker wrote:
>
>> Some questions:
>>
>> Is bytea really necessary? If I follow the path of the patch, the bytea
>> change was done prior to adding the index. Since the tokens are binary
>> data it is probably more correct though, especially if one has an
>> encoding other than SQL_ASCII set for the DB...
>
> Yes, as far as I can tell from the documentation. The fact that we're
> storing the binary value makes it necessary. If I'm misinformed, then
> feel free to point out where in the documentation.

My understanding is that it isn't strictly necessary, but text is more fragile (subject to both the database encoding and the client encoding). This was discussed recently on one of the postgres groups... Looking:

<http://groups.google.com/groups?hl=en&lr=&selm=cndnbc%24otp%241%40FreeBSD.csie.NCTU.edu.tw>
Message-ID: <[EMAIL PROTECTED]>

===
From: Tom Lane ([EMAIL PROTECTED])
Subject: Re: [ADMIN] evil characters #bfef cause dump failure
Date: 2004-11-16 12:19:06 PST

[snip]

BTW, SQL_ASCII is not so much an encoding as the absence of any encoding
choice; it just passes 8-bit data with no interpretation. So it's not
*that* unreasonable a default. You can store UTF8 data in it without any
problem, you just won't have the niceties like detection of bad
character sequences.

			regards, tom lane
===

Leave it as bytea...

>> What do you use to benchmark changes? I'm willing to experiment but
>> would like to have some reproducible results for ya...
>
> It's not really ready for real world consumption and time has been
> short for getting it ready. You can read a little about it here:
> http://wiki.apache.org/spamassassin/BayesBenchmark
>
> Hopefully, I'll get some free time soon and get it into the SA tree.

I'll take a look at it when I get a chance.
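[Editorial sketch: the encoding concern above can be made concrete. With a bytea column PostgreSQL stores the token bytes verbatim, while a text column would run them through database/client encoding conversion. Column names below follow the bayes_token table discussed in this thread; the exact DDL shipped with SpamAssassin may differ.]

```sql
-- Hypothetical sketch: bytea stores the binary token hash as-is,
-- with no database- or client-encoding interpretation.
CREATE TABLE bayes_token (
  id         integer NOT NULL DEFAULT 0,
  token      bytea   NOT NULL DEFAULT '',  -- binary token, not text
  spam_count integer NOT NULL DEFAULT 0,
  ham_count  integer NOT NULL DEFAULT 0,
  atime      integer NOT NULL DEFAULT 0,
  PRIMARY KEY (id, token)
);
```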
Some more testing/observations with sa-learn only.

BTW: do you want me to move this discussion to the ticket in bugzilla? Or we can wait till I/we have a summary...

General notes:

1) Why not a unique index that mimics the primary key (though in (token, id) order, not (id, token))? Won't matter in my case (since I run as one user) and probably doesn't matter at all unless running with lots 'n lots of users...

2) bayes_seen.msgid should be type 'text' -- sa-learn (and others) don't truncate to 200.

3) I also get differences in the backup file.

-rw-r--r--  1 rupa  users  13047214 Nov 18 13:23 backup_dbm.txt
-rw-r--r--  1 rupa  users  13047202 Nov 18 17:16 backup_new.txt

An actual diff is probably meaningless since I doubt order is guaranteed between a dbm and sql. I did the diff and quickly gave up. I suppose the data could be sorted from both sources and then compared?

Some 'benchmarks' of sa-learn. Single run:

bayes_seen:  202863 rows
bayes_token: 150842 rows

System is:

model name : AMD Athlon(tm) XP 2600+
MemTotal:    1031916 kB
debian unstable

with a fairly large workload from a memory standpoint, but the CPU generally fairly idle. Postgres hasn't been tuned "much" -- I have to reset the stats in postgres and do some analysis...

1) Shipped config with msgid='text' on my backup file:

real 24m35.663s

2) Shipped config with indices added:

real 32m33.931s

Ekk! Analyze; delete; rerun: still 30 min. hrmmm.. But I know it runs better in normal operation. Oh well *shrug* -- must be the index update, even though the check constraint doesn't need a table scan.

3) Patch (2004-10-31 18:53) applied, re-created tables:

real 14m29.793s

Analyze, delete, rerun: 15 min. A bit better.

BTW: using dbm the full restore takes 23s...

Time to add some small amount of stats to sa-learn (or underlying) to see where we're spending time... Added some more timing points and dbg() output to SQL.pm. Needs Time::HiRes, which is bundled in perl 5.8.x but is an optional add-on for earlier stuff.
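[Editorial sketch: general notes 1) and 2) above would translate to SQL roughly like the following. Index and table names are illustrative, not the shipped schema, and ALTER COLUMN ... TYPE is only available on newer PostgreSQL; older releases would need a table rebuild instead.]

```sql
-- 1) A unique index mirroring the primary key but leading with token,
--    so lookups that know only the token can still use the index.
CREATE UNIQUE INDEX bayes_token_idx1 ON bayes_token (token, id);

-- 2) Widen bayes_seen.msgid so message-ids longer than 200 characters
--    are not silently truncated on insert.
ALTER TABLE bayes_seen ALTER COLUMN msgid TYPE text;
```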
Ok, with my large set:

Token inserts start at around 1-2s per 1000 and rise to 7-8s per 1000. Seen inserts start at around 1s per 1000 and stay there.

I can think of ways to optimize sa-learn (do it all in one TX rather than one TX per insert; assume an insert rather than using the generic query-then-insert path for _put_token()), but the restore is only done once anyway, and the changes would be invasive rather than just re-using existing logic... Not worth it.

It is, however, a reasonable test of the insert/update logic of learning a single message (whether auto-learn or manual). It doesn't test the query side though...

> Michael

--
-Rupa

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBnYS/L3Aub+krmycRAuioAJ9bh224fxsAvUTX9liLQ1pf/wYIVACgxBDQ
SllANDuelO8OWEwqOWZ9FsM=
=1cIx
-----END PGP SIGNATURE-----

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
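[Editorial sketch: the one-transaction restore idea mentioned in the comment above amounts, at the SQL level, to wrapping all the restore's inserts in a single explicit transaction instead of letting each INSERT autocommit. The values below are illustrative only.]

```sql
-- One commit for the whole batch, instead of one commit (and one
-- round of WAL flushing) per inserted row.
BEGIN;
INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
  VALUES (1, '\\001\\002\\003\\004\\005', 10, 2, 1100000000);
INSERT INTO bayes_token (id, token, spam_count, ham_count, atime)
  VALUES (1, '\\006\\007\\010\\011\\012', 0, 7, 1100000000);
-- ... thousands more rows ...
COMMIT;
```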