Matthew Schumacher wrote:

> Tom Lane wrote:
>
>> I looked into this a bit.  It seems that the problem when you wrap the
>> entire insertion series into one transaction is associated with the fact
>> that the test does so many successive updates of the single row in
>> bayes_vars.  (VACUUM VERBOSE at the end of the test shows it cleaning up
>> 49383 dead versions of the one row.)  This is bad enough when it's in
>> separate transactions, but when it's in one transaction, none of those
>> dead row versions can be marked "fully dead" yet --- so for every update
>> of the row, the unique-key check has to visit every dead version to make
>> sure it's dead in the context of the current transaction.  This makes
>> the process O(N^2) in the number of updates per transaction.  Which is
>> bad enough if you just want to do one transaction per message, but it's
>> intolerable if you try to wrap the whole bulk-load scenario into one
>> transaction.
>>
>> I'm not sure that we can do anything to make this a lot smarter, but
>> in any case, the real problem is to not do quite so many updates of
>> bayes_vars.
>>
>> How constrained are you as to the format of the SQL generated by
>> SpamAssassin?  In particular, could you convert the commands generated
>> for a single message into a single statement?  I experimented with
>> passing all the tokens for a given message as a single bytea array,
>> as in the attached, and got almost a factor of 4 runtime reduction
>> on your test case.
>>
>> BTW, it's possible that this is all just a startup-transient problem:
>> once the database has been reasonably well populated, one would expect
>> new tokens to be added infrequently, and so the number of updates to
>> bayes_vars ought to drop off.
>>
>>             regards, tom lane
>
> SpamAssassin's bayes code calls the _put_token method in the storage
> module in a loop.  This means that the storage module isn't called once
> per message, but once per token.

Well, putting everything into a transaction per email might make your pain
go away.  If you saw the email I just sent, I modified your data.sql file
to add a "COMMIT;BEGIN" every 1000 selects, and I saw a performance jump
from 18 minutes down to less than 2 minutes.  Heck, on my machine the
advanced perl version takes more than 2 minutes to run, so it is actually
slower than the data.sql with the commit statements added.
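For reference, the change to data.sql was nothing fancier than forcing a
transaction boundary every so often.  Roughly like this sketch; the
per-token statements and the put_token name are placeholders here, not the
real SQL that SpamAssassin generates:

    -- Same per-token statements, with a transaction boundary forced
    -- every ~1000 of them instead of autocommitting each one.
    BEGIN;
    SELECT put_token('token-0001', 1, 0);
    SELECT put_token('token-0002', 1, 0);
    -- ... roughly 1000 statements per batch ...
    COMMIT;
    BEGIN;
    -- ... next batch of ~1000 statements ...
    COMMIT;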
> I'll look into modifying it so that the bayes code passes a hash of
> tokens to the storage module where they can loop, or, in the case of the
> pgsql module, pass an array of tokens to a procedure where we loop and
> use temp tables to make this much more efficient.

Well, you could do that.  Or you could just have the bayes code issue
"BEGIN;" when it starts processing an email and a "COMMIT;" when it
finishes (a rough sketch is at the end of this message).  From my testing,
you will see an enormous speed improvement.  (And you might consider
including a fairly frequent VACUUM ANALYZE.)

> I don't have much time this weekend to toss at this, but will be looking
> at it on Monday.

Good luck,
John
=:->

> Thanks,
>
> schu
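Here is the rough sketch of the per-message transaction mentioned above.
It only shows the shape of the thing; the put_token statements stand in
for whatever the storage module actually sends for each token:

    -- One transaction per message: BEGIN before the first token update,
    -- COMMIT after the last, instead of autocommitting every statement.
    BEGIN;
    SELECT put_token('token-0001', 1, 0);  -- _put_token called once per token
    SELECT put_token('token-0002', 1, 0);
    -- ... remaining tokens for this message ...
    COMMIT;

    -- Run now and then, outside any transaction, so the dead row versions
    -- in bayes_vars get cleaned up and the planner statistics stay current:
    VACUUM ANALYZE bayes_vars;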
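And for the array-of-tokens idea that Tom and Matthew both touch on, a
server-side procedure could look something like the sketch below.  The
function name, table layout, and column names are guesses rather than the
real bayes schema, and the real thing would also have to handle ham
counts, atimes, and so on:

    -- Sketch of a per-message procedure: one call carries all of a
    -- message's tokens, the loop runs inside the backend, and bayes_vars
    -- is touched once per message instead of once per token.
    CREATE OR REPLACE FUNCTION put_tokens(in_userid     integer,
                                          in_tokens     text[],
                                          in_spam_count integer)
    RETURNS void AS $$
    DECLARE
        tok        text;
        new_tokens integer := 0;
    BEGIN
        FOR i IN 1 .. coalesce(array_upper(in_tokens, 1), 0) LOOP
            tok := in_tokens[i];
            UPDATE bayes_token
               SET spam_count = spam_count + in_spam_count
             WHERE id = in_userid AND token = tok;
            IF NOT FOUND THEN
                INSERT INTO bayes_token (id, token, spam_count)
                VALUES (in_userid, tok, in_spam_count);
                new_tokens := new_tokens + 1;
            END IF;
        END LOOP;

        -- a single bayes_vars update per message
        UPDATE bayes_vars
           SET token_count = token_count + new_tokens
         WHERE id = in_userid;
    END;
    $$ LANGUAGE plpgsql;

It would be called once per message, something like
SELECT put_tokens(1, ARRAY['token-0001', 'token-0002'], 1);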