Re: [PERFORM] Performance problems testing with Spamassassin 3.1.0

John Arbash Meinel Sat, 30 Jul 2005 22:42:12 -0700

Matthew Schumacher wrote:

>Tom Lane wrote:
>
>>I looked into this a bit.  It seems that the problem when you wrap the
>>entire insertion series into one transaction is associated with the fact
>>that the test does so many successive updates of the single row in
>>bayes_vars.  (VACUUM VERBOSE at the end of the test shows it cleaning up
>>49383 dead versions of the one row.)  This is bad enough when it's in
>>separate transactions, but when it's in one transaction, none of those
>>dead row versions can be marked "fully dead" yet --- so for every update
>>of the row, the unique-key check has to visit every dead version to make
>>sure it's dead in the context of the current transaction.  This makes
>>the process O(N^2) in the number of updates per transaction.  Which is
>>bad enough if you just want to do one transaction per message, but it's
>>intolerable if you try to wrap the whole bulk-load scenario into one
>>transaction.
>>
>>I'm not sure that we can do anything to make this a lot smarter, but
>>in any case, the real problem is to not do quite so many updates of
>>bayes_vars.
>>
>>How constrained are you as to the format of the SQL generated by
>>SpamAssassin?  In particular, could you convert the commands generated
>>for a single message into a single statement?  I experimented with
>>passing all the tokens for a given message as a single bytea array,
>>as in the attached, and got almost a factor of 4 runtime reduction
>>on your test case.
>>
>>BTW, it's possible that this is all just a startup-transient problem:
>>once the database has been reasonably well populated, one would expect
>>new tokens to be added infrequently, and so the number of updates to
>>bayes_vars ought to drop off.
>>
>>                      regards, tom lane
>
>The SpamAssassin Bayes code calls the _put_token method in the storage
>module in a loop.  This means that the storage module isn't called once per
>message, but once per token.
>  
>
Well, putting everything into one transaction per email might make your
pain go away.
If you saw the email I just sent, I modified your data.sql file to add a
"COMMIT;BEGIN" every 1000 selects, and the run time dropped from 18
minutes to less than 2 minutes. Heck, on my machine, the advanced
perl version takes more than 2 minutes to run, so it is actually slower
than the data.sql with commit statements.
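
As a rough sketch of that batching (Python used only for illustration; the
function name batch_transactions and the statement strings are made up here,
not the real contents of data.sql), splitting a long run of statements into
transactions of a fixed size could look like this:

```python
def batch_transactions(statements, batch_size=1000):
    """Return the statements wrapped in transactions of at most batch_size each.

    Hypothetical helper: it only rearranges strings and knows nothing
    about what the statements actually do.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    out = []
    pending = 0  # statements in the currently open transaction
    for stmt in statements:
        if pending == 0:
            out.append("BEGIN;")  # open a transaction lazily
        out.append(stmt)
        pending += 1
        if pending == batch_size:
            out.append("COMMIT;")  # close the batch
            pending = 0
    if pending:
        out.append("COMMIT;")  # close a final, partial batch
    return out


# Placeholder statements standing in for the selects in data.sql.
example = batch_transactions(["s1;", "s2;", "s3;"], batch_size=2)
```

The batch size of 1000 is just the value I happened to test with; it has not
been tuned.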


>I'll look into modifying it so that the Bayes code passes a hash of
>tokens to the storage module, where they can be looped over, or, in the
>case of the pgsql module, passed as an array of tokens to a procedure that
>loops and uses temp tables to make this much more efficient.
>  
>
Well, you could do that. Or you could just have the Bayes code issue
"BEGIN;" when it starts processing an email, and a "COMMIT;" when it
finishes. From my testing, you will see an enormous speed improvement.
(You might also consider running VACUUM ANALYZE fairly frequently, since
the repeated updates of bayes_vars leave a lot of dead row versions behind.)
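
To make the one-transaction-per-email idea concrete, here is a minimal
sketch. It is not the SpamAssassin storage module: it uses Python's built-in
sqlite3 as a stand-in for PostgreSQL so it runs on its own, and the table
token_counts and the function store_message_tokens are invented for the
example. The point is only the shape: every per-token write for one message
sits between a single BEGIN and a single COMMIT.

```python
import sqlite3


def store_message_tokens(conn, tokens):
    """Record all tokens of one message inside a single transaction."""
    # "with conn" commits if the block succeeds and rolls back on an
    # exception, so the whole message is applied or none of it is.
    with conn:
        for token in tokens:
            cur = conn.execute(
                "UPDATE token_counts SET seen = seen + 1 WHERE token = ?",
                (token,),
            )
            if cur.rowcount == 0:
                # Token not seen before: insert it with a count of 1.
                conn.execute(
                    "INSERT INTO token_counts (token, seen) VALUES (?, 1)",
                    (token,),
                )


# Hypothetical schema and data, just enough to exercise the function.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE token_counts (token TEXT PRIMARY KEY, seen INTEGER NOT NULL)"
)
store_message_tokens(conn, ["free", "offer", "free"])
```

The per-token loop stays as it is; only the transaction boundary moves out to
the message level.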

>I don't have much time this weekend to toss at this, but will be looking
>at it on Monday.
>  
>
Good luck,
John
=:->

>Thanks,
>
>schu
