https://issues.apache.org/SpamAssassin/show_bug.cgi?id=4400

--- Comment #17 from Mark Martinec <[email protected]> 2010-06-14 14:22:54 EDT ---
> [...] smallish set of 5000 tokens that accumulated over [...]

Sorry, the figure was 'messages', not 'tokens'. The rest stands.

Now that our Bayes database has grown to 10,000 learned messages
and 200,000 tokens, I repeated the measurements, switching between
the original index scheme on table bayes_token and the one
suggested here, and back.

I observed elapsed times in milliseconds for the tok_get_all (read)
and tok_touch_all (update) operations, which account for all SQL
I/O in the Bayes plugin; the rest is tokenization and computing
probabilities, both of which are just Perl processing with no I/O.
The messages were our regular mail traffic. Results were plotted
as a scattergram of elapsed ms vs. time of day.
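
For anyone who wants to repeat this kind of measurement, here is a
minimal sketch of the idea in Python (not the Perl code actually
used). The helper names timed_ms and record_sample, and the stand-in
function fake_tok_get_all, are hypothetical; only the operation
label tok_get_all comes from the plugin.

```python
import time
from datetime import datetime


def timed_ms(operation, *args, **kwargs):
    """Run operation and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = operation(*args, **kwargs)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def record_sample(samples, label, elapsed_ms, now=None):
    """Append one scattergram point: (seconds since midnight, label, ms)."""
    if now is None:
        now = datetime.now()
    seconds_of_day = now.hour * 3600 + now.minute * 60 + now.second
    samples.append((seconds_of_day, label, elapsed_ms))


def fake_tok_get_all(tokens):
    """Hypothetical stand-in for a database read; returns dummy counts."""
    return {token: (0, 0) for token in tokens}


samples = []
_, ms = timed_ms(fake_tok_get_all, ["a", "b"])
record_sample(samples, "tok_get_all", ms)
```

Each entry in samples is one point for a plot of elapsed ms against
time of day; in a real run the stand-in would be replaced by the
actual database call.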

I must say that the change hardly makes any difference. It is
interesting that both the tok_get_all times and the tok_touch_all
times are multimodal, i.e. the elapsed times are grouped in two or
three regions. The tok_get_all clusters are roughly at 8, 25 and
50 milliseconds, while tok_touch_all times are near 5 and 35 ms.
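
One simple way to pick such clusters out of a list of timings,
rather than reading them off a plot, is to sort the values and
split wherever the gap to the previous value is large. The sketch
below is only an illustration: the function names are hypothetical,
the 5 ms gap threshold is an arbitrary assumption, and the sample
values are made up to sit near the cluster positions mentioned
above. They are not measured data.

```python
def cluster_by_gap(values, max_gap):
    """Group values into clusters, splitting where the gap exceeds max_gap."""
    clusters = []
    for value in sorted(values):
        if clusters and value - clusters[-1][-1] <= max_gap:
            clusters[-1].append(value)
        else:
            clusters.append([value])
    return clusters


def cluster_centers(clusters):
    """Return the arithmetic mean of each cluster."""
    return [sum(cluster) / len(cluster) for cluster in clusters]


# Made-up illustrative timings in ms (NOT measured data).
example_ms = [7.5, 8.0, 8.5, 24.0, 25.0, 26.0, 49.0, 50.0, 51.0]
clusters = cluster_by_gap(example_ms, max_gap=5.0)
centers = cluster_centers(clusters)
```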

Switching the index scheme affects only the upper cluster, and only
slightly: 1 or 2 ms out of 35 ms, and perhaps 4 ms out of 50 ms.
The suggested index scheme saves a little when updating and loses
a little when reading (select). Compared to the total time
spent in Bayes processing, the change is insignificant.
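
To put those differences in relative terms, the arithmetic can be
sketched as follows. Only the millisecond figures come from the
measurements above; the function name percent_of is made up for
this illustration.

```python
def percent_of(delta_ms, cluster_ms):
    """Return delta_ms as a percentage of cluster_ms, rounded to 0.1."""
    return round(100 * delta_ms / cluster_ms, 1)


# tok_touch_all upper cluster: 1 or 2 ms out of 35 ms.
update_low = percent_of(1, 35)
update_high = percent_of(2, 35)

# tok_get_all upper cluster: perhaps 4 ms out of 50 ms.
read_change = percent_of(4, 50)
```

That is roughly 3% to 6% of the upper update cluster and about 8%
of the upper read cluster, and smaller still relative to the total
Bayes processing time.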

It is interesting that the ratio of time spent in SQL I/O vs. the
total time spent in bayes is almost entirely in the 45% .. 55%
range, i.e. about half of the time is due to SQL, the other half
is spent in tokenization and computing the probability.
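
The ratio is simply SQL time divided by total Bayes time per
message. A minimal sketch, with a hypothetical function name and
purely illustrative numbers (not taken from the measurements):

```python
def sql_share(get_ms, touch_ms, total_bayes_ms):
    """Return the fraction of total Bayes time spent in SQL I/O."""
    if total_bayes_ms <= 0:
        raise ValueError("total_bayes_ms must be positive")
    return (get_ms + touch_ms) / total_bayes_ms


# Illustrative values only: 25 ms read + 5 ms update out of 60 ms total.
share = sql_share(25.0, 5.0, 60.0)
in_observed_range = 0.45 <= share <= 0.55
```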

In summary: dropping the unnecessary index and swapping the fields
of the primary key (as suggested here) can save some unnecessary work
for the SQL server, and can save some space, but makes no difference
to SpamAssassin performance, at least in my case of a 200,000-token
database and a PostgreSQL 8.3.11 server.
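
For readers without the schema at hand, the sketch below shows one
plausible reading of the two index schemes. Everything in it is an
assumption made for illustration: the column list, the index name
bayes_token_idx1, and the direction of the key swap are not stated
in this comment, and SQLite (Python's sqlite3 module) stands in for
PostgreSQL only so that the example is self-contained.

```python
import sqlite3

# ASSUMED original layout: key (id, token) plus a separate index on token.
ORIGINAL_SCHEME = """
CREATE TABLE bayes_token (
  id INTEGER NOT NULL,
  token BLOB NOT NULL,
  spam_count INTEGER NOT NULL DEFAULT 0,
  ham_count INTEGER NOT NULL DEFAULT 0,
  atime INTEGER NOT NULL DEFAULT 0,
  PRIMARY KEY (id, token)
);
CREATE INDEX bayes_token_idx1 ON bayes_token (token);
"""

# ASSUMED suggested layout: key columns swapped, extra index dropped.
SUGGESTED_SCHEME = """
CREATE TABLE bayes_token (
  id INTEGER NOT NULL,
  token BLOB NOT NULL,
  spam_count INTEGER NOT NULL DEFAULT 0,
  ham_count INTEGER NOT NULL DEFAULT 0,
  atime INTEGER NOT NULL DEFAULT 0,
  PRIMARY KEY (token, id)
);
"""


def index_columns(schema_sql):
    """Return {index name: [column names]} for bayes_token under a schema."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)
        index_rows = conn.execute("PRAGMA index_list('bayes_token')").fetchall()
        result = {}
        for row in index_rows:
            name = row[1]  # second field of index_list is the index name
            info = conn.execute(f"PRAGMA index_info('{name}')").fetchall()
            result[name] = [col[2] for col in info]  # third field: column name
        return result
    finally:
        conn.close()


original_indexes = index_columns(ORIGINAL_SCHEME)
suggested_indexes = index_columns(SUGGESTED_SCHEME)
```

Under these assumptions the original layout maintains two indexes
and the suggested one maintains a single index, which is where the
saved work and space for the server would come from.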

Since the suggested change is just to a documentation file
(sql/bayes_pg.sql), I would still go for it; it can't hurt.

Other benchmarking on a larger database is welcome...
