http://bugzilla.spamassassin.org/show_bug.cgi?id=3872
Summary: SA 3.0 randomly creates an extremely large bayes_journal
Product: Spamassassin
Version: 3.0.0
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P5
Component: spamassassin
AssignedTo: [email protected]
ReportedBy: [EMAIL PROTECTED]
Since upgrading to SA 3.0 (starting with RC2), we occasionally get SA timeouts
when processing mail. At first I thought the bayes_journal was the cause, but
it is also possible that the underlying cause is the Bayes expiry method (see
below for more) and that the journal growing to extreme sizes is somehow a
side effect of that. SA is used via MailScanner (currently 4.32.5), Perl is
5.8.1, and the OS is SuSE 9.0. After going from RC2 to RC4 it took about three
weeks for the problem to reappear, which is why I didn't file a bug earlier. I
went from RC4 to the final release last week and it happened again yesterday
and today. Maybe I just need to update a certain Perl module?
What happens is that the bayes_journal is suddenly huge, ranging from roughly
150 MB to 300 MB. So it doesn't grow gradually over time (which could mean it
never gets synced); instead it happens "suddenly". My first thought was that a
sync is then triggered and, because of the size of the file, it takes so long
that MailScanner's timeout (120 seconds) kicks in. Afterwards, when I look in
the directory, I most often see either a big journal file, a big journal.old
file, or both. Sometimes I see nothing unusual at all; in that case the sync
apparently finished successfully *after* the timeout, which surprises me a
bit: since MailScanner uses the SA library, processing should stop immediately
at the timeout.
When a big bayes_journal file is present, I can run "sa-learn --sync", and
after a long time (about the same 120 seconds as the timeout setting) it
finishes successfully.
Also, at the time the timeout occurs *and* afterwards, several hundred MB of
swap space are shown as in use, although there should be enough (or almost
enough) free RAM to handle the big journal file. The swap space is only
reclaimed when I restart MailScanner. My impression is that the problem
reappears sooner when I skip the restart and only run the sync, but that's
just a feeling.
Here are some snapshots of the bayes dir. Note that bayes_toks stays the same
size the whole time. Is this a hint? (see below for more)
-rw-rw-rw- 1 spamd www 36 Sep 11 12:43 bayes.mutex
-rw-rw-rw- 1 root www 12968 Sep 11 13:13 bayes_journal
-rw-rw-rw- 1 root www 170591392 Sep 11 12:04 bayes_journal.old
-rw-rw-rw- 1 spamd www 2408448 Sep 11 11:47 bayes_seen
-rw-rw-rw- 1 spamd www 20951040 Sep 11 11:47 bayes_toks

-rw-rw-rw- 1 spamd www 42 Sep 26 10:31 bayes.mutex
-rw-rw-rw- 1 root www 113375340 Sep 26 10:31 bayes_journal
-rw-rw-rw- 1 root www 226743232 Sep 26 10:26 bayes_journal.old
-rw-rw-rw- 1 spamd www 2482176 Sep 26 10:26 bayes_seen
-rw-rw-rw- 1 spamd www 20951040 Sep 26 10:26 bayes_toks

-rw-rw-rw- 1 spamd www 42 Oct 1 00:01 bayes.mutex
-rw-rw-rw- 1 root www 91219612 Oct 1 00:02 bayes_journal
-rw-rw-rw- 1 spamd www 2482176 Oct 1 00:02 bayes_seen
-rw-rw-rw- 1 spamd www 20951040 Oct 1 00:02 bayes_toks

-rw-rw-rw- 1 spamd www 42 Oct 5 13:01 bayes.mutex
-rw-rw-rw- 1 root www 167110424 Oct 5 13:05 bayes_journal
-rw-rw-rw- 1 spamd www 2482176 Oct 5 12:25 bayes_seen
-rw-rw-rw- 1 spamd www 20951040 Oct 5 13:03 bayes_toks
A typical free output, as mentioned above:
total used free shared buffers cached
Mem: 514948 364916 150032 0 13400 51128
-/+ buffers/cache: 300388 214560
Swap: 514040 261756 252284
I also checked today whether there could be other problems, and there are.
Output of "sa-learn --dump magic":
0.000 0 3 0 non-token data: bayes db version
0.000 0 21714 0 non-token data: nspam
0.000 0 6950 0 non-token data: nham
0.000 0 784903 0 non-token data: ntokens
0.000 0 1052059392 0 non-token data: oldest atime
0.000 0 1096983312 0 non-token data: newest atime
0.000 0 1096983409 0 non-token data: last journal sync atime
0.000 0 1094773234 0 non-token data: last expiry atime
0.000 0 29754654 0 non-token data: last expire atime delta
0.000 0 36 0 non-token data: last expire reduction count
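Since the atimes in the dump are Unix timestamps, converting them shows how
wide a time range the tokens cover (a small illustration using the values
copied from the dump above, nothing more):

```python
from datetime import datetime, timezone

# Values taken from the "sa-learn --dump magic" output above.
oldest_atime = 1052059392   # "oldest atime"
newest_atime = 1096983312   # "newest atime"

print(datetime.fromtimestamp(oldest_atime, tz=timezone.utc).date())  # 2003-05-04
print(datetime.fromtimestamp(newest_atime, tz=timezone.utc).date())  # 2004-10-05
print((newest_atime - oldest_atime) // 86400, "days")                # 519 days
```

So the db spans roughly a year and a half of token atimes.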
sa-learn --force-expire excerpt:
debug: bayes: expiry check keep size, 0.75 * max: 750000
debug: bayes: token count: 784903, final goal reduction size: 34903
debug: bayes: First pass? Current: 1096983709, Last: 1094773234, atime: 29754654, count: 36, newdelta: 30689, ratio: 969.527777777778, period: 43200
debug: bayes: Can't use estimation method for expiry, something fishy, calculating optimal atime delta (first pass)
debug: bayes: expiry max exponent: 9
debug: bayes: atime token reduction
debug: bayes: ======== ===============
debug: bayes: 43200 781347
debug: bayes: 86400 776905
debug: bayes: 172800 771548
debug: bayes: 345600 764197
debug: bayes: 691200 751936
debug: bayes: 1382400 731870
debug: bayes: 2764800 689944
debug: bayes: 5529600 657689
debug: bayes: 11059200 649542
debug: bayes: 22118400 545825
debug: bayes: couldn't find a good delta atime, need more token difference, skipping expire.
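As I read the debug output, the first pass tries ten atime deltas, doubling
from a 12-hour period, and gives up when none of them would remove roughly the
goal amount of tokens. A rough sketch of that search as I understand it (my
own reconstruction, NOT the actual SpamAssassin code; the acceptance window is
an assumption on my part):

```python
# Rough reconstruction of the first-pass expiry delta search; not the
# real SpamAssassin code. The acceptance window (goal .. 1.5 * goal)
# is an assumption.
def first_pass_delta(token_atimes, newest_atime, goal_reduction,
                     period=43200, max_exponent=9):
    for exponent in range(max_exponent + 1):
        delta = period * 2 ** exponent          # 12h, 24h, ..., 256 days
        # How many tokens would this delta expire?
        reduction = sum(1 for atime in token_atimes
                        if newest_atime - atime > delta)
        if goal_reduction <= reduction <= 1.5 * goal_reduction:
            return delta                        # usable expiry delta found
    return None                                 # skip expire, as in the log
```

In our case even the largest delta (22118400 s = 256 days) would still remove
545825 tokens against a goal of 34903, so every try overshoots and the search
comes up empty.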
So, this problem is still there in SA 3. Of course, this could also be the
cause of the timeouts, but then how do the big bayes_journal files come into
play? Also, I would think that if the expiry causes the timeout, it should
happen on *each* message processed until the expiry succeeds (which it can't),
shouldn't it? So this might indeed be the core of the problem, or just a
coincidence.
As for the expire problem, if I understand it correctly, it happens because
the search stops after ten tries. Going back 22118400 seconds lands in January
2004, but apparently there are still some 500,000 tokens older than that date
(the oldest data is from May 2003). If I'm correct, the problem only occurs
because we chose to maintain a Bayes db over a long period, so that no expiry
was done at all for a long time. Then, when an expiry is finally due, it
doesn't work because the tokens are spread over such a wide time range. I
checked others of our dbs (all slightly under the expiry threshold) and found
the same problem. When I reduced the expiry threshold from 750,000 to 100,000
on one of them, the expiry worked, because the tenth try happened to fit. Of
course, that shrunk the db from 715,000 to 162,000 tokens. I still think this
is a basic flaw in the way Bayes expiry works (we already discussed this early
this year on the users mailing list), because if I'm right, anyone who wants
to keep a bigger db over a longer time period is going to run into this
trouble.
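For reference, the workaround on that one db was simply lowering the expiry
target in local.cf before forcing the expiry. If I understand the option
correctly, the keep size in the debug output is 0.75 of
bayes_expiry_max_db_size, so something along these lines (the exact value is
from my test, and whether this is acceptable long-term is another question):

```
# local.cf excerpt -- lowered so that 0.75 * max gives a keep size of
# about 100,000 instead of the 750,000 shown in the debug output above
bayes_expiry_max_db_size 133000
```

Then run "sa-learn --force-expire" again.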
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.