http://bugzilla.spamassassin.org/show_bug.cgi?id=3872

           Summary: SA 3.0 creates randomly extreme big bayes_journal
           Product: Spamassassin
           Version: 3.0.0
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P5
         Component: spamassassin
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


Since using SA 3.0, starting with RC2, we have occasionally been getting SA timeouts when 
processing mail, apparently caused by the bayes_journal. At least that was my first 
thought; it is also possible that the underlying cause is the bayes expiry 
method (see below for more) and that the journal growing to extreme sizes is 
somehow a side effect of that. SA is used via MailScanner (currently 4.32.5), Perl 
is 5.8.1, the OS is SuSE 9.0. After going from RC2 to RC4 it took about three weeks 
for the problem to reappear, which is why I didn't file a bug earlier. I went from RC4 
to the final (RTM) release last week and it happened again yesterday and today. Maybe 
I just need to update a certain Perl module?

What happens is that the bayes_journal is suddenly extremely large, ranging from 
roughly 150 MB to 300 MB. It doesn't grow gradually over time (which would suggest 
it never gets synced); it happens suddenly. My first thought was that the sync is 
then triggered and, because of the size of the file, takes so long that the 
timeout in MailScanner (120 seconds) kicks in. Afterwards, when I look in the 
directory, I most often see either a big journal file, a big journal.old file, or 
both. Sometimes I see nothing unusual at all; apparently the sync then finished 
successfully *after* the timeout, which surprises me a bit, since MailScanner 
uses the SA library and processing should stop immediately at the timeout.
When a big bayes_journal file is present, I can run "sa-learn --sync", and after 
a long time (roughly the 120-second timeout) it finishes successfully.
Also, at the time the timeout occurs *and* afterwards, several hundred MB of 
swap space are shown as in use, although there should be enough (or almost 
enough) free RAM to handle the big journal file. The swap space is only 
reclaimed when I restart MailScanner. The problem seems likely to reappear 
soon if I don't do the restart (just the sync), but that's just a feeling.
Here are some examples of the bayes dir. Note that bayes_toks stays the same 
size the whole time. Is this a hint? (see below for more)

-rw-rw-rw-    1 spamd    www            36 Sep 11 12:43 bayes.mutex
-rw-rw-rw-    1 root     www         12968 Sep 11 13:13 bayes_journal
-rw-rw-rw-    1 root     www      170591392 Sep 11 12:04 bayes_journal.old
-rw-rw-rw-    1 spamd    www       2408448 Sep 11 11:47 bayes_seen
-rw-rw-rw-    1 spamd    www      20951040 Sep 11 11:47 bayes_toks

-rw-rw-rw-    1 spamd    www            42 Sep 26 10:31 bayes.mutex
-rw-rw-rw-    1 root     www      113375340 Sep 26 10:31 bayes_journal
-rw-rw-rw-    1 root     www      226743232 Sep 26 10:26 bayes_journal.old
-rw-rw-rw-    1 spamd    www       2482176 Sep 26 10:26 bayes_seen
-rw-rw-rw-    1 spamd    www      20951040 Sep 26 10:26 bayes_toks

-rw-rw-rw-    1 spamd    www            42 Oct  1 00:01 bayes.mutex
-rw-rw-rw-    1 root     www      91219612 Oct  1 00:02 bayes_journal
-rw-rw-rw-    1 spamd    www       2482176 Oct  1 00:02 bayes_seen
-rw-rw-rw-    1 spamd    www      20951040 Oct  1 00:02 bayes_toks

-rw-rw-rw-    1 spamd    www            42 Oct  5 13:01 bayes.mutex
-rw-rw-rw-    1 root     www      167110424 Oct  5 13:05 bayes_journal
-rw-rw-rw-    1 spamd    www       2482176 Oct  5 12:25 bayes_seen
-rw-rw-rw-    1 spamd    www      20951040 Oct  5 13:03 bayes_toks

A typical free output, as mentioned above:
             total       used       free     shared    buffers     cached
Mem:        514948     364916     150032          0      13400      51128
-/+ buffers/cache:     300388     214560
Swap:       514040     261756     252284

Then I also checked today whether there could be other problems, and there are.
Output of "sa-learn --dump magic":
0.000          0          3          0  non-token data: bayes db version
0.000          0      21714          0  non-token data: nspam
0.000          0       6950          0  non-token data: nham
0.000          0     784903          0  non-token data: ntokens
0.000          0 1052059392          0  non-token data: oldest atime
0.000          0 1096983312          0  non-token data: newest atime
0.000          0 1096983409          0  non-token data: last journal sync atime
0.000          0 1094773234          0  non-token data: last expiry atime
0.000          0   29754654          0  non-token data: last expire atime delta
0.000          0         36          0  non-token data: last expire reduction count

sa-learn --force-expire excerpt:
debug: bayes: expiry check keep size, 0.75 * max: 750000
debug: bayes: token count: 784903, final goal reduction size: 34903
debug: bayes: First pass?  Current: 1096983709, Last: 1094773234, atime: 
29754654, count: 36, newdelta: 30689, ratio: 969.527777777778, period: 43200
debug: bayes: Can't use estimation method for expiry, something fishy, 
calculating optimal atime delta (first pass)
debug: bayes: expiry max exponent: 9
debug: bayes: atime     token reduction
debug: bayes: ========  ===============
debug: bayes: 43200     781347
debug: bayes: 86400     776905
debug: bayes: 172800    771548
debug: bayes: 345600    764197
debug: bayes: 691200    751936
debug: bayes: 1382400   731870
debug: bayes: 2764800   689944
debug: bayes: 5529600   657689
debug: bayes: 11059200  649542
debug: bayes: 22118400  545825
debug: bayes: couldn't find a good delta atime, need more token difference, 
skipping expire.
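
To illustrate what I think is going on, here is my own reconstruction of the 
expiry arithmetic, plugging in the numbers from the debug output above. This is 
a sketch, not the actual SpamAssassin source; in particular the acceptance 
window for "a good delta atime" is my assumption:

```python
# Reconstruction (a sketch, not the actual SpamAssassin code) of the
# expiry logic, using the numbers from the debug output above.

token_count = 784903
keep_size   = 750000                   # 0.75 * max (from the debug output)
goal        = token_count - keep_size  # 34903 tokens need to go

# Estimation method: scale the previous delta by how far the previous
# reduction (count: 36) missed the current goal. A ratio this far from 1
# is "something fishy", so SA falls back to the first-pass search.
last_delta     = 29754654              # last expire atime delta
last_reduction = 36                    # last expire reduction count
ratio    = goal / last_reduction       # 969.527..., as in the debug output
newdelta = int(last_delta / ratio)     # 30689, as in the debug output

# First pass: try atime deltas of 43200 * 2^k for k = 0..9 (max exponent 9)
# and look for one whose token reduction roughly matches the goal. The
# reductions below are taken from the debug table; the acceptance window
# (between goal and 1.5 * goal) is my assumption.
reduction = {43200 * 2**k: r for k, r in enumerate(
    [781347, 776905, 771548, 764197, 751936,
     731870, 689944, 657689, 649542, 545825])}

chosen = None
for delta, removed in sorted(reduction.items()):
    if goal <= removed <= 1.5 * goal:
        chosen = delta
        break

# Even the largest delta (43200 * 2^9 = 22118400 s, about 256 days) would
# still remove 545825 tokens, ~15x the goal, so nothing fits and the
# expiry is skipped ("couldn't find a good delta atime").
print(chosen)  # → None
```

In other words: every candidate delta removes far more tokens than the goal of 
34903, so the search can never succeed on a token set this old.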

So, this problem is still there in SA 3. Of course, this could also be the 
cause of the timeouts, but then how do the big bayes_journal files come into 
play? Also, if the expiry caused the timeout, I would expect it to happen 
*each* time a message is processed, until the expiry succeeds (which it 
can't) - or not? So this might indeed be the core of the problem, or just a 
coincidence.
As for the expire problem, if I understand it correctly, it happens because the 
search stops after ten tries. Going back 22118400 seconds lands in January 2004, 
but apparently there are still about 500,000 tokens older than that (the oldest 
data is from May 2003). If I'm right, the problem only occurs because we have 
chosen to maintain a bayes db over a longer time, so that no expiry was done at 
all for a long time. Then, when the expiry is finally about to happen, it 
doesn't work because the tokens are spread over a wide time range. I checked 
some of our other dbs (all slightly under the expiry threshold) and found the 
same problem. When I reduced the expiry threshold from 750,000 to 100,000 on 
one of them, the expiry worked because the tenth try happened to fit. Of 
course, that shrunk it from 715,000 to 162,000 tokens. I still think this is a 
basic flaw in the way bayes expiry works (we already discussed this early this 
year on the users mailing list), because if I'm right, anyone who wants to keep 
a bigger db over a longer time period is going to run into this.
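
For what it's worth, the date arithmetic checks out against the dump-magic 
output above. A quick sanity check (my own, not SA code), using the atime 
values from the dump:

```python
from datetime import datetime, timezone

newest_atime = 1096983312       # "newest atime" from the dump-magic output
oldest_atime = 1052059392       # "oldest atime"
max_delta    = 43200 * 2 ** 9   # largest delta tried: 22118400 s

def utc(ts):
    """Convert a Unix timestamp to a UTC datetime."""
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# The furthest back the tenth pass can reach:
cutoff = newest_atime - max_delta
print(utc(cutoff).date())        # → 2004-01-23 (i.e. January 2004)
print(utc(oldest_atime).date())  # → 2003-05-04 (oldest data: May 2003)
# So roughly half a million tokens sit in the pre-January-2004 range
# that the doubling search can never reach.
```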



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
