[Bug 3872] SA 3.0 creates randomly extreme big bayes_journal

2004-10-09 Thread bugzilla-daemon
http://bugzilla.spamassassin.org/show_bug.cgi?id=3872





--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 07:19 ---
Subject: Re:  SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 12:02:32AM -0700, [EMAIL PROTECTED] wrote:
 I wonder if there is any relationship between creating random huge journals 
 and SA randomly growing suddenly to 250MB+.

BTW: for someone experiencing this issue, I'd be interested in getting a
copy (compressed of course!) of the extreme sized journal, and/or seeing
the output of ls -las, stat, etc.

There's only 1 way I can think of for a journal file to instantly
grow to a large size (in normal usage, it will grow at a rate relative
to the amount of mail being processed).  When the journal write occurs,
and a failure is detected, the code will truncate() the file back to
the original size before the write (see BayesStore/DBM::cleanup()).
The truncate() is accompanied by a warning:

bayes: partial write to bayes journal $path ($len of $nbytes), recovering

If this process somehow gets screwed up, the truncate could actually make
the file larger and create what is known as a sparse file.  i.e.: a bunch of
data, a bunch of nothing (the OS typically returns nulls when reading from
the section that doesn't actually exist on disk), and a bunch of data.

This behavior is new in 3.0.0; 2.6 would detect a partial write of journal
data, internally jump ahead to the part that wasn't written, and try again.
That could potentially lead to multiple writers clobbering each other, which
could still happen for a partial write, but at least the journal file should
be truncate()d back to a known-good state.
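As a rough illustration of that recovery pattern, here is a Python sketch (not the actual BayesStore/DBM::cleanup() code; the function name is invented): record the file size before the write, and on a partial write truncate back to it.

```python
import os

def append_journal_entry(path: str, entry: bytes) -> bool:
    """Append one entry to a journal file; on a partial write,
    truncate the file back to its pre-write size so it stays in a
    known-good state.  Illustrative sketch only, not SA code."""
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        orig_size = os.fstat(fd).st_size
        written = os.write(fd, entry)
        if written != len(entry):
            # Partial write detected: roll the file back, as in
            # "bayes: partial write to bayes journal ..., recovering"
            os.ftruncate(fd, orig_size)
            return False
        return True
    finally:
        os.close(fd)
```

The failure mode hypothesized above is exactly this kind of truncate() being handed a wrong (too large) size: instead of shrinking the file, it silently extends it into a sparse file.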

For examples of creating sparse files:

$ perl -e 'open(T, ">foo"); print T "hi\n"; truncate(T, 256*1048576); close(T);'
$ ls -las foo
   4 -rw-r--r--  1 tvd  wheel  268435456 Oct  9 09:47 foo
$ stat foo
  File: `foo'
  Size: 268435456   Blocks: 8  IO Block: 4096   Regular File
Device: 803h/2051d  Inode: 212997  Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1113/ tvd)   Gid: (   10/   wheel)
Access: 2004-10-09 09:47:55.0 -0400
Modify: 2004-10-09 09:47:55.0 -0400
Change: 2004-10-09 09:47:55.0 -0400

So the file only actually has a small number of blocks used on disk
to store "hi\n" (see below for comments about the actual size usage),
but the filesystem reports the file as 256MB.

In contrast:

$ perl -e 'open(T, ">foo"); print T "hi" x (128*1048576); close(T);'
$ ls -las foo
262404 -rw-r--r--  1 tvd  wheel  268435456 Oct  9 09:50 foo
$ stat foo
  File: `foo'
  Size: 268435456   Blocks: 524808 IO Block: 4096   Regular File
Device: 803h/2051d  Inode: 212997  Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1113/ tvd)   Gid: (   10/   wheel)
Access: 2004-10-09 09:47:55.0 -0400
Modify: 2004-10-09 09:50:52.0 -0400
Change: 2004-10-09 09:50:52.0 -0400

This file actually has 256MB of data in it.  Notice that the space actually
used on disk is much higher.

(the ls vs stat output may be a little confusing wrt blocks.  ls reports
1k blocks, stat reports 512 byte blocks, and the actual file system block
size (smallest amount of allocatable space in the FS) is 4096 bytes.
so in the truncate version, there is only 1 FS block allocated for the
file (4 x 1k == 1 x 4k), the bottom version has 65601 FS blocks allocated
(262404 x 1k blocks == 65601 x 4k blocks)).
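The unit arithmetic in that parenthetical can be checked directly; a quick Python sketch using the unit sizes stated above:

```python
# Unit sizes from the paragraph above: ls -las reports 1 KiB blocks,
# stat reports 512-byte blocks, and the filesystem allocates space
# in 4096-byte (4 KiB) blocks.
LS_BLOCK = 1024
STAT_BLOCK = 512
FS_BLOCK = 4096

def fs_blocks(ls_blocks: int) -> int:
    """Convert ls -las 1 KiB blocks into 4 KiB filesystem blocks."""
    return ls_blocks * LS_BLOCK // FS_BLOCK

# Sparse file: ls shows 4 blocks, i.e. a single 4 KiB FS block.
print(fs_blocks(4))        # 1
# Dense file: ls shows 262404 blocks, i.e. 65601 FS blocks.
print(fs_blocks(262404))   # 65601
# stat's 524808 half-KiB blocks cover the same space as ls's 262404 KiB blocks.
print(524808 * STAT_BLOCK == 262404 * LS_BLOCK)  # True
```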





--- You are receiving this mail because: ---
You are the assignee for the bug, or are watching the assignee.







--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 07:54 ---
Sorry, I should have mentioned that I have copies of most of the occurrences.
However, I don't know if the block usage remains the same when I make a cp -a
dir dir.new. I tar-gzipped the following dir; it's now 11 MB. Do you want to
have it? Shall I upload it to an FTP server, or provide it via ftp or scp?

Note that this is happening with SA 3.0 used by MailScanner; this is not a
spamd issue, at least as far as I can tell. I have a MailScanner+3.0 setup
on two machines (our own mailserver and its backup MX); it occurs only on the
main mailserver. Both are set up almost identically, but the backup MX gets
much less mail (about 10% of the main mailserver) and it's only spam. It
didn't happen with MailScanner+2.63.
(All other machines are milter+spamd 2.63, because I'm reluctant to try spamd
3.0 with the memory problems being reported. On one of the spamd machines
(supposedly the one with the most mail traffic) I sometimes see one or two
spamd processes accumulate massive amounts of memory (can be 1 GB), but this
happens very rarely, less than once per month. If it is not detected early
enough, it eventually crashes the machine. I don't know of a way to limit
this; AFAIK ulimit applies to logins, and that won't work for a daemon,
correct?)

Here's the output for ls and stat as mentioned above.
n8:/home/spamd/bayes.4 # ls -las
total 305248
     4 drwxrwsr-x   2 spamd www       4096 Oct  8 16:54 .
     4 drwxr-xr-x  14 spamd mail      4096 Oct  8 17:01 ..
     4 -rw-------   1 root  www         16 Oct  8 16:49 bayes.lock
     4 -rw-rw-rw-   1 spamd www         42 Oct  5 15:41 bayes.mutex
 46116 -rw-rw-rw-   1 root  www   47168872 Oct  8 16:54 bayes_journal
202004 -rw-rw-rw-   1 root  www  206639592 Oct  8 16:32 bayes_journal.old
  1544 -rw-rw-rw-   1 spamd www    2482176 Oct  8 16:04 bayes_seen
 16420 -rw-rw-rw-   1 spamd www   20951040 Oct  8 16:50 bayes_toks
     4 -rwxr-xr-x   1 spamd www       3499 Apr 12 18:14 create_bayes.pl
 39136 -rw-r--r--   1 root  www   40030104 Oct  5 16:25 dump.txt
     4 -rwx------   1 spamd www        943 Apr 12 19:34 stripbadtoken.pl
     4 -rwx------   1 spamd www        229 Apr 12 19:06 timetest.pl

n8:/home/spamd/bayes.4 # stat bayes_journal
  File: `bayes_journal'
  Size: 47168872    Blocks: 92232      IO Block: 4096   regular file
Device: 802h/2050d  Inode: 785771  Links: 1
Access: (0666/-rw-rw-rw-)  Uid: (0/root)   Gid: (8/ www)
Access: 2004-10-08 16:54:09.0 +0200
Modify: 2004-10-08 16:54:17.0 +0200
Change: 2004-10-08 17:01:07.0 +0200

n8:/home/spamd/bayes.4 # stat bayes_journal.old
  File: `bayes_journal.old'
  Size: 206639592   Blocks: 404008 IO Block: 4096   regular file
Device: 802h/2050d  Inode: 785775  Links: 1
Access: (0666/-rw-rw-rw-)  Uid: (0/root)   Gid: (8/ www)
Access: 2004-10-08 16:51:03.0 +0200
Modify: 2004-10-08 16:32:12.0 +0200
Change: 2004-10-08 17:01:46.0 +0200









--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 08:54 ---
Subject: Re:  SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 07:54:27AM -0700, [EMAIL PROTECTED] wrote:
 However, I don't know if the block usage remains the same when I make a cp -a
 dir dir.new. I tar-gzipped the following dir, it's now 11 MB. Do you want to
Hrm.  I believe the issue is whether cp understands sparse files.  The Linux
cp I have (it looks like you're using Linux) appears to support sparse files,
but it decides whether a file is sparse using a crude heuristic.  So to be
safe, I wouldn't trust cp.

 have it? Can I upload it to an FTP or shall I provide it to you via ftp or scp?

If you can make it available, I'll grab it from you.  I can make some ftp
space available if that's easier.

 crashes the machine eventually. I don't know of a way to limit this, AFAIK 
 ulimit applies to logins and this won't work for a daemon, correct?)

ulimit applies to processes and their children; logins are simply a shell
with child procs. ;)  (BTW: ulimit -c 0 is great for httpd and such, to
prevent core files being written upon crash...)
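A daemon can also apply these limits to itself at startup via setrlimit(2), no login shell required. A sketch using Python's resource module (values are illustrative):

```python
import resource

# Equivalent of `ulimit -c 0`: never write a core file on crash.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
print(resource.getrlimit(resource.RLIMIT_CORE))  # (0, 0)

# To cap a runaway process's memory (the spamd case described above),
# a daemon could likewise set RLIMIT_AS at startup, e.g.:
#     gig = 1024 * 1024 * 1024
#     resource.setrlimit(resource.RLIMIT_AS, (gig, gig))
# (left commented out so this sketch doesn't constrain the
# interpreter that happens to run it)
```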

  46116 -rw-rw-rw-  1 root  www   47168872 Oct  8 16:54 bayes_journal
 202004 -rw-rw-rw-  1 root  www  206639592 Oct  8 16:32 bayes_journal.old

   File: `bayes_journal'
   Size: 47168872    Blocks: 92232      IO Block: 4096   regular file
 
   File: `bayes_journal.old'
   Size: 206639592   Blocks: 404008 IO Block: 4096   regular file

Both of these seem to be non-sparse.  In a cp version of the .old file,
I'd look for a bunch of text, then a bunch of nulls, then potentially
a bunch more text.










--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 09:23 ---
Thanks for the hint. I had a look at the copied bayes_journal and
bayes_journal.old files. They seem to contain stuff the normal bayes_journal
doesn't have. For instance, the bayes_journal of size 47168872 contains over
800,000 lines. Some of them look like the stuff I see in a normal
bayes_journal (but not exactly), but 813090 of the lines contain an
sa_generated message-id. That's obviously what blows them up.

Examples:
normal bayes_journal:
a few dozen or hundred lines like this:
t 1097337643 32578a6c95

blown_up bayes_journal:
99% of lines like this:
m h [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
(it seems to start with thousands of these and eventually finishes with a mix 
of the following)

the last 300 lines of the file are like this:
(... indicates I skipped lines of the same type)
...
t 1097247243 9826be506b
n 0 1
c 0 1 1097247243 65cb22ebd8
...
c 0 1 1097247243 850959cf3e
m h [EMAIL PROTECTED]
n 0 1
c 0 1 1097247243 65cb22ebd8
...
c 0 1 1097247243 850959cf3e
m h [EMAIL PROTECTED]
EOF

Does that already give you enough information or shall I still provide the 
complete files?








--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 09:32 ---
In addition to my last comment (I only noticed this now): the vast majority
of those sa_generated lines seem to come in doubles. For instance, the file
starts with

m h [EMAIL PROTECTED]
m h [EMAIL PROTECTED]
m h [EMAIL PROTECTED]
m h [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
m s [EMAIL PROTECTED]
m s [EMAIL PROTECTED]

and seems to go on this way almost up to the end.








--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 10:28 ---
Subject: Re:  SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 09:32:01AM -0700, [EMAIL PROTECTED] wrote:
 m h [EMAIL PROTECTED]
 m h [EMAIL PROTECTED]
 m h [EMAIL PROTECTED]
 m h [EMAIL PROTECTED]
 m s [EMAIL PROTECTED]
 m s [EMAIL PROTECTED]
 m s [EMAIL PROTECTED]
 m s [EMAIL PROTECTED]
 
 and seems to go on this way almost up to the end.

Hrm.  Those are message learn commands (m).  Basically it says h/s for
ham/spam, and then the message-id; those are all sha1 hashes generated
from the headers.

There shouldn't be multiples together, and there should be token information
before it (IIRC: n # # (change ham/spam count), c # # token atime (learn
token ham/spam, atime), m h/s msgid (message learned, ham/spam, msgid to
avoid double learning, etc.), in that order...)

Having the same entry multiple times (total) is ok, btw.  sa-learn doesn't
know a message was learned already until the journal is synced, so the
same message can appear multiple times.
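Pulling the record types described in this exchange together, a journal can be tallied by its leading opcode. The layout below is reconstructed from this thread, not taken from the SpamAssassin source:

```python
from collections import Counter

def count_journal_records(lines):
    """Tally bayes_journal lines by their leading opcode.  Record
    layout as described in this thread (reconstructed, not from the
    SpamAssassin source):
      t <atime> <token>                 token atime update (scan)
      n <ham> <spam>                    ham/spam count change
      c <ham> <spam> <atime> <token>    token learned
      m <h|s> <msgid>                   message learned (dedup marker)
    """
    return Counter(line.split(None, 1)[0] for line in lines if line.strip())

sample = [
    "t 1097337643 32578a6c95",
    "n 0 1",
    "c 0 1 1097247243 65cb22ebd8",
    "c 0 1 1097247243 850959cf3e",
    "m h 0123abcd@sa_generated",   # msgid here is hypothetical
]
print(sorted(count_journal_records(sample).items()))
# [('c', 2), ('m', 1), ('n', 1), ('t', 1)]
```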










--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 10:38 ---
Subject: Re:  SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 09:23:55AM -0700, [EMAIL PROTECTED] wrote:
 lines. Some of them look like the stuff I see in a normal bayes_journal (but
 not exactly), but 813090 of the lines have an sa_generated. That's obviously
 what blows them to the sky.

that's normal.

 normal bayes_journal:
 a few dozen or hundred lines like this:
 t 1097337643 32578a6c95

those are token atime updates.  happens during scanning.

 blown_up bayes_journal:
 99% of lines like this:
 m h [EMAIL PROTECTED]
 m s [EMAIL PROTECTED]
 m s [EMAIL PROTECTED]
 (it seems to start with thousands of these and eventually finishes with a mix 
 of the following)

Hrm.

 the last 300 lines of the file are like this:
 (... indicates I skipped lines of the same type)
 ...
 t 1097247243 9826be506b
 n 0 1
 c 0 1 1097247243 65cb22ebd8
 ...
 c 0 1 1097247243 850959cf3e
 m h [EMAIL PROTECTED]

yeah, that's normal for learning to journal.

 Does that already give you enough information or shall I still provide the 
 complete files?

I'd like to get the file if I can.  If there are actually lines and no
nulls, it's not a truncate issue, so that's good.  I'd like to see what
they actually look like, though.










--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 11:39 ---
Subject: Re:  SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 11:10:07AM -0700, [EMAIL PROTECTED] wrote:
 n (1 line)
 c (many)
 m (1 line)
 t (many)
 here it should then start over with n.

well, the ncm (learn) and t (scan) are different operations, but
otherwise, yeah.

 file maybe took an hour or less. Could be a problem with MailScanner - SA 
 interaction, could it?

Possibly, it's hard to say.  This is part of my problem with bugs
reported when not using our provided code -- it's pretty hard for us to
debug/reproduce issues in code we didn't write, we don't use, etc.

I don't see how multiple m lines from the same message would go into the
journal without trying to learn the same message multiple times.










--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 14:56 ---
Subject: Re:  SA 3.0 creates randomly extreme big bayes_journal

On Sat, Oct 09, 2004 at 01:38:56PM -0400, Theo Van Dinter wrote:
 I'd like to get the file if I can.  If there's actually lines and no
 nulls, it's not a truncate issue, so that's good.  I'd like to see what
 they actually look like though.

Hrm.  I got a copy of an errored journal file:

   1096 c
3562112 m
      4 n
    266 t

so yeah, the problem is the 3.5 million m lines.  They do, by and large, all
look duplicated for some reason.  So the lines indicate ~1.8 million mails
learned, but there are no corresponding ham/spam count updates or tokens,
which is just wrong.  I don't think you learned 1.8 million mails anyway.

There is a ton of duplication.  Number of distinct msgids, and the repeat
count for each, respectively:

106  33120
 16   3312
  1    736
  4    184
  2     72
 10      8
  3      4
  2      2
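That two-column table (how many distinct msgids repeat how many times) can be reproduced from the msgid column of the `m` lines with a histogram of a histogram. A sketch with made-up msgids:

```python
from collections import Counter

def repeat_histogram(msgids):
    """Given the msgid column of every `m` journal line, return a map
    of repeat-count -> number of distinct msgids seen that many times
    (the shape of the two-column table above)."""
    per_msgid = Counter(msgids)         # msgid -> number of occurrences
    return Counter(per_msgid.values())  # occurrences -> distinct msgids

# Hypothetical msgids standing in for the real sha1-based ones:
ids = ["a"] * 4 + ["b"] * 4 + ["c"] * 2 + ["d"]
print(sorted(repeat_histogram(ids).items()))  # [(1, 1), (2, 1), (4, 2)]
```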

So something's up.  I haven't seen this issue in normal spamd-type usage, so
I'm tempted to blame MailScanner...

However, looking at the code, I found something that seems odd, and
could very well cause the issue.  In fact, allow me to go: OMG!

In 2.6x and 3.0, the sync_journal function (which takes the journal data and
updates the databases) calls seen_put (and seen_delete) to take the
message id and store it in the database.  BUT!  seen_put and seen_delete
check to see if learn_to_journal is set, and if so, defer the update
to the journal!

OMG OMG OMG!  I can even reproduce this in normal SA mode!

If you use sa-learn --sync to sync the journal, the problem doesn't exist,
for some reason.

If you let auto-sync occur during SA runs, the behavior happens due to the
reason above.  I got the behavior by setting:

bayes_learn_to_journal 1
bayes_journal_max_size 1

then shoving messages through causes the problem to occur, and it actually
keeps adding the same message over and over as well since seen is never
updated.

OMG!  So the easy solution is either to have special sync-time seen_put and
seen_delete functions, or to kluge the learn_to_journal setting around the
calls.  I think the first is the right solution.  Patch forthcoming; please
test it for me then. :)
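The deferral loop and the proposed fix can be modeled in a few lines of toy Python (invented names, not the actual SA code):

```python
class SeenStore:
    """Toy model of the bug pattern: the public seen_put defers to
    the journal when learn_to_journal is set, so a sync that calls
    it just re-journals every entry instead of updating the db."""

    def __init__(self, learn_to_journal):
        self.learn_to_journal = learn_to_journal
        self.journal = []   # pending (msgid, 'h'/'s') entries
        self.seen_db = {}   # msgid -> 'h'/'s'

    def seen_put(self, msgid, flag):
        # Public API: defer to the journal when configured to.
        if self.learn_to_journal:
            self.journal.append((msgid, flag))  # <-- the trap
        else:
            self.seen_put_direct(msgid, flag)

    def seen_put_direct(self, msgid, flag):
        # "Private" writer used by the sync path in the fix.
        self.seen_db[msgid] = flag

    def sync_journal_buggy(self):
        # Bug: calling seen_put bounces each entry back into the
        # journal; the db is never updated and the journal regrows.
        entries, self.journal = self.journal, []
        for msgid, flag in entries:
            self.seen_put(msgid, flag)

    def sync_journal_fixed(self):
        # Fix: write the db directly, bypassing the deferral check.
        entries, self.journal = self.journal, []
        for msgid, flag in entries:
            self.seen_put_direct(msgid, flag)

store = SeenStore(learn_to_journal=True)
store.seen_put("msg1", "h")
store.sync_journal_buggy()
print(len(store.journal), len(store.seen_db))  # 1 0  (entry bounced back)
store.sync_journal_fixed()
print(len(store.journal), len(store.seen_db))  # 0 1  (db updated, journal drained)
```

Because the entry bounces back on every sync while "seen" is never updated, the same message keeps being re-journaled, which matches the observed unbounded growth.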

I have no idea how this hasn't been seen before.  This code has been there
for ages; the m code was new in 2.6, so it's been over a year.  Geez!






[EMAIL PROTECTED] changed:

   What|Removed |Added

 CC||dev@spamassassin.apache.org
 AssignedTo|dev@spamassassin.apache.org |[EMAIL PROTECTED]
   Severity|normal  |major
   Priority|P5  |P3
   Target Milestone|Future  |3.0.1



--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 15:02 ---
taking the ticket, moving to 3.0.1 queue, etc.








--- Additional Comments From [EMAIL PROTECTED]  2004-10-09 15:11 ---
Created an attachment (id=2439)
 -- (http://bugzilla.spamassassin.org/attachment.cgi?id=2439&action=view)
suggested patch

I'm still stunned about this!  OMG!

Anyway, here's my patch.  It adds new seen_{put,delete}_direct functions that
are called directly from the sync_journal function.  The normal
seen_{put,delete} is still available as the public API and calls *_direct
appropriately; I consider *_direct a private API, usable only by the
sync_journal function.


